Read and querying data
In this tutorial we'll cover how to process parquet files for Data Analysis.
We'll use the 2024 New York City Yellox Taxi Trip Records dataset.
The resulting file is called yellow_tripdata_2024-01.parquet
, but we don't need to write it manually!
Import the sandbox object and load data into it
from shapelets.data import sandbox
playground = sandbox()
taxis = playground.from_parquet(
rel_name="taxis",
paths=["yellow*2024*.parquet"]
)
print(taxis)
Output
Schema
├── VendorID - Int32
├── tpep_pickup_datetime - Timestamp
├── tpep_dropoff_datetime - Timestamp
├── passenger_count - Int64
├── trip_distance - Float64
├── RatecodeID - Int64
├── store_and_fwd_flag - String[*]
├── PULocationID - Int32
├── DOLocationID - Int32
├── payment_type - Int64
├── fare_amount - Float64
├── extra - Float64
├── mta_tax - Float64
├── tip_amount - Float64
├── tolls_amount - Float64
├── improvement_surcharge - Float64
├── total_amount - Float64
├── congestion_surcharge - Float64
└── Airport_fee - Float64
Let's check how many rows there are in the dataset:
print(f"There are {taxis.row_count()} observations")
Output
There are 2964624 observations
Execute queries
We can now query the data and perform calculations.
# Get average passengers by day and hour
avg_passengers = """
SELECT
AVG(passenger_count),
EXTRACT('day' FROM tpep_dropoff_datetime) AS time_day,
EXTRACT('hour' FROM tpep_dropoff_datetime) AS time_hour
FROM
taxis
GROUP BY
time_day,
time_hour
"""
result = playground.from_sql(query=avg_passengers).execute()