Skip to main content

Read and querying data

In this tutorial we'll cover how to process parquet files for Data Analysis.

We'll use the 2024 New York City Yellox Taxi Trip Records dataset.

The resulting file is called yellow_tripdata_2024-01.parquet, but we don't need to write it manually!

Import the sandbox object and load data into it

from shapelets.data import sandbox

playground = sandbox()
taxis = playground.from_parquet(
rel_name="taxis",
paths=["yellow*2024*.parquet"]
)
print(taxis)
Output
Schema
├── VendorID - Int32
├── tpep_pickup_datetime - Timestamp
├── tpep_dropoff_datetime - Timestamp
├── passenger_count - Int64
├── trip_distance - Float64
├── RatecodeID - Int64
├── store_and_fwd_flag - String[*]
├── PULocationID - Int32
├── DOLocationID - Int32
├── payment_type - Int64
├── fare_amount - Float64
├── extra - Float64
├── mta_tax - Float64
├── tip_amount - Float64
├── tolls_amount - Float64
├── improvement_surcharge - Float64
├── total_amount - Float64
├── congestion_surcharge - Float64
└── Airport_fee - Float64

Let's check how many rows there are in the dataset:

print(f"There are {taxis.row_count()} observations")
Output
There are 2964624 observations

Execute queries

We can now query the data and perform calculations.

# Get average passengers by day and hour
avg_passengers = """
SELECT
AVG(passenger_count),
EXTRACT('day' FROM tpep_dropoff_datetime) AS time_day,
EXTRACT('hour' FROM tpep_dropoff_datetime) AS time_hour
FROM
taxis
GROUP BY
time_day,
time_hour
"""
result = playground.from_sql(query=avg_passengers).execute()