Read and querying data

In this tutorial we'll cover how to process parquet files for Data Analysis.

We'll use the 2024 New York City Yellox Taxi Trip Records dataset.

The resulting file is called yellow_tripdata_2024-01.parquet, but we don't need to write it manually!

Import the sandbox object and load data into it

from shapelets.data import sandbox

playground = sandbox()
taxis = playground.from_parquet(
    rel_name="taxis",
    paths=["yellow*2024*.parquet"]
)
print(taxis)

Output
Schema
├── VendorID - Int32
├── tpep_pickup_datetime - Timestamp
├── tpep_dropoff_datetime - Timestamp
├── passenger_count - Int64
├── trip_distance - Float64
├── RatecodeID - Int64
├── store_and_fwd_flag - String[*]
├── PULocationID - Int32
├── DOLocationID - Int32
├── payment_type - Int64
├── fare_amount - Float64
├── extra - Float64
├── mta_tax - Float64
├── tip_amount - Float64
├── tolls_amount - Float64
├── improvement_surcharge - Float64
├── total_amount - Float64
├── congestion_surcharge - Float64
└── Airport_fee - Float64

Let's check how many rows there are in the dataset:

print(f"There are {taxis.row_count()} observations")

Output

There are 2964624 observations

Execute queries

We can now query the data and perform calculations.

# Get average passengers by day and hour
avg_passengers = """
SELECT
    AVG(passenger_count),
    EXTRACT('day' FROM tpep_dropoff_datetime) AS time_day,
    EXTRACT('hour' FROM tpep_dropoff_datetime) AS time_hour
FROM
    taxis
GROUP BY
    time_day,
    time_hour
"""
result = playground.from_sql(query=avg_passengers).execute()

Read and querying data

Import the sandbox object and load data into it​

Execute queries​

Import the sandbox object and load data into it

Execute queries