Converting to different formats
In this tutorial we'll cover how to convert files to different formats.
We'll use the 2024 New York City Yellow Taxi Trip Records dataset.
As we know from the previous tutorial, the data is stored in a Parquet file.
We'll convert it to CSV.
Import the sandbox object and load data into it
from shapelets.data import sandbox

playground = sandbox()
taxis_parquet = playground.from_parquet(
    rel_name="taxis",
    paths=["yellow*2024*.parquet"]
)
Convert to CSV
We'll write the data to a CSV file called taxis_2024.csv.
taxis_parquet.to_csv(base_path="taxis_2024.csv")
We can read the data back from the CSV file and check that it contains the same number of observations as the Parquet source:
# Read data from CSV
taxis_csv = playground.from_csv(
    rel_name="taxis_csv",
    paths=["taxis*.csv"]
)
# Check number of observations
print(f"There are {taxis_csv.row_count()} observations in the CSV file")
print(f"There are {taxis_parquet.row_count()} observations in the parquet file")
Output
There are 2964624 observations in the CSV file
There are 2964624 observations in the parquet file
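Matching row counts are a useful first check, but they do not show that the columns survived the conversion. The sketch below illustrates a slightly stronger round-trip check using only the Python standard library rather than shapelets. The helper `csv_summary`, the file name `sample.csv`, and the three-row sample are hypothetical and stand in for the taxi records.

```python
import csv
import tempfile
from pathlib import Path


def csv_summary(path):
    """Return (header, number_of_data_rows) for a CSV file with a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)               # first line holds the column names
        row_count = sum(1 for _ in reader)  # remaining lines are data rows
    return header, row_count


# Hypothetical miniature data set standing in for the taxi records.
columns = ["passenger_count", "trip_distance"]
rows = [[1, 2.5], [2, 0.8], [1, 4.1]]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.csv"

    # Write the sample out as CSV.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)

    # Read it back and summarise what was actually stored.
    header, row_count = csv_summary(path)

# The round trip should preserve both the column names and the row count.
assert header == columns
assert row_count == len(rows)
print(f"{row_count} rows, columns: {header}")
```

The same idea applies to the taxi data: compare the column names of the two sources in addition to their row counts.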
Convert to other formats
We can also convert RecordSet objects to data structures such as polars DataFrames.
Let us take the query from the previous tutorial as an example:
result.to_polars()
Output
avg(passenger_count) | time_day | time_hour |
---|---|---|
1.539031 | 1 | 0 |
1.580563 | 1 | 1 |
1.557949 | 1 | 2 |
1.512321 | 1 | 3 |
1.483468 | 1 | 4 |
... | ... | ... |
1.245092 | 31 | 19 |
1.250917 | 31 | 20 |
1.276941 | 31 | 21 |
1.309276 | 31 | 22 |
1.285014 | 31 | 23 |