Converting to different formats

In this tutorial we'll cover how to convert files to different formats.

We'll use the 2024 New York City Yellow Taxi Trip Records dataset.

As we saw in the previous tutorial, the data is stored in Parquet files.

We'll convert it to CSV.

Import the sandbox object and load data into it

from shapelets.data import sandbox

playground = sandbox()
taxis_parquet = playground.from_parquet(
    rel_name="taxis",
    paths=["yellow*2024*.parquet"]
)

Convert to CSV

We'll generate a CSV file called taxis_2024.csv:

taxis_parquet.to_csv(base_path="taxis_2024.csv")

We can read the CSV file back in and check that it holds the same number of observations as the original:

# Read data from CSV
taxis_csv = playground.from_csv(
    rel_name="taxis_csv",
    paths=["taxis*.csv"]
)

# Check number of observations
print(f"There are {taxis_csv.row_count()} observations in the CSV file")
print(f"There are {taxis_parquet.row_count()} observations in the parquet file")

Output

There are 2964624 observations in the CSV file
There are 2964624 observations in the parquet file

Convert to other formats

We can also convert RecordSet objects to in-memory data structures such as polars DataFrames. As an example, take result, the output of the query from the previous tutorial:

result.to_polars()

Output

avg(passenger_count)  time_day  time_hour
1.539031              1         0
1.580563              1         1
1.557949              1         2
1.512321              1         3
1.483468              1         4
...                   ...       ...
1.245092              31        19
1.250917              31        20
1.276941              31        21
1.309276              31        22
1.285014              31        23