Converting to different formats

In this tutorial we'll cover how to convert files to different formats.

We'll use the 2024 New York City Yellow Taxi Trip Records dataset.

As we saw in the previous tutorial, the data is stored in Parquet format.

We'll convert it to CSV.

Import the sandbox object and load data into it

from shapelets.data import sandbox

playground = sandbox()
taxis_parquet = playground.from_parquet(
    rel_name="taxis",
    paths=["yellow*2024*.parquet"]
)
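
The paths argument above takes a wildcard pattern. As a rough illustration of what such a pattern selects, the sketch below applies shell-style matching from the Python standard library to a handful of made-up file names. It assumes the sandbox resolves patterns with glob-like rules, which this tutorial does not state, and the file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, for illustration only
candidates = [
    "yellow_tripdata_2024-01.parquet",
    "yellow_tripdata_2023-12.parquet",
    "green_tripdata_2024-01.parquet",
    "yellow_tripdata_2024-02.parquet",
]

# Same pattern as in the from_parquet call above
pattern = "yellow*2024*.parquet"

# Keep only the names that match the pattern (case-sensitive, shell-style)
matched = [name for name in candidates if fnmatchcase(name, pattern)]
print(matched)
```

Only the two yellow-taxi files from 2024 are kept. The same reasoning applies to any wildcard pattern: every file that matches is included, so a broad pattern can pick up more files than intended.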

Convert to CSV

We'll generate a CSV file called taxis_2024.csv

taxis_parquet.to_csv(base_path="taxis_2024.csv")

We can read the data back from the CSV file and check that it has the same number of rows as the original:

# Read data from CSV
taxis_csv = playground.from_csv(
    rel_name="taxis_csv",
    paths=["taxis*.csv"]
)

# Check number of observations
print(f"There are {taxis_csv.row_count()} observations in the CSV file")
print(f"There are {taxis_parquet.row_count()} observations in the parquet file")

Output

There are 2964624 observations in the CSV file
There are 2964624 observations in the parquet file
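
As an independent cross-check that does not go through the sandbox, the rows of a CSV file can also be counted with the Python standard library. The sketch below is only an illustration: count_csv_rows is a hypothetical helper, not part of shapelets, and it assumes the file has a single header row. The demo runs on a small throwaway file so that it is self-contained.

```python
import csv
import os
import tempfile


def count_csv_rows(path, has_header=True):
    """Count the records in a CSV file, streaming it row by row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for _ in reader)


# Self-contained demo on a temporary file with one header and three rows
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["passenger_count", "trip_distance"])
        writer.writerows([[1, 2.5], [2, 0.8], [1, 7.1]])
    demo_rows = count_csv_rows(demo_path)

print(demo_rows)
```

If the export above produced a single file with a header at taxis_2024.csv (an assumption about how to_csv writes its output), calling count_csv_rows("taxis_2024.csv") should agree with the row counts reported by the sandbox.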

Convert to other formats

We can also convert RecordSet objects to in-memory data structures such as polars DataFrames. As an example, take result, the RecordSet returned by the query in the previous tutorial:

result.to_polars()

Output

avg(passenger_count)	time_day	time_hour
1.539031	1	0
1.580563	1	1
1.557949	1	2
1.512321	1	3
1.483468	1	4
...	...	...
1.245092	31	19
1.250917	31	20
1.276941	31	21
1.309276	31	22
1.285014	31	23
