First Steps with Data Access
In this post we show how to perform basic data operations with Data Access.
We'll use Amazon products data from Kaggle, which you can download here.
The original CSV file was renamed to products.csv
for the sake of simplicity.
Exploratory Data Analysis (EDA) and transformations will be carried out to showcase some basic features of Data Access.
Firstly, we import the necessary module for Data Access:
from shapelets.data import sandbox
Then we read the data contained in the CSV file and check its schema:
# Read data and get schema
playground = sandbox()
amazon_data = playground.from_csv(
rel_name='amazon_products',
paths=['products.csv'],
delimiter=','
)
print(amazon_data)
Schema
├── Uniq Id - String[*]
├── Product Name - String[*]
├── Brand Name - String[*]
├── Asin - String[*]
├── Category - String[*]
├── Upc Ean Code - String[*]
├── List Price - String[*]
├── Selling Price - String[*]
├── Quantity - String[*]
├── Model Number - String[*]
├── About Product - String[*]
...
└── Product Description - String[*]
We obtain the variables names and their datatype. This is what happened:
from_csv
generates a Relation
object, which is the fundamental object for managing data.
Note that rel_name
refers to the name of the Relation
object we want to create,
which will be relevant to perform SQL queries.
Think of rel_name
as the table name from which we'll retrieve data.
paths
is a list of files CSV files to be read and delimiter
is used to specify
how the data is separated.
Check the tutorials to get an in-depth intuition of each object here.
It is easy to check how many observations / samples there are in the dataset:
# Get total number of products
n_products = amazon_data.row_count()
print(f'There are {n_products} total products')
There are 10002 total products
row_count()
is a method of Relation
objects that outputs the total number of
rows of each relation.
Another useful method is sample()
, which generates a random sample of the data to get
some initial insights about its structure:
# Draw a sample of product names
product_sample = amazon_data.sample(method=1, amount=0.04, units=0, seed=100) \
.with_columns(columns=[f'"Product Name"'])
print(product_sample)
Schema
└── Product Name - String[*]
This is not particularly intuitive, is it? We can activate the interactive mode of the sanbox environment to retrieve human-readable outputs:
# Human readable sample output
playground.interactive = True
print(product_sample)
Output
Product Name |
---|
Chill Sack Bean Bag Chair: Large 4' Memory Foam Furniture Bag and Large Lounger - Big Sofa with Soft Micro |
Fiber Cover - Royal Blue |
ParkZone 18A Brushless ESC Vehicle Part |
RAVE Sports 00002 Dock Slide |
Losi Rear Bumper Pack: TEN-SCTE, LOSB2417 |
Notice how we activated the interactive mode of our sandbox environment and got better looking results.
We can retrieve the number of distinct products and descriptions with the distinct()
method:
# Get number of distinct products
n_distinct = amazon_data.distinct(columns=f'"Product Name"').row_count()
print(f'There are {n_distinct} distinct products')
# Get number of distinct descriptions
n_distinct_descs = amazon_data.distinct(columns=f'"About Product"').row_count()
print(f'There are {n_distinct_descs} distinct descriptions')
There are 9930 distinct products
There are 9037 distinct descriptions
It's easy to chain method calls to produce the output above.
A SQL query can reproduce the same result:
# Replicate results with SQL
distinct_query = """
SELECT
COUNT(DISTINCT "Product Name") AS n_distinct_products,
COUNT(DISTINCT "About Product") AS n_distinct_descriptions
FROM
amazon_products
"""
query_results = playground.execute(query=distinct_query)
print(query_results)
<shapelets.native.__shapelets_core_ext.RecordSet object at 0x7f92e8edd1f0>
The query result is a RecordSet
object, from which we can extract the data in
several formats:
# Python List
print(query_results.to_list())
[{'n_distinct_products': 9930, 'n_distinct_descriptions': 9036}]
# Python Dictionary
print(query_results.to_dict())
{'n_distinct_products': [9930], 'n_distinct_descriptions': [9036]}
# Pandas DataFrame
print(query_results.to_pandas())
n_distinct_products n_distinct_descriptions
0 9930 9036
# Polars DataFrame
print(query_results.to_polars())
Output
n_distinct_products | n_distinct_descriptions |
---|---|
9930 | 9036 |
We obtained 9036
rather than 9037
distinct descriptions as distinct()
does not rule out empty values.
Since there are missing descriptions, let's find out how many of them we have to filter out:
# Retrieve count of products with no description
products_without_desc = """
SELECT
COUNT("Product Name")
FROM
amazon_products
WHERE
"About Product" IS NULL
"""
print(playground.execute(query=products_without_desc).to_list())
[{'count("Product Name")': 273}]
We proceed to eliminate them from our dataset:
# Filter empty products
filter_query = """
SELECT
"Product Name" AS product,
"About Product" AS description
FROM
amazon_products
WHERE
"About Product" IS NOT NULL
"""
amazon_final_data = playground.from_sql(
query=filter_query
)
# Display two products
print(amazon_final_data.limit(rows=1))
Output
Product | Description |
---|---|
Electronic Snap Circuits Mini Kits Classpack, FM Radio, Motion Detector, Music Box (Set of 5) | Make sure this fits by entering your model number. Snap circuits mini kits classpack provides basic electronic circuitry activities for students in grades 2-6. Includes 5 separate mini building kits- an FM radio, a motion detector, music box, space battle sound effects, and a flying saucer. Each kit includes separate components and instructions to build. Each component represents one function in a circuit; components snap together to create working models of everyday electronic devices. Activity guide provides additional projects to teach students how circuitry works. |
Notice how we created a new Relation
object with from_sql()
.
A SQL query can produce a new dataset to work with.