Skip to main content

First Steps with Data Access

In this post we show how to perform basic data operations with Data Access.

We'll use Amazon products data from Kaggle, which you can download here.

The original CSV file was renamed to products.csv for the sake of simplicity.

Exploratory Data Analysis (EDA) and transformations will be carried out to showcase some basic features of Data Access.

Firstly, we import the necessary module for Data Access:

from shapelets.data import sandbox

Then we read the data contained in the CSV file and check its schema:

# Read data and get schema
playground = sandbox()
amazon_data = playground.from_csv(
rel_name='amazon_products',
paths=['products.csv'],
delimiter=','
)
print(amazon_data)
Output
Schema
├── Uniq Id - String[*]
├── Product Name - String[*]
├── Brand Name - String[*]
├── Asin - String[*]
├── Category - String[*]
├── Upc Ean Code - String[*]
├── List Price - String[*]
├── Selling Price - String[*]
├── Quantity - String[*]
├── Model Number - String[*]
├── About Product - String[*]
...
└── Product Description - String[*]

We obtain the variables names and their datatype. This is what happened:

from_csv generates a Relation object, which is the fundamental object for managing data.

Note that rel_name refers to the name of the Relation object we want to create, which will be relevant to perform SQL queries. Think of rel_name as the table name from which we'll retrieve data.

paths is a list of files CSV files to be read and delimiter is used to specify how the data is separated.

info

Check the tutorials to get an in-depth intuition of each object here.

It is easy to check how many observations / samples there are in the dataset:

# Get total number of products
n_products = amazon_data.row_count()
print(f'There are {n_products} total products')
There are 10002 total products

row_count() is a method of Relation objects that outputs the total number of rows of each relation.

Another useful method is sample(), which generates a random sample of the data to get some initial insights about its structure:

# Draw a sample of product names
product_sample = amazon_data.sample(method=1, amount=0.04, units=0, seed=100) \
.with_columns(columns=[f'"Product Name"'])
print(product_sample)
Output
Schema
└── Product Name - String[*]

This is not particularly intuitive, is it? We can activate the interactive mode of the sanbox environment to retrieve human-readable outputs:

# Human readable sample output
playground.interactive = True
print(product_sample)

Output

Product Name
Chill Sack Bean Bag Chair: Large 4' Memory Foam Furniture Bag and Large Lounger - Big Sofa with Soft Micro
Fiber Cover - Royal Blue
ParkZone 18A Brushless ESC Vehicle Part
RAVE Sports 00002 Dock Slide
Losi Rear Bumper Pack: TEN-SCTE, LOSB2417

Notice how we activated the interactive mode of our sandbox environment and got better looking results.

We can retrieve the number of distinct products and descriptions with the distinct() method:

# Get number of distinct products
n_distinct = amazon_data.distinct(columns=f'"Product Name"').row_count()
print(f'There are {n_distinct} distinct products')

# Get number of distinct descriptions
n_distinct_descs = amazon_data.distinct(columns=f'"About Product"').row_count()
print(f'There are {n_distinct_descs} distinct descriptions')
Output
There are 9930 distinct products
There are 9037 distinct descriptions

It's easy to chain method calls to produce the output above.

A SQL query can reproduce the same result:

# Replicate results with SQL
distinct_query = """
SELECT
COUNT(DISTINCT "Product Name") AS n_distinct_products,
COUNT(DISTINCT "About Product") AS n_distinct_descriptions
FROM
amazon_products
"""
query_results = playground.execute(query=distinct_query)
print(query_results)
<shapelets.native.__shapelets_core_ext.RecordSet object at 0x7f92e8edd1f0>

The query result is a RecordSet object, from which we can extract the data in several formats:

# Python List
print(query_results.to_list())
Output
[{'n_distinct_products': 9930, 'n_distinct_descriptions': 9036}]
# Python Dictionary
print(query_results.to_dict())
Output
{'n_distinct_products': [9930], 'n_distinct_descriptions': [9036]}
# Pandas DataFrame
print(query_results.to_pandas())
Output
   n_distinct_products  n_distinct_descriptions
0 9930 9036
# Polars DataFrame
print(query_results.to_polars())

Output

n_distinct_productsn_distinct_descriptions
99309036
note

We obtained 9036 rather than 9037 distinct descriptions as distinct() does not rule out empty values.

Since there are missing descriptions, let's find out how many of them we have to filter out:

# Retrieve count of products with no description
products_without_desc = """
SELECT
COUNT("Product Name")
FROM
amazon_products
WHERE
"About Product" IS NULL
"""
print(playground.execute(query=products_without_desc).to_list())
Output
[{'count("Product Name")': 273}]

We proceed to eliminate them from our dataset:

# Filter empty products
filter_query = """
SELECT
"Product Name" AS product,
"About Product" AS description
FROM
amazon_products
WHERE
"About Product" IS NOT NULL
"""
amazon_final_data = playground.from_sql(
query=filter_query
)

# Display two products
print(amazon_final_data.limit(rows=1))

Output

ProductDescription
Electronic Snap Circuits Mini Kits Classpack, FM Radio, Motion Detector, Music Box (Set of 5)

Make sure this fits by entering your model number. Snap circuits mini kits classpack provides basic electronic circuitry activities for students in grades 2-6.


Includes 5 separate mini building kits- an FM radio, a motion detector, music box, space battle sound effects, and a flying saucer.


Each kit includes separate components and instructions to build.


Each component represents one function in a circuit; components snap together to create working models of everyday electronic devices.


Activity guide provides additional projects to teach students how circuitry works.

Notice how we created a new Relation object with from_sql(). A SQL query can produce a new dataset to work with.