Inspect a DataSet
This section covers the different ways in which you can inspect the contents of a DataSet.
We start by loading a well-known dataset, the iris dataset, using the load_test_data() function.
>>> import shapelets as sh
>>> session = sh.sandbox()
>>> data = session.load_test_data()
Note
Remember to create a Shapelets session first in order to work with the Shapelets API.
Contents of a DataSet
If you have a Shapelets DataSet, you can view its contents with the following functions.
head(n=5) shows the first n rows (5 by default).
>>> data.head()
Sepal_Length Sepal_Width Petal_Length Petal_Width Class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
tail(n=5) shows the last n rows (5 by default). Beware that this method causes the DataSet to materialize, so it can take extra time to run, depending on the size of the DataSet.
>>> data.tail()
Sepal_Length Sepal_Width Petal_Length Petal_Width Class
0 6.7 3.0 5.2 2.3 Iris-virginica
1 6.3 2.5 5.0 1.9 Iris-virginica
2 6.5 3.0 5.2 2.0 Iris-virginica
3 6.2 3.4 5.4 2.3 Iris-virginica
4 5.9 3.0 5.1 1.8 Iris-virginica
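The warning about tail() can be illustrated with a plain-Python analogy. The sketch below is not Shapelets code and makes no claim about how the library works internally; rows(), head() and tail() are hypothetical stand-ins defined here only for illustration. It shows that the first n items of a lazy stream can be read without touching the rest, whereas the last n items require consuming the whole stream.

```python
from collections import deque
from itertools import islice

consumed = 0  # counts how many rows the lazy source has produced


def rows(total):
    """Hypothetical lazy row source: yields row indices one at a time."""
    global consumed
    for i in range(total):
        consumed += 1
        yield i


def head(stream, n=5):
    """Take the first n items; stops pulling from the stream after n."""
    return list(islice(stream, n))


def tail(stream, n=5):
    """Keep only the last n items; has to read the stream to the end."""
    return list(deque(stream, maxlen=n))


first = head(rows(150))
read_for_head = consumed

consumed = 0  # reset the counter before the second experiment
last = tail(rows(150))
read_for_tail = consumed

print(first, read_for_head)
print(last, read_for_tail)
```

Comparing read_for_head with read_for_tail shows the difference in work: the first stays at n, the second grows with the total number of rows.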
If you want to get a sample of a DataSet, call the sample() function.
>>> data.sample()
DataSet description
If you want to know the number of rows in a DataSet you can use Python’s len() function to find it out.
>>> len(data)
150
If you want to access the names of the columns in a DataSet, use its columns attribute.
>>> data.columns
['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']
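As a minimal sketch, the two lookups above can be combined into a small helper. quick_overview() is a hypothetical function, not part of Shapelets, and it relies only on len() and the columns attribute shown above. StandInDataSet is likewise a hypothetical stand-in, included only so the example runs without a Shapelets session.

```python
def quick_overview(dataset):
    """Collect the row count and column names of a DataSet-like object."""
    return {"rows": len(dataset), "columns": list(dataset.columns)}


class StandInDataSet:
    """Hypothetical stand-in so the sketch runs without Shapelets installed."""

    def __init__(self, columns, n_rows):
        self.columns = columns
        self._n_rows = n_rows

    def __len__(self):
        return self._n_rows


iris_like = StandInDataSet(
    ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Class"],
    150,
)
overview = quick_overview(iris_like)
print(overview)
```

With a real session, the same helper would presumably be called as quick_overview(data).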
If you want to know the shape of a DataSet you can use the shape attribute.
>>> data.shape
(150, 5)
You can check the structure of a DataSet, including its columns and their data types, by evaluating the DataSet object itself in an interactive session.
>>> data
Column NumPy Type SQL Type
Sepal_Length float64 DOUBLE
Sepal_Width float64 DOUBLE
Petal_Length float64 DOUBLE
Petal_Width float64 DOUBLE
Class object VARCHAR
Summaries
If you want a statistical summary of the DataSet, call describe():
>>> data.describe()
column_name column_type min max approx_unique avg std q25 q50 q75 count null_percentage
0 Sepal_Length DOUBLE 4.3 7.9 35 5.843333333333335 0.8280661279778637 5.1 5.8 6.4 150 0.0%
1 Sepal_Width DOUBLE 2.0 4.4 23 3.0540000000000007 0.43359431136217375 2.8 3.0 3.3124999999999996 150 0.0%
2 Petal_Length DOUBLE 1.0 6.9 41 3.7586666666666693 1.764420419952262 1.5750000000000002 4.35 5.1 150 0.0%
3 Petal_Width DOUBLE 0.1 2.5 22 1.1986666666666672 0.7631607417008414 0.3 1.3 1.8 150 0.0%
4 Class VARCHAR Iris-setosa Iris-virginica 3 NaN NaN NaN NaN NaN 150 0.0%
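To make the numeric columns of this summary more concrete, here is a minimal sketch that computes the same kinds of statistics with Python's standard statistics module. It is an illustration only, not the method Shapelets uses: summarize() is a hypothetical helper, the input is just the five Sepal_Length values shown by head() earlier, and the quantile method chosen here is an assumption that may differ from the one behind describe().

```python
import statistics


def summarize(values):
    """Return a few summary statistics for a list of numbers."""
    # "inclusive" treats the data as the full population range;
    # this choice is an assumption made for the sketch.
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "avg": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "q25": q25,
        "q50": q50,
        "q75": q75,
        "count": len(values),
    }


# First five Sepal_Length values from the head() output.
sepal_length_head = [5.1, 4.9, 4.7, 4.6, 5.0]
summary = summarize(sepal_length_head)

for name, value in summary.items():
    print(name, value)
```

Because only five rows are used, the numbers will not match the describe() output above, which covers all 150 rows.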