Inspect a DataSet

This section covers the different ways in which you can inspect the contents of a DataSet.

We start by loading a well-known dataset, the iris dataset, using the load_test_data() function.

>>> import shapelets as sh
>>> session = sh.sandbox()
>>> data = session.load_test_data()

Note

Remember to create a Shapelets session first; it is required to work with the Shapelets API.

Contents of a DataSet

If you have a Shapelets DataSet, you can view its contents using:

  • head(n=5) function to see the first n rows.

>>> data.head()
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

  • tail(n=5) function to see the last n rows. Beware that this method causes the DataSet to materialize and may take extra time to run, depending on the size of the DataSet.

>>> data.tail()
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width           Class
0           6.7          3.0           5.2          2.3  Iris-virginica
1           6.3          2.5           5.0          1.9  Iris-virginica
2           6.5          3.0           5.2          2.0  Iris-virginica
3           6.2          3.4           5.4          2.3  Iris-virginica
4           5.9          3.0           5.1          1.8  Iris-virginica
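
The cost difference between head() and tail() can be pictured with plain Python iterators. The sketch below is an analogy only and does not reflect how Shapelets is implemented: rows(), first_n() and last_n() are hypothetical names, and a generator stands in for a lazily evaluated DataSet. Reading the first rows stops early, whereas reading the last rows forces every row to be produced.

```python
from collections import deque
from itertools import islice


def rows(total, counter):
    """Hypothetical lazy row source; counter records how many rows were produced."""
    for i in range(total):
        counter["produced"] += 1
        yield i


def first_n(source, n=5):
    # Stops pulling rows as soon as n of them have been read.
    return list(islice(source, n))


def last_n(source, n=5):
    # Has to consume every row; the bounded deque keeps only the final n.
    return list(deque(source, maxlen=n))


head_counter = {"produced": 0}
head_rows = first_n(rows(150, head_counter))

tail_counter = {"produced": 0}
tail_rows = last_n(rows(150, tail_counter))

print(head_rows, head_counter["produced"])  # 5 rows produced
print(tail_rows, tail_counter["produced"])  # all 150 rows produced
```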

If you want to get a sample of a DataSet, call the sample() function:

>>> data.sample()

DataSet description

To get the number of rows in a DataSet, use Python’s len() function:

>>> len(data)
150

To access the names of the columns in a DataSet, use its columns attribute:

>>> data.columns
['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']

To get the shape of a DataSet, use its shape attribute:

>>> data.shape
(150, 5)
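
The row count and column names shown above can be combined into a one-line overview. The following is a minimal sketch that relies only on len() and the columns attribute; StandInDataSet and overview() are hypothetical names introduced here so that the example runs without a Shapelets session, and they are not part of the Shapelets API.

```python
class StandInDataSet:
    """Hypothetical stand-in exposing only len() and .columns."""

    def __init__(self, columns, n_rows):
        self.columns = list(columns)
        self._n_rows = n_rows

    def __len__(self):
        return self._n_rows


def overview(ds):
    # Uses nothing beyond len(ds) and ds.columns.
    n_rows, n_cols = len(ds), len(ds.columns)
    return f"{n_rows} rows x {n_cols} columns: {', '.join(ds.columns)}"


iris_like = StandInDataSet(
    ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Class"],
    150,
)
print(overview(iris_like))
```

For the iris example, the two numbers in the overview match the (150, 5) shape reported above. Assuming a real DataSet supports len() and columns as described, the same helper should accept it in place of the stand-in.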

You can check the structure of a DataSet, including its columns and their data types, by evaluating the DataSet object in an interactive session.

>>> data
Column        NumPy Type    SQL Type
Sepal_Length  float64       DOUBLE
Sepal_Width   float64       DOUBLE
Petal_Length  float64       DOUBLE
Petal_Width   float64       DOUBLE
Class         object        VARCHAR

Summaries

If you want a statistical summary of the DataSet, you can get it by calling describe():

>>> data.describe()
    column_name column_type          min             max approx_unique                 avg                  std                 q25   q50                 q75  count null_percentage
0  Sepal_Length      DOUBLE          4.3             7.9            35   5.843333333333335   0.8280661279778637                 5.1   5.8                 6.4    150            0.0%
1   Sepal_Width      DOUBLE          2.0             4.4            23  3.0540000000000007  0.43359431136217375                 2.8   3.0  3.3124999999999996    150            0.0%
2  Petal_Length      DOUBLE          1.0             6.9            41  3.7586666666666693    1.764420419952262  1.5750000000000002  4.35                 5.1    150            0.0%
3   Petal_Width      DOUBLE          0.1             2.5            22  1.1986666666666672   0.7631607417008414                 0.3   1.3                 1.8    150            0.0%
4         Class     VARCHAR  Iris-setosa  Iris-virginica             3                 NaN                  NaN                 NaN   NaN                 NaN    150            0.0%
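
To make the columns of this summary more concrete, the sketch below computes comparable statistics for a single numeric column with Python's standard library. It is an illustration under assumptions, not the method describe() uses: summarize_column() is a hypothetical helper, std is taken to be the sample standard deviation, count is taken to be the total number of rows, and the quartiles use the default method of statistics.quantiles, so they may differ from the values Shapelets reports. The input is the five Sepal_Length values shown by head().

```python
import statistics


def summarize_column(values):
    """Rough stand-in for one row of a describe()-style summary."""
    # Assumption: missing values are represented as None.
    present = [v for v in values if v is not None]
    null_pct = 100.0 * (len(values) - len(present)) / len(values)
    q25, q50, q75 = statistics.quantiles(present, n=4)
    return {
        "min": min(present),
        "max": max(present),
        "unique": len(set(present)),         # exact count, not an approximation
        "avg": statistics.fmean(present),
        "std": statistics.stdev(present),    # sample standard deviation
        "q25": q25,
        "q50": q50,
        "q75": q75,
        "count": len(values),                # assumption: total number of rows
        "null_percentage": f"{null_pct:.1f}%",
    }


sepal_length_head = [5.1, 4.9, 4.7, 4.6, 5.0]
summary = summarize_column(sepal_length_head)
for name, value in summary.items():
    print(f"{name:>16}: {value}")
```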