Good beans, bad beans

Consider the following set of sequences produced by a spectrograph:

../_images/tutorial_kshape-1.png

These 28 line plots correspond to the signatures generated by two different classes of coffee beans: Arabica and Robusta. The two varieties produce remarkably similar signatures, yet they differ in physical features, palate and, most importantly, price:

Variety    Palate           Physical Description        Price                     Chemistry
-------    --------------   -------------------------   -----------------------   ----------------------
Robusta    Woody, burnt     Paler, circular, smaller    Generally cheaper         25% more caffeine
Arabica    Sweet, acidity   Oval, generally larger      3 times more expensive    Lower caffeine content

Creating a model to discriminate between the two varieties is not a trivial problem, since there are almost no obvious distinguishing features.

To load this training set, use the coffee_train dataset:

>>> import shapelets.compute as sc
>>> from shapelets.data import load_dataset
>>> all_data = load_dataset('coffee_train')
>>> ds = all_data[:,1:].T
>>> labels = all_data[:,0].astype("uint32")
>>> ds.shape
(286, 28)
>>> labels.shape
(28, 1)

ds contains all series as columns, and labels assigns each column of ds to a concrete coffee class. The first 14 columns (indices 0 to 13) represent one type of coffee and the remaining columns are tagged with the other class. The data in ds has already been z-normalized.
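Assuming the Shapelets arrays convert to NumPy as shown later with np.array, the class split and the z-normalization can be verified with a quick sanity check (purely illustrative, not part of the original workflow):

>>> import numpy as np
>>> np.unique(np.array(labels))                   # the two class identifiers
>>> np.array(labels).ravel()[:14]                 # the first 14 series share one class
>>> col = np.array(ds)[:, 0]                      # any column, i.e. any single series
>>> float(col.mean()), float(col.std())           # approximately 0.0 and 1.0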

When dealing with classification problems in time series, it usually makes sense to start with a dendrogram to explore the distance relationships between the series. Euclidean distance is a good starting point, since it maps naturally to using K-Means as the classification algorithm for this problem.

../_images/coffee_a.png

Note

The dendogram_all_ts function is based on a great dendrogram library called dendrogram-ts, released under the MIT license and available through pip install dendrogram-ts.

Our version is based on that code base, but it has been optimized to work with precomputed linkage data and avoids using Pandas.
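If you prefer to reproduce a similar dendrogram without the Shapelets helper, a minimal sketch using SciPy's hierarchical clustering (assuming ds and labels as loaded above, and Euclidean distances) could look like this:

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage, dendrogram
>>> series = np.array(ds).T                                   # one row per series
>>> link = linkage(series, method='ward', metric='euclidean')
>>> _ = dendrogram(link, labels=list(np.array(labels).ravel()))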

When the series are so close to each other from a distance point of view, the performance of a K-Means-like algorithm is going to be determined by its ability to compute new centroids that maximize the differences between the series.

k-Shape 1 is a clustering algorithm specialized in time series that relies on an iterative refinement procedure, creating homogeneous and well-separated clusters. As its distance measure, k-Shape uses a normalized version of cross-correlation in order to take the shapes of the time series into account. On every iteration, k-Shape computes and refines the cluster centroids, creating new centroids that separate the groups more efficiently than other methods.
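As a reference, the shape-based distance derived from that normalized cross-correlation can be sketched in a few lines of NumPy; this is a simplified illustration of the definition given in the k-Shape paper, not Shapelets' optimized implementation:

>>> import numpy as np
>>> def sbd(x, y):
...     # 1 minus the maximum normalized cross-correlation over all shifts;
...     # x and y are 1-D NumPy arrays
...     ncc = np.correlate(x, y, mode='full') / (np.linalg.norm(x) * np.linalg.norm(y))
...     return 1.0 - ncc.max()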

The distance metric in k-Shape is called sbd, and it can be run directly in Shapelets to compute a distance matrix; using this distance, the previous dendrogram based on Euclidean distance is transformed into this one:

../_images/coffee_b.png

Although the distances are numerically different, there are no real differences in the groups implied by the dendrogram. This is critical for this scenario, since the performance of k-Shape relative to other algorithms, like K-Means, is going to depend on its ability to improve the centroids and maximize the separation between the target clusters.

To start training a classification model based on k-Shape:

>>> ks = sc.clustering.KShape(2, rnd_labels=False)
>>> ks.fit(ds, labels)

The instance ks can be used to plot and query the generated centroids:

>>> ks.plot_centroids()
../_images/coffee_c.png

With a trained model, we can now make predictions; the coffee_test dataset contains another set of coffee sequences to validate the model:

>>> test_data = load_dataset('coffee_test')
>>> ds_test = test_data[:,1:].T
>>> labels_test = test_data[:,0].astype("uint32")
>>> predictions = ks.predict(ds_test)

And finally, the model can be evaluated using prebuilt metrics from sklearn, starting with a measure of the accuracy during training:

>>> from sklearn import metrics
>>> import numpy as np
>>> metrics.accuracy_score(np.array(labels), np.array(ks.labels_))
1.0

A 100% accuracy during training is, at the very least, unsettling; it is possible to retrain the model, randomizing the initial labels by setting the parameter rnd_labels to True. k-Shape is not free of local-minima issues, and it may be necessary to run the algorithm multiple times to choose the best configuration.
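A possible retraining run, reusing the same calls shown above with randomized initial labels (the result may vary between executions):

>>> ks_rnd = sc.clustering.KShape(2, rnd_labels=True)
>>> ks_rnd.fit(ds, labels)
>>> metrics.accuracy_score(np.array(labels), np.array(ks_rnd.labels_))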

Let’s measure the prediction’s accuracy:

>>> metrics.accuracy_score(np.array(labels_test), np.array(predictions))
0.9642857142857143

And a confusion matrix:

>>> metrics.confusion_matrix(np.array(labels_test), np.array(predictions))
array([[14,  1],
       [ 0, 13]])
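Other prebuilt sklearn metrics can be applied in exactly the same way; for instance, a per-class summary (shown here only as an additional, optional check):

>>> print(metrics.classification_report(np.array(labels_test), np.array(predictions)))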

As you can see, the results obtained with k-Shape are quite satisfactory, especially considering how small the training set is and how subtle the differences between the two classes are.

References

1. k-Shape: Efficient and Accurate Clustering of Time Series. John Paparrizos and Luis Gravano. 2016. SIGMOD Rec. 45, 1 (June 2016), 69-76.