Good beans, bad beans¶
Consider the following set of sequences produced by a spectrograph:
These 28 line plots correspond to the signatures generated by two different classes of coffee beans: Arabica and Robusta. These two coffee bean varieties are remarkably similar under their signatures but, still, they show different physical features, palate and, most importantly, huge price differences:
Paler, circular, smaller
25% more caffeine
Oval, generally larger
3 times more expensive
Lower caffeine content
To create a model to discriminate between the two varieties it is not a trivial problem since there almost no obvious significant features.
To load this training set, use the
>>> import shapelets.compute as sc >>> from shapelets.data import load_dataset >>> all_data = load_dataset('coffee_train') >>> ds = all_data[:,1:].T >>> labels = all_data[:,0].astype("uint32") >>> ds.shape (286, 28) >>> labels.shape (28, 1)
ds contains all series as columns and
labels identifies each column series in
ds with a concrete coffee class.
The first 14 columns represent one type of coffee (indices from 0 to 13) and the rest of columns have been tagged with
the other class. Data in
ds has been already z-normalized.
When dealing with classification problems in time series, it usually makes sense to start with a dendogram to explore
the distances relationships between series. Using
euclidean distance is a good start, since it maps naturally to
a possible usage of K-Means as classification algorithm for this problem.
import matplotlib.pyplot as plt import shapelets.compute as sc from shapelets.data import load_dataset all_data = load_dataset('coffee_train') ds = all_data[:,1:].T plt.style.use('seaborn-whitegrid') plt.figure(figsize=(16,8)) sc.distances.dendogram_all_ts(ds, 'single', 'euclidean', ts_hspace=5) plt.tight_layout()
dendogram_all_ts function is based on a great dendogram library called
released under MIT license and available through
pip install dendogram-ts.
Our version is based on that code base, but optimized to work with precomputed linkage data and avoids using Pandas.
When series are so close to each other from a distance point of view, the performance of a K-Means like algorithm is going to be determined by its ability to compute new centroids to maximize the possible differences between series.
k-Shape 1 is a clustering algorithm specialized in time series that relies on a iterative refinement algorithm, creating homogeneous and well-separated clusters. As distance measure, k-Shape uses a normalized version of the cross-correlation measure in order to consider the shapes of time series. k-Shape incorporates a method to compute and refine cluster centroids on every iteration of the algorithm, creating new centroids that separate the groups more efficiently than other methods.
The distance metric in k-Shape is called
sbd, and it can be run directly in Shapelets to compute a
distance matrix; using this distance, the previous dendogram based on euclidean distance gets transformed
into this one:
import matplotlib.pyplot as plt import shapelets.compute as sc from shapelets.data import load_dataset all_data = load_dataset('coffee_train') ds = all_data[:,1:].T plt.style.use('seaborn-whitegrid') plt.figure(figsize=(16,8)) sc.distances.dendogram_all_ts(ds, 'single', 'sbd', ts_hspace=5) plt.tight_layout()
Despite the fact distances are numerically different, there are not real differences between this the implied groups in the dendogram. This is going to be critical for this scenario since the performance of k-Shape over other algorithms, like K-Means, is going to be influenced by its ability to improve centroids to maximize the separation between the target clusters.
To start training a classification model based on k-Shape,
>>> ks = sc.clustering.KShape(2, rnd_labels=False) >>> ks.fit(ds, labels)
ks can be used to plot and query the generated centroids:
import matplotlib.pyplot as plt import shapelets.compute as sc from shapelets.data import load_dataset all_data = load_dataset('coffee_train') ds = all_data[:,1:].T labels = all_data[:,0] ks = sc.clustering.KShape(2, rnd_labels=False) ks.fit(ds, labels) ks.plot_centroids()
With a trained model, we can now use it to predict; the dataset
coffee_test contains another set of coffee
sequences to validate the model:
>>> test_data = load_dataset('coffee_test') >>> ds_test = test_data[:,1:].T >>> labels_test = test_data[:,0].astype("uint32") >>> predictions = ks.predict(ds_test)
And finally, the model can be evaluated using prebuilt metrics from
sklearn, starting with a measure
of the accuracy during training:
>>> from sklearn import metrics >>> import numpy as np >>> metrics.accuracy_score(np.array(labels), np.array(ks.labels_)) 1.0
A 100% success during training is, at the very least, unsettling; it is possible to retrain the model again,
asking to randomize the labels setting the parameter
True. k-Shape is not free of
local minima issues and it may require running the algorithm multiple times to choose the correct configuration.
Let’s measure the prediction’s accuracy:
>>> metrics.accuracy_score(np.array(labels_test), np.array(predictions)) 0.9642857142857143
And a confusion matrix:
>>> metrics.confusion_matrix(np.array(labels_test), np.array(predictions)) array([[14, 1], [ 0, 13]])
As you can see, the results using k-Shape are quite satisfactory, specially considering how small the training set is and how subtle the differences between the two classes are.