Have I seen this before?

Querying a sequence to find the closest match is a fundamental operation in time series pattern mining. This problem presents itself in range of different contexts:

  1. Find the Nearest Neighbor to a query or sample series.

  2. Range queries: Find all time series in a collection whose distance is less than a threshold

  3. Density estimation: Count the number of time series in a set where the distance to a query is less than a threshold.

Good techniques for searching have been around for a while. For example in 1, one finds scalable techniques to resolve this problem under both DTW and Euclidean distances, based on heuristic techniques like early abandoning among others. This solution, whilst offering good amortized execution times, early abandoning techniques performance depend significantly on the nature of the data set itself, keeping the worse case scenario \(O(nm)\), where n is the length of the time series and m the length of the query.

Shapelets-Compute implements MASS (2), which is a convolution based method that reduces the execution costs to a fixed \(O(n\mathrm{log}(n))\). It is the basic concept behind matrix profile as it produces a distance profile, that is, a distance vector between the query to every subsequence of the time series.

To drive the tutorial, let’s use an well-known example based on the accelerometer series generated by dog robot walking through concrete, a carpet and finally back to the carpet again:

>>> import shapelets.compute as sc
>>> from shapelets.data import load_dataset
>>> robot = load_dataset('robot_dog')
>>> robot.shape
(13000, 1)
../_images/tutorial_searching-1.png

Next, load a query, which relates to a non existing sample of the same robot walking through a carpet:

>>> query = load_dataset('carpet_query')
>>> query.shape
(100, 1)

We should be able now to run MASS algorithm:

>>> robot = sc.normalization.zscore(robot)
>>> query = sc.normalization.zscore(query)
>>> r = sc.matrixprofile.mass(query, robot)
>>> r.shape
(12901,1)

The result of executing mass is a distance profile, a vector containing the distance between the query and every single moving window of size query. There are 12901 points in the output, as 13000-100+1 gives the exact amount of windows processed.

To look for the closest match, we simply need to find the minimum distance in the distance profile:

>>> index, min_dst = sc.argmin(r)
>>> index
7479
../_images/search_a.png

Quite importantly, the same computation has produced the distance to every possible subsequence, hence answering K-NN is a straight forward task:

>>> indices, values = sc.sort_index(r)
>>> ten_closest = indices[:10]
../_images/search_b.png

Finally, filtering the indices whose distance is within 10% threshold of the minimum distance, produces the basis out of which we can compute a density estimation:

>>> q = np.quantile(values, 0.005)
>>> selected_indices = indices[values <= q]
../_images/search_c.png

As we can clearly see graphically, the region with the highest density corresponds to the center region, which is consistent type of query run.

Batching queries

Shapelets’ implementation of MASS supports batching, that is, in the same operation multiple queries can be run against multiple sequences, maximizing the level of parallelism found in your CUDA or OpenCL device.

To run this mode, simply stack multiple queries, column wise, against multiple sequences (column wise) and run the operation. As an example, let’s extract 10 random subsequences of size 100 from the robot sequence and run them against the very same sequence:

>>> l = 100
>>> queries = 10
>>> max_index = robot.shape[0] - l
>>> queries = sc.empty((l,queries), dtype = robot.dtype)
>>> for i in range(10):
...     x = np.random.randint(0, max_index)
...     queries[:,i] = robot[i:i+l]
>>> multiple_results = sc.matrixprofile.mass(queries, robot)
>>> multiple_results.shape
(12901, 10)

When running in batch mode, the results will have as many rows as moving windows (same as in the 1:1 case), but there will be a new column per query; similarly, if we had multiple series, there will be one additional dimension. In general, if the operation has q queries of length m, over s series of length n, the result will be a three dimensional array with dimensions (n-m+1, q, s).

References

1
Searching and mining trillions of time series subsequences under dynamic time warping.
Thanawin Rakthanmanon, Bilson J. L. Campana, Abdullah Mueen, Gustavo E. A. P. A. Batista, M. Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn J. Keogh.
KDD 2012: 262-270
2
The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance.
Mueen, Abdullah and Zhu, Yan and Yeh, Michael and Kamgar, Kaveh and Viswanathan, Krishnamurthy and Gupta, Chetan and Keogh, Eamonn.
August 2017