# Have I seen this before?¶

Querying a sequence to find the closest match is a fundamental operation in time series pattern mining. This problem presents itself in a range of different contexts:

- **Nearest neighbor**: find the closest match to a query or sample series.
- **Range queries**: find all time series in a collection whose distance to a query is less than a threshold.
- **Density estimation**: count the number of time series in a set whose distance to a query is less than a threshold.

Good techniques for searching have been around for a while. For example, [1] describes scalable techniques for solving
this problem under both DTW and Euclidean distances, based on heuristics such as early abandoning. While this
solution offers good amortized execution times, the performance of early-abandoning techniques depends
significantly on the nature of the data set itself, keeping the worst-case complexity at \(O(nm)\), where `n` is the
length of the time series and `m` is the length of the query.

Shapelets-Compute implements MASS [2], a convolution-based method that reduces the execution cost to a
fixed \(O(n \log n)\). It is the basic concept behind the matrix profile, as it produces a distance profile,
that is, a vector of distances between the query and *every subsequence of the time series*.
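The core idea, a sliding dot product turned into z-normalized distances, can be sketched in plain NumPy. This is an illustrative reimplementation, not the library's code, and `np.convolve` stands in for the FFT-based product that yields the \(O(n \log n)\) bound:

```python
import numpy as np

def distance_profile(query, series):
    """Z-normalized Euclidean distance between `query` and every
    length-m window of `series` (assumes no constant windows)."""
    m = len(query)
    # Sliding dot product QT[i] = sum_j query[j] * series[i+j];
    # MASS computes this product with an FFT to reach O(n log n).
    qt = np.convolve(series, query[::-1], mode='valid')
    # Moving mean and std of every window via cumulative sums.
    cum = np.cumsum(np.concatenate(([0.0], series)))
    cum2 = np.cumsum(np.concatenate(([0.0], series ** 2)))
    mu = (cum[m:] - cum[:-m]) / m
    sigma = np.sqrt(np.maximum((cum2[m:] - cum2[:-m]) / m - mu ** 2, 0.0))
    # Pearson correlation between each window and the query ...
    corr = (qt - m * query.mean() * mu) / (m * query.std() * sigma)
    # ... mapped to z-normalized Euclidean distance.
    return np.sqrt(np.maximum(2 * m * (1.0 - corr), 0.0))
```

The clamping with `np.maximum` only guards against tiny negative values caused by floating-point round-off; the mathematics guarantees both quantities are non-negative.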

To drive the tutorial, let's use a well-known example based on the accelerometer series generated by a robot dog walking over concrete, then over a carpet, and finally back over concrete again:

```
>>> import shapelets.compute as sc
>>> from shapelets.data import load_dataset
>>> robot = load_dataset('robot_dog')
>>> robot.shape
(13000, 1)
```

Next, load a query: a previously unseen sample of the same robot walking over a carpet:

```
>>> query = load_dataset('carpet_query')
>>> query.shape
(100, 1)
```

We can now run the `MASS` algorithm:

```
>>> robot = sc.normalization.zscore(robot)
>>> query = sc.normalization.zscore(query)
>>> r = sc.matrixprofile.mass(query, robot)
>>> r.shape
(12901, 1)
```

The result of executing `mass` is a distance profile: a vector containing the distance between the query and
every moving window of the same length as the query. There are `12901` points in the output, since `13000-100+1` is the
exact number of windows processed.

To look for the **closest match**, we simply need to find the minimum distance in the distance profile:

```
>>> index, min_dst = sc.argmin(r)
>>> index
7479
```

```
import shapelets.compute as sc
import matplotlib.pyplot as plt
from shapelets.data import load_dataset
robot = sc.normalization.zscore(load_dataset('robot_dog'))
query = sc.normalization.zscore(load_dataset('carpet_query'))
r = sc.matrixprofile.mass(query, robot)
index, _ = sc.argmin(r)
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(range(index-50, index+150), robot[index-50:index+150], label='series')
ax.plot(range(index, index+100), query, label='query')
ax.axvline(x=index, linestyle='--', color = 'k', alpha=0.4)
plt.legend()
plt.show()
```

Quite importantly, the same computation has produced the distance to every possible subsequence, so answering a **K-NN** query is
a straightforward task:

```
>>> indices, values = sc.sort_index(r)
>>> ten_closest = indices[:10]
```
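One caveat worth noting when picking the k best windows: start positions adjacent to the best match tend to cover almost the same data, so the top indices often describe one match several times. A plain NumPy sketch of top-k selection with a simple exclusion zone (a hypothetical helper, not part of the Shapelets API) illustrates the idea:

```python
import numpy as np

def top_k_matches(dist, k, m):
    """Return the k lowest-distance window starts, skipping any start
    that overlaps an already-selected window (exclusion zone of size m)."""
    order = np.argsort(dist)             # ascending distance
    picked = []
    for idx in order:
        if all(abs(idx - p) >= m for p in picked):
            picked.append(int(idx))
            if len(picked) == k:
                break
    return np.array(picked)

# Toy profile: true matches at 10 and 50, plus a trivial
# near-duplicate at 12 that overlaps the match at 10.
dist = np.ones(100)
dist[10], dist[12], dist[50] = 0.0, 0.1, 0.2
top_k_matches(dist, k=2, m=5)            # -> array([10, 50])
```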

```
import shapelets.compute as sc
import matplotlib.pyplot as plt
import numpy as np
from shapelets.data import load_dataset
robot = sc.normalization.zscore(load_dataset('robot_dog'))
query = sc.normalization.zscore(load_dataset('carpet_query'))
r = sc.matrixprofile.mass(query, robot)
indices, values = sc.sort_index(r)
fig, ax = plt.subplots(figsize=(16, 8))
for i in np.array(indices[:10]):
    ax.plot(robot[i:i+100], color='r', alpha=0.1)
ax.plot(query)
plt.show()
```

Finally, filtering the indices whose distance falls below the 0.5% quantile of all distances produces the
basis from which we can compute a **density estimation**:

```
>>> import numpy as np
>>> q = np.quantile(values, 0.005)
>>> selected_indices = indices[values <= q]
```
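The quantile filter can be illustrated on a synthetic profile with plain NumPy; the planted low-distance region and the random values here are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
dist = rng.random(12901)          # stand-in for the distance profile
dist[6000:7000] *= 0.01           # plant a cluster of very close matches

q = np.quantile(dist, 0.005)      # keep only the lowest 0.5% of distances
selected = np.flatnonzero(dist <= q)

# Most selected starts fall inside the planted region, so their
# mean and spread summarize where the query's matches concentrate.
center, spread = selected.mean(), selected.std()
```

With 12901 points, the 0.005 quantile keeps roughly 65 indices, and their concentration around the planted region is what the histogram in the plot below visualizes for the real data.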

```
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import shapelets.compute as sc
from shapelets.data import load_dataset
robot = sc.normalization.zscore(load_dataset('robot_dog'))
query = sc.normalization.zscore(load_dataset('carpet_query'))
r = sc.matrixprofile.mass(query, robot)
index, min_dst = sc.argmin(r)
indices, values = sc.sort_index(r)
q = np.quantile(values, 0.005)
selected_indices = indices[values <= q]
mean = sc.mean(selected_indices)
std = sc.std(selected_indices) * 1.5
fig, ax = plt.subplots(2,1,figsize=(16, 8),gridspec_kw={'height_ratios': [5, 1]})
ax[0].plot(robot)
ax[0].axvspan(mean-std, mean+std, color='yellow', alpha = 0.4)
ax[0].axvline(x=index, linestyle='--', color = 'k', alpha=0.8)
sns.histplot(np.array(selected_indices), kde=True, ax=ax[1])
ax[1].set(xlim=(0, robot.shape[0]))
fig.tight_layout()
plt.show()
```

As the plot clearly shows, the region with the highest density corresponds to the central region of the series, which is consistent with the type of query run.

## Batching queries¶

Shapelets' implementation of **MASS** supports batching, that is, multiple queries can be run against multiple sequences
in a single operation, maximizing the level of parallelism available on your CUDA or OpenCL device.

To run this mode, simply stack multiple queries column-wise, stack multiple sequences column-wise, and run the operation. As an example, let's extract 10 random subsequences of size 100 from the robot sequence and run them against that very same sequence:

```
>>> l = 100
>>> num_queries = 10
>>> max_index = robot.shape[0] - l
>>> queries = sc.empty((l, num_queries), dtype=robot.dtype)
>>> for i in range(num_queries):
...     x = np.random.randint(0, max_index)
...     queries[:, i] = robot[x:x+l]
>>> multiple_results = sc.matrixprofile.mass(queries, robot)
>>> multiple_results.shape
(12901, 10)
```

When running in batch mode, the results will have as many rows as moving windows (the same as in the 1:1 case), but there
will be a new column per query; similarly, if we had multiple series, there would be one additional dimension. In general,
if the operation has `q` queries of length `m`, over `s` series of length `n`, the result will be a three-dimensional
array with dimensions `(n-m+1, q, s)`.
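This layout can be emulated for the single-series case with a brute-force NumPy loop, producing one z-normalized distance-profile column per query. This is illustrative only; MASS computes each column in \(O(n \log n)\), and the function name here is made up:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def znorm(x):
    return (x - x.mean()) / x.std()

def batched_profiles(queries, series):
    """Brute-force stand-in for batched MASS over a single series:
    output column j is the distance profile of queries[:, j]."""
    m, q = queries.shape
    windows = sliding_window_view(series, m)       # shape (n-m+1, m)
    zw = np.apply_along_axis(znorm, 1, windows)    # z-normalize each window
    out = np.empty((len(series) - m + 1, q))
    for j in range(q):
        out[:, j] = np.linalg.norm(zw - znorm(queries[:, j]), axis=1)
    return out

rng = np.random.default_rng(0)
series = rng.standard_normal(500)
queries = rng.standard_normal((50, 3))
queries[:, 0] = series[100:150]    # query 0 is an exact subsequence
profiles = batched_profiles(queries, series)
profiles.shape                     # -> (451, 3), i.e. (n-m+1, q)
```

Because query 0 appears verbatim in the series, its column reaches (numerically) zero at row 100, exactly the behavior one relies on when reading batched results.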

## References¶

- [1] Thanawin Rakthanmanon, Bilson J. L. Campana, Abdullah Mueen, Gustavo E. A. P. A. Batista, M. Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn J. Keogh. **Searching and mining trillions of time series subsequences under dynamic time warping.** KDD 2012: 262-270.
- [2] Abdullah Mueen, Yan Zhu, Michael Yeh, Kaveh Kamgar, Krishnamurthy Viswanathan, Chetan Gupta, Eamonn Keogh. **The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance.** August 2017.