
Similarity Search

Similarity Search encompasses all tasks related to retrieving relevant data (documents, samples, observations, etc.) given a query.

Suppose a user is asking how to use ShapeletsVecDB, and we encode this request in a 4-dimensional vector:

$$
\begin{align*} \text{Natural Language:} & \,\, \text{how to use Shapelets in Python?} \\ \text{Text vector:} & \,\, \left[\text{how},\, \text{to},\, \text{use},\, \text{Shapelets},\, \text{in},\, \text{Python},\, \text{?}\right] \\ \text{Encoded Query:} & \,\, \left[22,\, 34,\, 1,\, 0\right] \end{align*}
$$

Further assume that we have stored three vectors, each encoding a documentation file, in our ShapeletsVecDB instance:

$$
\begin{align*} \text{Shapelets Engine:} & \,\, \left[1,\, 11,\, -56,\, 20\right] \\ \text{ShapeletsVecDB Python:} & \,\, \left[20,\, 29,\, 3,\, -2\right] \\ \text{Shapelets Data Apps:} & \,\, \left[120,\, 33,\, -17,\, 25\right] \end{align*}
$$

We'd like to retrieve ShapeletsVecDB Python (the second vector), since it is the one most closely related to the initial request.

We can quantify how close two vectors are with a distance metric. One of the most popular is the Euclidean distance:

$$
d(\textbf{x}, \textbf{y}) = \sqrt{\sum_{i=1}^n \left(x_i - y_i\right)^2}
$$
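As a worked instance of this formula, take the encoded query [22, 34, 1, 0] and the ShapeletsVecDB Python vector [20, 29, 3, -2] from the example. Each term is the squared difference of one coordinate pair:

$$
\begin{align*} d\left(\text{Encoded Query, ShapeletsVecDB Python}\right) &= \sqrt{(22 - 20)^2 + (34 - 29)^2 + (1 - 3)^2 + (0 - (-2))^2} \\ &= \sqrt{4 + 25 + 4 + 4} \\ &= \sqrt{37} \approx 6.08 \end{align*}
$$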

Thus, we get:

$$
\begin{align*} d\left(\text{Encoded Query, Shapelets Engine}\right) &= 67.96 \\ d\left(\text{Encoded Query, ShapeletsVecDB Python}\right) &= 6.08 \\ d\left(\text{Encoded Query, Shapelets Data Apps}\right) &= 102.73 \end{align*}
$$

The minimum distance is approximately 6.08, corresponding to the ShapeletsVecDB Python document, so it is the closest vector to the query.
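The search described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the ShapeletsVecDB API: the helper names `euclidean` and `nearest` and the `documents` dictionary are assumptions introduced here, and the vectors are copied from the example.

```python
import math

# Encoded query and stored document vectors, copied from the example.
query = [22, 34, 1, 0]
documents = {
    "Shapelets Engine": [1, 11, -56, 20],
    "ShapeletsVecDB Python": [20, 29, 3, -2],
    "Shapelets Data Apps": [120, 33, -17, 25],
}


def euclidean(x, y):
    """Euclidean distance between two vectors of equal dimension."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def nearest(query, documents):
    """Return (name, distance) for the stored vector closest to the query.

    Hypothetical helper: a brute-force scan over every stored vector.
    """
    distances = {name: euclidean(query, vec) for name, vec in documents.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


name, distance = nearest(query, documents)
print(name, round(distance, 2))  # ShapeletsVecDB Python 6.08
```

A brute-force scan like this compares the query against every stored vector, which is fine for three documents but grows linearly with the size of the collection.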

The retrieved document can then be used either to supply an LLM with relevant context or for any other retrieval task.