First Steps with Data Indexing

This tutorial shows how to index and store data with Data Indexing, using a simple similarity search as the example.

Concretely, we'll use a dataset of questions that Quora published on Kaggle.

info

Learn more about reading, loading, and transforming data with Data Access here.

Let's start by importing all the necessary libraries and reading the data:

from pathlib import Path
from time import perf_counter

from sentence_transformers import SentenceTransformer

from shapelets.data import sandbox
from shapelets.indices import EmbeddingIndex
from shapelets.storage import Shelf, Transformer

# Read data
playground = sandbox()
quora_data = playground.from_csv(
    rel_name='quora_data',
    paths=['questions.csv']
)

# Check schema of the data
print(quora_data)

Output
Schema
├── id - Int64
├── qid1 - Int64
├── qid2 - Int64
├── question1 - String[*]
├── question2 - String[*]
└── is_duplicate - Int64

There are two columns of questions with their corresponding identifiers, as well as the target variable for the Kaggle competition.

In our case, we'll simply focus on the questions from question1.

# Take a sample of questions
sample_query = """
SELECT
    question1
FROM 
    quora_data
LIMIT 3
"""
header = playground.execute(query=sample_query)
print(header.to_dict())

Output
{
  'question1': [
      'What is the step by step guide to invest in share market in india?', 
      'What is the story of Kohinoor (Koh-i-Noor) Diamond?', 
      'How can I increase the speed of my internet connection while using a VPN?'
  ]
}

Now we'll load the embeddings model that encodes text into numerical vectors.

We'll use a multilingual open-source transformer model from Hugging Face.

# Load embeddings model
model_name = 'quora-distilbert-multilingual'
embeddings_model = SentenceTransformer(model_name_or_path=model_name)

We now collect all the questions into a list and encode them:

# Get questions into a list
questions_query = """
SELECT
    question1
FROM 
    quora_data
"""
questions_dict = playground.execute(query=questions_query).to_dict()
questions = questions_dict['question1']

# Encode questions
print(f'Encoding {len(questions)} questions...')
start = perf_counter()
encoded_questions = embeddings_model.encode(sentences=questions)
end = perf_counter()
encoding_time = round(end - start, 4)
print(f'Took {encoding_time} seconds to encode all questions\n')

Output

Encoding 404351 questions...
Took 233.7485 seconds to encode all questions

It took 233.7485 seconds (almost 4 minutes) to map all text into embeddings.
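
Since this walkthrough times more than one step with perf_counter, the start/stop pattern can be wrapped in a small helper. The following is a minimal sketch using only the Python standard library; the name `timed` and the `timings` dictionary are hypothetical and not part of any library used here, and the summation is just a stand-in for the real call to encode().

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results=None):
    """Time the enclosed block and optionally record the duration in `results`."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f'{label} took {elapsed:.4f} seconds')


# Stand-in workload; in the tutorial this would be the call to encode().
timings = {}
with timed('encoding', timings):
    total = sum(i * i for i in range(100_000))
```

Collecting the durations in a dictionary makes it easy to compare the encoding and storing steps afterwards.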

We now focus on indexing and storing the data locally, for which we'll define archive and log directories first.

# Define paths
base_path = Path('./')
archive_path = base_path / 'archive'
index_path = base_path / 'log'

# Set up archive and index
archive = Shelf.local(
    chain=[Transformer.compressor()], 
    base_directory=archive_path
)
index = EmbeddingIndex.create(
    dimensions=encoded_questions[0].shape[0], 
    shelf=archive, 
    base_directory=index_path
)

It's now time to index each document:

# Store embeddings, using each question's position in the list as its key
print(f'Storing {len(questions)} vectors...')
start = perf_counter()
builder = index.create_builder()
for doc_id, vector in enumerate(encoded_questions):
    builder.add(
        embedding=vector, 
        external_key=doc_id
    )
builder.upsert()
end = perf_counter()
print(f'Took {end - start} seconds to store the data')

Output

Storing 404351 vectors...
Took 18.034041310002067 seconds to store the data

Storing all 404351 vectors took less than 20 seconds!
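
The loop above uses each question's position in the list as its external key, so the question text can only be recovered while that list is in memory. One possible way to keep keys resolvable in a later session is to save the key-to-text mapping to a file. This is a minimal sketch using only the Python standard library; the file name `keys.json` and the helpers `save_key_map` and `load_key_map` are hypothetical and not part of the library.

```python
import json
import tempfile
from pathlib import Path


def save_key_map(questions, path):
    """Write {external_key: question} to a JSON file."""
    key_map = {str(key): text for key, text in enumerate(questions)}
    Path(path).write_text(
        json.dumps(key_map, ensure_ascii=False), encoding='utf-8'
    )


def load_key_map(path):
    """Read the mapping back, restoring integer keys."""
    raw = json.loads(Path(path).read_text(encoding='utf-8'))
    return {int(key): text for key, text in raw.items()}


sample_questions = [
    'What is the step by step guide to invest in share market in india?',
    'What is the story of Kohinoor (Koh-i-Noor) Diamond?',
]
# A temporary directory keeps the sketch free of side effects.
with tempfile.TemporaryDirectory() as tmp_dir:
    map_path = Path(tmp_dir) / 'keys.json'
    save_key_map(sample_questions, map_path)
    restored = load_key_map(map_path)

print(restored[1])
```

JSON object keys are always strings, which is why the loader converts them back to integers.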

Let us now perform a similarity search.

# Perform similarity search
query = embeddings_model.encode(
    sentences='What is the best american university?'
)
results = index.search(target=query, k=5)

# Display results
for result in results:
    doc_id = f'Document ID: {result.oid}'
    distance = f'Distance: {result.distance}'
    question = f'\nQuestion: {questions[result.oid]}'
    print(f'{doc_id}, {distance} {question}')

Output
Document ID: 98321, Distance: 3.6755452156066895
Question: Which is the best university in the USA? 
Document ID: 390779, Distance: 4.471233367919922
Question: Which are the best universities in USA? 
Document ID: 270490, Distance: 6.144722938537598
Question: What are the Best university in US? 
Document ID: 16115, Distance: 8.84965991973877
Question: Why USA has the best universities? 
Document ID: 188879, Distance: 9.09403133392334
Question: What are some of the best public universities in the US?
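
To build intuition for what a k-nearest-neighbour query returns, here is a brute-force sketch in pure Python that scans every vector and keeps the k closest ones. It assumes Euclidean distance, which may not be the metric the index uses, and it is not how the index is implemented; the function `brute_force_search` and the toy vectors are hypothetical.

```python
import heapq
import math


def brute_force_search(vectors, target, k):
    """Return the k (key, distance) pairs closest to `target`, nearest first."""
    scored = (
        (math.dist(vec, target), key) for key, vec in enumerate(vectors)
    )
    return [(key, dist) for dist, key in heapq.nsmallest(k, scored)]


# Toy two-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_vectors = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 3.0],
    [5.0, 5.0],
]
hits = brute_force_search(toy_vectors, target=[0.9, 0.1], k=2)
for key, dist in hits:
    print(f'Document ID: {key}, Distance: {dist:.4f}')
```

A full scan like this costs time proportional to the number of stored vectors for every query, which is the cost a dedicated index is meant to reduce.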