
Store & Index Data

info

Throughout this series of tutorials, we'll be using Amazon Products data. You can download the dataset here.

A common example of a RAG project is a recommender system: the LLM needs a set of product descriptions to make accurate recommendations. We'll show how to perform the basic steps of loading data into a vector database to enable future LLM predictions.

Creating Shelf and Index

We start by importing the necessary libraries:

from pathlib import Path
from time import perf_counter

import numpy as np

from sentence_transformers import SentenceTransformer

from shapelets.data import sandbox
from shapelets.indices import EmbeddingIndex
from shapelets.storage import Shelf, BlobStore, Transformer

As in any AI application, we need to collect the data we'd like to store. For simplicity, we'll use an existing dataset.

After renaming the original CSV file to products.csv, we check its schema:

# Read data and get schema
playground = sandbox()
amazon_data = playground.from_csv(
    rel_name='amazon_products',
    paths=['products.csv'],
    delimiter=','
)
print(amazon_data)
Output
Schema
├── Uniq Id - String[*]
├── Product Name - String[*]
├── Brand Name - String[*]
├── Asin - String[*]
├── Category - String[*]
├── Upc Ean Code - String[*]
├── List Price - String[*]
├── Selling Price - String[*]
├── Quantity - String[*]
├── Model Number - String[*]
├── About Product - String[*]
...
└── Product Description - String[*]
info

Learn how to efficiently read and process your data with Data Access here.

We'll focus on two columns: Product Name and About Product (the product descriptions).

To avoid flawed recommendations, we'll remove any empty descriptions:

# Filter empty descriptions
filter_query = """
SELECT
    "Product Name" AS product,
    "About Product" AS description
FROM
    amazon_products
WHERE
    "About Product" IS NOT NULL
"""
amazon_final_data = playground.from_sql(
    query=filter_query
)
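
If you're curious how many rows survive the filter, the same from_sql API can run a quick count. This is an optional check, reusing only calls already shown above:

# Optional: count products that keep a non-empty description
count_query = """
SELECT COUNT(*) AS kept
FROM amazon_products
WHERE "About Product" IS NOT NULL
"""
print(playground.from_sql(query=count_query).execute().to_dict())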

Then we extract the product names and descriptions into lists:

# Save product names and descriptions
amazon_dict = amazon_final_data.execute().to_dict()
products = amazon_dict['product']
descriptions = amazon_dict['description']
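
A quick optional check in plain Python confirms that the two lists have the same length and stay aligned:

# Optional: names and descriptions must line up one-to-one
assert len(products) == len(descriptions)
print(f'{len(products)} products with non-empty descriptions')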

Since similarity search operates on numeric vectors rather than raw text, we need to turn each description into an embedding. We select GTE-base-EN-v1.5 as the transformer model to encode the descriptions:

# Load transformer model
model_name = 'Alibaba-NLP/gte-base-en-v1.5'
embeddings_model = SentenceTransformer(
    model_name_or_path=model_name,
    trust_remote_code=True
)

# Encode data
start = perf_counter()
encoded_products = embeddings_model.encode(sentences=descriptions)
end = perf_counter()
encoding_time = round(end - start, 4)
print(f'Took {encoding_time} seconds to encode all descriptions')
Output
Took 92.0901 seconds to encode all descriptions
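
The result is a single NumPy array with one row per description (768 dimensions for this model). If encoding is slow on your machine, note that encode also accepts batch_size and show_progress_bar arguments:

# Each description is now a fixed-length vector
print(encoded_products.shape)  # e.g. (n_descriptions, 768)

# Optional: tune batching and show progress while encoding
# encoded_products = embeddings_model.encode(
#     sentences=descriptions,
#     batch_size=64,
#     show_progress_bar=True
# )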

Then, we need to create directories for storing and indexing the data (archive and log, respectively). Additionally, we'll create a directory for the original documents, called tape:

# Define paths
base_path = Path('./')
archive_path = base_path / 'archive'
tape_path = base_path / 'tape'
index_path = base_path / 'log'

# Create if not existing
archive_path.mkdir(exist_ok=True, parents=True)
tape_path.mkdir(exist_ok=True, parents=True)
index_path.mkdir(exist_ok=True, parents=True)

We instantiate Shelf, BlobStore and EmbeddingIndex objects, which will let us interact programmatically with our vector database:

# Shelf for storing data
archive = Shelf.local(
    chain=[Transformer.compressor()],
    base_directory=archive_path
)

# BlobStore for the original documents
tape = BlobStore.create(
    shelf=archive,
    base_directory=tape_path,
)

# Index for the vector data
index = EmbeddingIndex.create(
    dimensions=encoded_products[0].shape[0],  # length of each embedding
    shelf=archive,
    base_directory=index_path
)
info

Learn more about Shelf, Tape and EmbeddingIndex here.

After this, we write each document to the tape, which assigns it a unique external ID, and build the index by pairing each embedding with that ID:

# Store embeddings by iterating over the data
start = perf_counter()
builder = index.create_builder()
for product_index in range(len(products)):
    # Write the raw description bytes to the tape
    encoded_document = np.frombuffer(
        buffer=descriptions[product_index].encode(),
        dtype=np.uint8
    )
    ext_id = tape.write(
        mime='text/plain',
        buffer=encoded_document
    )
    # Link the embedding to its document through the external ID
    builder.add(
        embedding=encoded_products[product_index],
        external_key=ext_id
    )
builder.upsert()
end = perf_counter()
print(f'It took {end - start} seconds to store all embeddings')
Output
It took 0.4196530039989739 seconds to store all embeddings
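
Note that the uint8 buffer written to the tape is simply the UTF-8 bytes of each description, so the original text can always be recovered. A quick illustration in plain NumPy, with no Shapelets calls involved:

# The description round-trips losslessly through its uint8 representation
text = descriptions[0]
as_uint8 = np.frombuffer(buffer=text.encode(), dtype=np.uint8)
assert as_uint8.tobytes().decode() == text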

Loading an Index

What if we have already created an index? Do we need to rebuild it from scratch? Absolutely not!

First, we need to save our index object, in this case an EmbeddingIndex:

# Save index
index.save(index_path='embedding.index')

This will create a .index file that will be used when we call the load method:

# Load the saved index
index = EmbeddingIndex.load(
    index_path='./embedding.index',
    shelf=archive,           # your associated Shelf object
    store_directory='./log'  # your former index path
)
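
Once loaded, the index is ready for similarity queries. The exact query API depends on your Shapelets version, so the sketch below is hypothetical: it assumes a nearest-neighbour method (here called search) that returns the external keys of the closest matches. Check the EmbeddingIndex reference for the actual call:

# Encode a query with the same model used for the documents
query = 'wooden toy train for toddlers'
query_embedding = embeddings_model.encode(sentences=[query])[0]

# Hypothetical: search(embedding, k) -> external keys of the k nearest items
# matches = index.search(embedding=query_embedding, k=5)
# for ext_id in matches:
#     print(ext_id)  # then retrieve the original description from the tape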