Retrieve data with Langchain

In this tutorial, you will learn how to use Langchain to build a search engine that answers natural-language questions about a set of documents.

First, add all the required imports.

from rich import print
import warnings

from dotenv import load_dotenv
from langchain_openai import OpenAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.schema import AttributeInfo
from shapelets.data import DataType
from shapelets.indices.langchain import (
    ShapeletsVecDB,
    ShapeletsVecDBTranslator
)

It is good practice to use a .env file to store your secrets. We will store the OpenAI API key there. You might also want to disable tokenizer parallelization, as it is reported to cause issues in some setups:

OPENAI_API_KEY='XXX'
TOKENIZERS_PARALLELISM=False

Now load these environment variables:

warnings.simplefilter('ignore')
if not load_dotenv('./.env'):
    print("Problem loading .env")

The documents used in this tutorial are movies. Each movie is added to a list as a Document, and each Document has two fields:

  • page_content, which contains a short description of the movie.
  • metadata, in which we can store additional data about the movie. Each of these additional items will be indexed independently, so that the system can rely on them to contextualize the searches.

docs = [
    Document(
        page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose',
        metadata={'year': 1993, 'rating': 7.7, 'genre': 'science fiction'},
    ),
    Document(
        page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...',
        metadata={'year': 2010, 'director': 'Christopher Nolan', 'rating': 8.2},
    ),
    Document(
        page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea',
        metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6},
    ),
    Document(
        page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them',
        metadata={'year': 2019, 'director': 'Greta Gerwig', 'rating': 8.3},
    ),
    Document(
        page_content='Toys come alive and have a blast doing so',
        metadata={'year': 1995, 'genre': 'animated'},
    ),
    Document(
        page_content='Three men walk into the Zone, three men walk out of the Zone',
        metadata={
            'year': 1979,
            'director': 'Andrei Tarkovsky',
            'genre': 'thriller',
            'rating': 9.9,
        },
    ),
]

For the metadata indices to work properly, we need to define the characteristics of each attribute in the metadata. In particular, each attribute should be assigned a name, a description (which will be used to determine whether that specific attribute is relevant to a user query) and a data type. Assigning a data type gives you more control over the indexing process and helps keep the memory footprint to a minimum.

metadata_field_info = [
    AttributeInfo(
        name='genre',
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type='string'),
    AttributeInfo(
        name='year',
        description='The year the movie was released',
        type='integer'),
    AttributeInfo(
        name='director',
        description='The name of the movie director',
        type='string'),
    AttributeInfo(
        name='rating',
        description='A 1-10 rating for the movie',
        type='float')
]

In this step we define various things:

  • The model used to create the embeddings for the movie descriptions. We will be using all-MiniLM-L6-v2 from HuggingFace.
  • The vector store. We will be adding:
    • The documents.
    • The embeddings corresponding to the documents' descriptions.
    • The metadata to be used as indices. Again, we can define very specific data types to ensure we use no more than the required amount of memory.
  • A description of the content of each document.
  • The LLM to be used. Here we will use Langchain's OpenAI wrapper with the temperature set to 0, so that query construction is deterministic.
  • Langchain's retriever, which takes all the previous items, the metadata field info and a structured query translator.

The structured query translator takes the filters extracted from the user query and applies them to the vector store (a sketch of this mechanism follows the setup code below).

embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

vector_store = ShapeletsVecDB.from_documents(docs, embeddings, indices={
    'genre': DataType.string(),
    'year': DataType.int16(),
    'director': DataType.string(),
    'rating': DataType.float32(4, 2),
})

document_content_description = 'Brief summary of a movie'
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vector_store,
    document_content_description,
    metadata_field_info,
    structured_query_translator=ShapeletsVecDBTranslator()
)
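
To make the mechanism more concrete, the sketch below builds by hand the kind of structured query the LLM produces for the first question in the next step, and passes it through the translator. This is illustration only, not part of the tutorial flow: it assumes the structured-query classes live in langchain_core.structured_query (the module path varies across Langchain versions) and that ShapeletsVecDBTranslator implements Langchain's standard Visitor interface.

from langchain_core.structured_query import (
    Comparator,
    Comparison,
    Operation,
    Operator,
    StructuredQuery,
)

# Hand-built equivalent of what the LLM generates for
# "What's a highly rated (above 8.5) science fiction film?"
structured_query = StructuredQuery(
    query='highly rated science fiction film',
    filter=Operation(
        operator=Operator.AND,
        arguments=[
            Comparison(comparator=Comparator.GT, attribute='rating', value=8.5),
            Comparison(comparator=Comparator.EQ, attribute='genre', value='science fiction'),
        ],
    ),
    limit=None,
)

# The translator rewrites the filter into the search arguments understood by the
# vector store (assuming ShapeletsVecDBTranslator follows the Visitor interface).
new_query, search_kwargs = ShapeletsVecDBTranslator().visit_structured_query(structured_query)
print(new_query, search_kwargs)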

Now we can ask some questions and get some answers:

question = "What's a highly rated (above 8.5) science fiction film?"
print(question)
for r in retriever.invoke(question):
    print(r)

question = "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
print(question)
for r in retriever.invoke(question):
    print(r)

Once you are finished, close the vector store to release its resources.

vector_store.close()
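
If you run several queries in a row, a convenient (purely optional) pattern is to wrap the work in try/finally so the store is released even when a query raises an exception:

try:
    for question in (
        "What's a highly rated (above 8.5) science fiction film?",
        "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated",
    ):
        print(question)
        for r in retriever.invoke(question):
            print(r)
finally:
    # Release the vector store even if one of the queries fails.
    vector_store.close()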