Indexing in Vector Databases

Indexing in vector databases makes querying high-dimensional data efficient. It involves building a data structure, known as an Index, that allows quick retrieval of the vectors most similar to a given query vector. Without an index, every query must be compared against every stored vector; with one, the database examines only a small subset of candidates, which speeds up the search considerably. This matters most in applications with large datasets, such as image recognition, natural language processing, and recommendation systems, where similar items must be retrieved both quickly and accurately.
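To make the cost of an exhaustive search concrete, here is a minimal sketch of the baseline that an index is meant to avoid. It is illustrative only: the function names `cosine_similarity` and `exhaustive_search` are made up for this example, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def exhaustive_search(vectors, query, k=1):
    """Score every stored vector against the query.

    The cost is proportional to (number of vectors) x (dimension) per query,
    which is the work an index tries to avoid.
    """
    scored = [(cosine_similarity(v, query), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = exhaustive_search(vectors, [1.0, 0.0], k=2)
print(nearest)
```

Every query touches all stored vectors here, so query time grows linearly with the size of the dataset.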

What is an Index?

An Index in a vector database is a specialized data structure that organizes vectors to enable fast similarity searches. Common types of indices include tree-based structures like KD-trees, graph-based structures like HNSW (Hierarchical Navigable Small World), and hashing techniques like LSH (Locality-Sensitive Hashing). Each type has strengths and weaknesses that depend on the nature of the data and the requirements of the application, such as query speed, memory usage, and the dimensionality of the vectors. For example, KD-trees work well in low dimensions but lose much of their advantage as dimensionality grows, while HNSW and LSH are approximate methods that trade a small amount of accuracy for speed.
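As an illustration of the hashing family, the sketch below implements one common LSH variant, random-hyperplane hashing for cosine similarity. It is a simplified example under stated assumptions, not the index used by any particular database: the class name `HyperplaneLSH` and its parameters are hypothetical, and a practical index would use several hash tables and re-rank the candidates by true similarity.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Toy random-hyperplane LSH index (single hash table)."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        # Each plane is a random direction; a vector's bit is the side it falls on.
        self.planes = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)
        ]
        self.buckets = defaultdict(list)

    def _signature(self, vec):
        """One boolean per plane: is the dot product non-negative?"""
        return tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0.0 for plane in self.planes
        )

    def add(self, item_id, vec):
        """Place the item in the bucket given by its signature."""
        self.buckets[self._signature(vec)].append(item_id)

    def candidates(self, query):
        """Return only the items that share the query's bucket."""
        return list(self.buckets.get(self._signature(query), []))


lsh = HyperplaneLSH(dim=3)
lsh.add("a", [1.0, 2.0, 3.0])
same_direction = lsh.candidates([2.0, 4.0, 6.0])     # same direction as "a"
opposite = lsh.candidates([-1.0, -2.0, -3.0])        # opposite direction
```

Vectors pointing in similar directions tend to fall on the same side of most hyperplanes, so they are likely to share a bucket; a query then inspects one bucket instead of the whole dataset.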

Why Indexing?

Benefits of Indexing

The primary benefit of indexing is the significant reduction in query times, which is critical for applications requiring real-time or near-real-time responses. By efficiently organizing the data, an index allows the database to quickly narrow down the search space to the most relevant vectors, thus improving performance. Moreover, indices help in managing and scaling large datasets by reducing the computational resources needed for search operations. Properly designed and implemented indexing strategies can make vector databases highly effective for tasks involving similarity searches and recommendations.

Indexing with Shapelets

Our vector database supports both embeddings (vectors) and scalar data, with a dedicated index type for each format. Each index manages its data in blocks and closes any block that has not been used within a certain time frame, discarding it from the index. The information is not lost, however: a Shelf archives these closed blocks for long-term storage. When a query is made, the relevant block is selected by similarity search, and if that block has been closed, the Shelf loads it back into the index.
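The block lifecycle described above can be sketched roughly as follows. This is a conceptual illustration only and does not reflect the actual Shapelets API: the `BlockIndex` class, the `Shelf` methods, the `max_idle` parameter, and the explicit `now` argument (a logical clock, used so the example is deterministic) are all assumptions. The similarity search that selects which block to open is left out.

```python
class Shelf:
    """Hypothetical long-term store for closed blocks."""

    def __init__(self):
        self._archive = {}

    def store(self, block_id, block):
        self._archive[block_id] = block

    def retrieve(self, block_id):
        return self._archive[block_id]

    def __contains__(self, block_id):
        return block_id in self._archive


class BlockIndex:
    """Hypothetical index that keeps only recently used blocks open."""

    def __init__(self, shelf, max_idle):
        self.shelf = shelf
        self.max_idle = max_idle  # idle time after which a block is closed
        self.open_blocks = {}     # block_id -> block data
        self.last_used = {}       # block_id -> time of last access

    def add_block(self, block_id, block, now):
        self.open_blocks[block_id] = block
        self.last_used[block_id] = now

    def evict_idle(self, now):
        """Close blocks idle for longer than max_idle and archive them."""
        idle = [b for b, t in self.last_used.items() if now - t > self.max_idle]
        for block_id in idle:
            self.shelf.store(block_id, self.open_blocks.pop(block_id))
            del self.last_used[block_id]

    def get_block(self, block_id, now):
        """Return a block, reloading it from the Shelf if it was closed."""
        if block_id not in self.open_blocks:
            self.open_blocks[block_id] = self.shelf.retrieve(block_id)
        self.last_used[block_id] = now
        return self.open_blocks[block_id]


shelf = Shelf()
index = BlockIndex(shelf, max_idle=10)
index.add_block("b1", [[0.1, 0.2]], now=0)
index.add_block("b2", [[0.9, 0.8]], now=8)
index.evict_idle(now=12)                   # b1 has been idle for 12 > 10, so it is shelved
closed_after_eviction = "b1" not in index.open_blocks
restored = index.get_block("b1", now=13)   # reloaded from the Shelf on demand
```

In this sketch the Shelf keeps its archived copy after a block is reloaded; whether the real system does the same is not stated here.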

Furthermore, original data can be retrieved thanks to Tape, a smart storage system that keeps track of the raw data before encoding.