Shapelets VectorDB vs. Weaviate:

Head-to-Head Performance Comparison

18 December 2024 | 3 minutes

Adrián Carrio

Lead Data Scientist

Shapelets VectorDB is a commercial vector database known for its ultra-low memory footprint and unmatched ingestion speed.


Weaviate vs. Shapelets VectorDB | Which Vector Database Performs Best?

Efficient and scalable vector databases are vital for managing large volumes of unstructured data in Generative AI, especially for tasks like similarity search and recommendation systems. Shapelets VectorDB and Weaviate stand out as leading contenders, each offering distinct strengths in vector storage, retrieval, and indexing. To provide a clear performance comparison, we conducted a detailed benchmark using VectorDBBench — a specialized tool for evaluating vector database efficiency across key metrics. This article explores the results, shedding light on each database’s performance, strengths, and potential applications.
Weaviate is an open-source vector database designed to store, search, and manage high-dimensional vector data, commonly used in industries like enterprise search, e-commerce, media, healthcare, legal tech, and education. In contrast, Shapelets VectorDB is a commercial solution renowned for its ultra-low memory footprint and unparalleled ingestion speed. It is available as both a C++ and Python library.
Selecting a vector database requires careful consideration of performance attributes like capacity, search speed, and filtering efficiency, as these factors directly impact the effectiveness of different applications. This benchmark emphasizes search performance, with a particular focus on fast nearest-neighbor search. Rapid search speeds are critical for applications such as real-time personalization and authentication, where swift responses enable dynamic ad targeting, personalized content recommendations, and identity verification through voice or facial recognition. These use cases demand high-speed nearest-neighbor searches to deliver instant, customized results.

Metrics and results

Let’s dive into each of the metrics evaluated in this benchmark:

| Index Building Time

This metric measures the time it takes for the database to build an index from a set of vectors. The index organizes vector embeddings, optimizing them for efficient similarity searches. A shorter index building time indicates faster readiness for data querying, which is particularly important when dealing with large or frequently updated datasets, like massive data ingestions required in disaster recovery scenarios.
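To make the metric concrete, the sketch below times the construction of an HNSW index with the open-source hnswlib library. This is purely illustrative and does not reflect the internal indexing code of Shapelets VectorDB or Weaviate; the dataset size and parameters (M, ef_construction) are arbitrary choices for the example.

```python
# Illustrative only: times an HNSW index build with hnswlib as a stand-in,
# not the actual indexing code of Shapelets VectorDB or Weaviate.
import time
import numpy as np
import hnswlib

dim, n = 768, 100_000                       # smaller than the 1M-vector benchmark, for a quick run
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)

start = time.perf_counter()
index.add_items(data, np.arange(n))         # bulk ingestion + index construction
build_seconds = time.perf_counter() - start
print(f"Index built over {n} vectors in {build_seconds:.1f} s")
```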

Index Building Time Benchmark

| Recall

Recall is a measure of search accuracy, reflecting the percentage of relevant items retrieved from the database. In vector databases, it’s especially important when finding the closest matches or “nearest neighbors” for a query vector. A higher recall rate signifies more accurate results, which is critical for applications where precision in similarity or relevance is essential, such as recommendation systems and search engines.
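For reference, recall@k can be computed by comparing the neighbor ids returned by an approximate index against the exact (brute-force) nearest neighbors. The minimal sketch below uses only NumPy and is independent of either database.

```python
# Minimal recall@k computation: compares approximate results against exact ones.
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """approx_ids, exact_ids: arrays of shape (num_queries, k) holding neighbor ids."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / exact_ids.size            # fraction of true neighbors retrieved

# Example: 2 queries, k = 3
exact  = np.array([[1, 2, 3], [4, 5, 6]])
approx = np.array([[1, 2, 9], [4, 5, 6]])
print(recall_at_k(approx, exact))           # 0.833...
```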

Recall Benchmark

| Latency

Latency is the time taken to complete a search query, from initiation to result. In vector databases, low latency is crucial for applications that require real-time or near-real-time responses, such as live recommendations, personalization, or identity verification. Lower latency translates to faster response times, enhancing user experience in applications with time-sensitive demands.
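A common way to report latency is to time each query individually and look at percentiles rather than averages, since tail latency is what users feel in real-time scenarios. The sketch below assumes a generic search(query, k) callable standing in for whichever client API is used; it is not tied to either database.

```python
# Measures per-query latency and reports p50 / p99 percentiles.
# `search` is a placeholder for whatever client call the database exposes.
import time
import numpy as np

def measure_latency(search, queries, k=10):
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search(q, k)                        # single nearest-neighbor query
        latencies.append(time.perf_counter() - start)
    lat = np.array(latencies)
    return np.percentile(lat, 50), np.percentile(lat, 99)
```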

Latency Benchmark

| Maximum QPS (Queries Per Second)

Maximum QPS indicates the maximum number of queries the database can handle per second without degradation in performance. This metric reflects the scalability and robustness of a vector database under high query loads, making it essential for applications that need to handle large volumes of simultaneous requests, such as large-scale recommendation systems or interactive search applications.
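Maximum QPS is usually estimated by driving the database with a fixed number of concurrent workers for a fixed time window and counting completed queries. The sketch below illustrates the idea with Python threads; search and queries are placeholders for the client under test, and the thread count and duration are configurable.

```python
# Rough throughput (QPS) measurement: N worker threads issue queries until a
# deadline; QPS = completed queries / elapsed seconds.
# `search` and `queries` are placeholders, not a specific client API.
import time
import itertools
from concurrent.futures import ThreadPoolExecutor

def measure_qps(search, queries, num_threads=8, duration_s=30, k=10):
    deadline = time.perf_counter() + duration_s

    def worker():
        done = 0
        for q in itertools.cycle(queries):   # keep issuing queries until time runs out
            if time.perf_counter() >= deadline:
                return done
            search(q, k)
            done += 1

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(worker) for _ in range(num_threads)]
        total = sum(f.result() for f in futures)
    return total / (time.perf_counter() - start)
```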

Maximum QPS Benchmark

COHERE: The Benchmark Dataset for NLP and Semantic Search

The dataset chosen for this benchmark is the COHERE dataset, a large-scale collection of high-quality text embeddings generated by Cohere’s language models. This dataset is designed to support a variety of natural language processing (NLP) tasks and provides embeddings for diverse text inputs. It enables efficient semantic search, text classification, recommendation, and question-answering applications. The COHERE dataset is frequently utilized in vector databases and machine learning models to enhance the understanding of semantic relationships within massive text corpora. It serves as a valuable resource for benchmarking vector database performance and for developing NLP solutions that require robust text comprehension and similarity matching.
The actual dataset used is available for download from Hugging Face. A typical scenario is used: indexing and querying 1 million embeddings with 768 dimensions. Nearest-neighbor search is measured across multiple levels of concurrency, ranging from 1 to 35 threads, for 30 seconds each. All tests were executed on CPU only, on an AWS EC2 instance of type t2.large (2 vCPUs, 8 GiB memory).
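As a rough illustration of that sweep, the snippet below reuses the hypothetical measure_qps helper sketched earlier and steps through increasing thread counts with a 30-second window per level, matching the setup described above. The brute-force search function and the small random corpus are stand-ins for the real database client and the 1M-vector COHERE dataset.

```python
# Illustrative concurrency sweep mirroring the benchmark setup (1-35 threads,
# 30 s per level). The brute-force `search` and random vectors are stand-ins
# for the real client and the 1M-vector COHERE dataset.
import numpy as np

rng = np.random.default_rng(0)
base = rng.random((10_000, 768), dtype=np.float32)       # stand-in corpus
queries = rng.random((100, 768), dtype=np.float32)

def search(q, k):
    dists = np.linalg.norm(base - q, axis=1)             # exact L2 distances
    return np.argpartition(dists, k)[:k]                  # ids of the k closest vectors

for num_threads in (1, 5, 10, 15, 20, 25, 30, 35):
    qps = measure_qps(search, queries, num_threads=num_threads, duration_s=30)
    print(f"{num_threads:>2} threads -> {qps:,.0f} queries/s")
```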
Our benchmark results, summarized in the following figure, indicate superior performance of Shapelets VectorDB across all evaluated metrics.

Benchmark Results Summary

Conclusion

The superior performance of Shapelets VectorDB in terms of both efficiency and accuracy highlights its ability to handle intensive vector search workloads effectively, making it a strong choice for applications that require high-speed, high-precision vector retrieval. As the vector database landscape continues to evolve, we will expand our analyses to include comparisons with other leading vector databases, providing further insight into performance benchmarks and the unique strengths of each solution. Be sure to watch for upcoming updates.

Want to apply Shapelets to your projects?

Contact us and let’s study your case.
