Mastering vector search at scale

Lessons learned from managing a six-billion-vector database, including challenges, insights, and optimization tips.

Table of contents

  1. Introduction to vector search
  2. The vector size dilemma: how much memory do I need?
  3. Quantization to reduce memory footprint
  4. What about disk memory?
  5. Conclusion

Vector search represents a true revolution in the field of search engines. Powered by the most advanced artificial intelligence techniques, search engines can now understand the meaning of words and deliver more accurate results, surpassing the limitations of lexical search, which relies solely on keywords.

However, despite vector search being marketed as the cure-all for the flaws and limitations of traditional search engines, from the tedious implementation of synonyms to the challenge of multilingualism, it comes with a cost: the cost of creating, managing, and searching through vectors, especially when dealing with tens or hundreds of millions of them.

In this blog, we will dive deep into vector search at scale, exploring the challenges of managing around six billion vectors. But first, if you want to learn more about vector search and how it works, you can follow this article: Understanding the differences between sparse and dense semantic vectors!

The vector size dilemma: how much memory do I need?

Dense vectors are numerical representations of heterogeneous entities such as text, images, or audio. They are typically generated using machine learning models based on transformer architectures. Each dimension of the vector can require up to 8 bytes, as in the case of a float64. Lightweight models often produce vectors with 384 dimensions, while more complex models can go up to 8192 dimensions.

Naturally, the higher the number of dimensions, the better the vector’s ability to capture semantic meaning. However, this improvement comes at the cost of significantly increased memory requirements.

Let’s do some math here.

Let’s suppose we want to build a search engine on top of a vector database composed of 1 billion vectors.
For simplicity, let’s assume that each vector corresponds to a document. In reality, a single document can be represented by multiple vectors, either because of its internal structure or because of chunking.

To facilitate semantic search and optimize it for our use case, Elasticsearch uses the HNSW (Hierarchical Navigable Small World) algorithm, which offers an excellent balance between execution speed and search result quality.

The RAM required for a float data type can be calculated using the following formula:

RAM = number of vectors * number of dimensions * 4

If the number of dimensions is the standard 1024, the required RAM will be:

RAM = 1,000,000,000 * 1024 * 4

Breaking it down:

  1. 1,000,000,000 vectors
  2. 1024 dimensions per vector
  3. 4 bytes per dimension

This results in:

RAM = 4,096,000,000,000 bytes ≈ 4,000 GB (roughly 4 TB)
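
As a quick sanity check, here is a minimal Python sketch of this calculation, plugging in the example figures above:

# Rough off-heap RAM estimate for raw float32 vectors.
NUM_VECTORS = 1_000_000_000   # 1 billion vectors
NUM_DIMENSIONS = 1024         # dimensions per vector
BYTES_PER_DIMENSION = 4       # float32

raw_vector_bytes = NUM_VECTORS * NUM_DIMENSIONS * BYTES_PER_DIMENSION
print(f"{raw_vector_bytes:,} bytes ≈ {raw_vector_bytes / 1e9:,.0f} GB")
# 4,096,000,000,000 bytes ≈ 4,096 GB, i.e. roughly 4 TB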

We also need to account for the memory required to build the HNSW graph, which can be calculated using the formula:

RAM = number of vectors * 4 * HNSW.m 

Where HNSW.m represents the number of connections each node can have within the HNSW graph. This parameter can be configured in Elasticsearch during index creation. By default, its value is set to 16. A higher number of connections results in a denser and more accurate graph, but it is slower to traverse and requires more memory.

For example, with 1 billion vectors, the required RAM for the HNSW graph is:

RAM = 1,000,000,000 * 4 * 16 

This results in 64 GB of RAM needed to hold the entire HNSW graph in memory.
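
Continuing the sketch, the graph overhead is the same kind of back-of-the-envelope arithmetic (m = 16 being the Elasticsearch default mentioned above):

# HNSW graph overhead: number_of_vectors * 4 bytes * m connections per node.
NUM_VECTORS = 1_000_000_000
HNSW_M = 16  # default number of connections per node in Elasticsearch

hnsw_graph_bytes = NUM_VECTORS * 4 * HNSW_M
print(f"{hnsw_graph_bytes / 1e9:,.0f} GB")  # 64 GB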

To summarize, performing semantic search on a vector database of 1 billion vectors with 1024 dimensions and a float32 element type would require roughly 4,064 GB of RAM for optimal performance. That said, it’s important to remember that the memory used for this type of operation is off-heap.

Typically, the most common configuration for a data node is 64 GB of RAM, with half of it dedicated to the JVM heap so that compressed OOPs (Ordinary Object Pointers) can be used.

At this point, if we were to roughly calculate the number of nodes required to perform semantic search:

Nodes = 4064 / 32

This would result in a cluster of 127 data nodes, to which you would need to add master nodes, ML nodes for in-cluster inference, and so on. Obviously, a cluster of this size can pose significant cost challenges, especially when it needs to be replicated across multiple environments (pre-production, development, etc.).
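
Roughly, the node count above is just the total off-heap memory divided by what a 64 GB node can realistically dedicate to vector search, as in this small sketch:

import math

TOTAL_OFF_HEAP_GB = 4_000 + 64  # raw float32 vectors + HNSW graph (rough figures from above)
OFF_HEAP_PER_NODE_GB = 32       # 64 GB node, with half reserved for the JVM heap

data_nodes = math.ceil(TOTAL_OFF_HEAP_GB / OFF_HEAP_PER_NODE_GB)
print(data_nodes)  # 127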

But don’t despair, quantization comes to our rescue!

Quantization to reduce memory footprint

Scalar quantization helps minimize memory usage by converting each element type to a more “compact” version. The conversion can go from:

Float32 -> Float16 -> Int8 -> Int4 -> BBQ (in this case, we are considering Elasticsearch’s automatic quantization, which uses Lucene).

Of course, quantization comes at the cost of quality. It’s always important to evaluate whether the loss caused by quantization is acceptable in terms of search result relevance. But what does quantization mean in terms of memory savings?

Let’s take Int8 quantization as an example. In this case, the required RAM for 1 billion vectors can be calculated as follows:

RAM = number of vectors * (number of dimensions + 4)

So, for 1 billion vectors with 1024 dimensions, the required memory will be around 1 terabyte.
In this case, the quality loss is minimal, and at the same time, we would only need around 32 data nodes with 64 GB of RAM each to properly run the semantic search. BBQ quantization allows us to achieve this with even fewer resources.
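
For reference, quantization is configured on the dense_vector field at index creation time. A minimal sketch using the official Python client could look like the following; the index name, field name, and endpoint are placeholders, and the available index_options types depend on your Elasticsearch version:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Sketch: a dense_vector field with int8 scalar quantization (int8_hnsw).
es.indices.create(
    index="my-semantic-index",  # placeholder index name
    mappings={
        "properties": {
            "my_vector": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine",
                "index_options": {
                    "type": "int8_hnsw",  # alternatives include "int4_hnsw" and "bbq_hnsw"
                    "m": 16
                }
            }
        }
    },
)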

Here are the formulas to calculate the necessary memory, in bytes:

  • element_type: float:
    num_vectors * num_dimensions * 4

  • element_type: float with quantization int8:
    num_vectors * (num_dimensions + 4)

  • element_type: float with quantization int4:
    num_vectors * (num_dimensions / 2 + 4)

  • element_type: float with quantization bbq:
    num_vectors * (num_dimensions / 8 + 12)

Don’t forget to add the memory required for HNSW, as we’ve done previously, to calculate the total memory needed for your semantic search setup.
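
Putting the formulas together, a small helper like the one below makes it easy to compare configurations. This is just the arithmetic above wrapped in a function, not an official sizing tool:

def vector_ram_bytes(num_vectors, num_dimensions, quantization="none"):
    """Approximate off-heap RAM for the vector data itself (element_type: float)."""
    if quantization == "none":        # raw float32
        per_vector = num_dimensions * 4
    elif quantization == "int8":
        per_vector = num_dimensions + 4
    elif quantization == "int4":
        per_vector = num_dimensions / 2 + 4
    elif quantization == "bbq":
        per_vector = num_dimensions / 8 + 12
    else:
        raise ValueError(f"unknown quantization: {quantization}")
    return int(num_vectors * per_vector)

def hnsw_graph_bytes(num_vectors, m=16):
    """Approximate RAM for the HNSW graph (m connections per node)."""
    return num_vectors * 4 * m

total = vector_ram_bytes(1_000_000_000, 1024, "int8") + hnsw_graph_bytes(1_000_000_000)
print(f"{total / 1e9:,.0f} GB")  # ≈ 1,092 GB for int8 quantization plus the HNSW graph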

What about disk memory?

Disk memory is a more complicated matter, but let’s try to clarify it.

The vector is saved in its float32 version inside a special data structure in Lucene called knn_vector. The original version of the vector is also stored within _source. However, to save space, it is possible to exclude it from _source.

To do so, just add the excludes parameter to your mapping as follows:

"mappings": {
      "_source": {
        "excludes": [
          "your vector field"
        ]
      }
//The rest of your mapping configuration

By doing so, you can save approximately 4 TB for 1 billion vectors. Everything seems fine… well, not exactly.

By default, if we choose the automatic quantization performed by the Lucene engine, Elasticsearch will store not only the quantized vector in the knn_vector object, but also the Float32 version of the vector. As of version 8.17, it is not possible to completely exclude the unquantized vector.

One possible solution would be to quantize the vector outside of Elasticsearch and index the Int8 version by changing the element_type.
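
As an illustration only, a naive version of such an external quantization step might look like this, using a simple per-vector symmetric scale; real pipelines usually derive the scale from a calibration set, and the resulting vectors would be indexed into a dense_vector field with element_type set to byte:

import numpy as np

def quantize_to_int8(vector: np.ndarray) -> np.ndarray:
    """Naive symmetric int8 quantization of a single float32 vector."""
    scale = 127.0 / np.max(np.abs(vector))            # map the largest component to +/-127
    return np.clip(np.round(vector * scale), -128, 127).astype(np.int8)

float_vector = np.random.rand(1024).astype(np.float32) * 2 - 1  # dummy embedding
int8_vector = quantize_to_int8(float_vector)
print(int8_vector.dtype, int8_vector.nbytes)  # int8 1024 (instead of 4096 bytes)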

Conclusion

As we’ve seen throughout this blog, while vector search offers powerful capabilities, especially in terms of semantic search, it comes with its own set of challenges, particularly when scaling to handle billions of vectors. From understanding the memory requirements to managing disk storage efficiently, we’ve explored how important it is to properly configure your infrastructure to ensure optimal performance.

The size of the vectors, the number of dimensions, and the choice of quantization techniques all significantly impact both RAM and disk memory usage. By leveraging quantization—such as Int8 or BBQ—you can substantially reduce memory consumption while maintaining an acceptable level of quality in search results.

Moreover, optimizing disk usage by excluding unquantized vectors from _source can further contribute to space savings, although certain limitations in Elasticsearch’s default behavior require workarounds. In terms of infrastructure, as seen in our example, a large-scale semantic search system can still be run with relatively few resources if quantization is applied effectively.

In summary, while building and maintaining a large-scale semantic search engine using vectors may seem daunting due to the substantial resource requirements, understanding the nuances of memory management and utilizing optimization techniques such as quantization can drastically reduce costs and make it feasible even for large datasets. Keep in mind that balancing memory efficiency and result quality is key to building an effective search solution at scale.