facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.

Home Page: https://faiss.ai

Is it possible to compress 1 billion sentence embeddings (d=384) in an index under 4.5 GB? What type to use?

robotheart opened this issue

Summary

Hi, thanks to the Faiss team for making this library available!

We have a use case with nearly 1 billion sentence-embedding vectors, each of dimension 384. We need to build an index over all of them under a memory constraint of 4.5 GB maximum index size (ideally a bit smaller than that, since the dataset grows daily).

From my understanding, an index built with the factory string IVF65536_HNSW32,PQ32 would give us the smallest memory footprint [ref: https://towardsdatascience.com/ivfpq-hnsw-for-billion-scale-similarity-search-89ff2f89d90e], but when I build it the index still comes out at ~14 GB.
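
For reference, a minimal sketch of how an index with this factory string could be built (the random data and file name are placeholders, not our real pipeline; a real run would train on a much larger sample and add the 1B vectors in batches):

```python
import faiss
import numpy as np

d = 384  # sentence-embedding dimension

# Placeholder data standing in for the real corpus; in practice the IVF
# training set should be much larger (on the order of tens to hundreds of
# vectors per list for 65536 lists), and the full dataset added in batches.
xt = np.random.rand(200_000, d).astype("float32")

# Factory string from the question: IVF with 65536 lists, an HNSW32 graph
# as the coarse quantizer, and 32-byte product quantization per vector.
index = faiss.index_factory(d, "IVF65536_HNSW32,PQ32")

index.train(xt)
index.add(xt)
faiss.write_index(index, "ivf65536_hnsw32_pq32.index")  # serialize to disk
```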

Is there any other combination we should try? Or is an index of 4.5 GB not possible given how big our vectors/dataset are?

Thank you!

Faiss version: faiss-cpu 1.7.3

Installed from: PyPI [https://pypi.org/project/faiss-cpu/]

Running on:

  • [x] CPU
  • GPU

Interface:

  • C++
  • [x] Python

Reproduction instructions

NA

*Please also let me know if I should include more details, such as the Faiss version I'm using. Since this is just a general question, I omitted those for brevity.

IVF65536_HNSW32,PQ32 uses at least 40 GB for 1B vectors (32 bytes for the PQ code + 8 bytes for the vector ID), so it's unclear where the 14 GB number comes from.
1B vectors in 4.5 GB means you can allocate only 4.5 bytes per sentence, which is very little. This may be possible only if you can group the vectors in a meaningful way (e.g. if there are many small variations of the same vector).
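
For reference, a quick sketch of the arithmetic behind these numbers (per-vector storage only; the coarse quantizer and inverted-list overhead come on top):

```python
n = 1_000_000_000        # 1B vectors

pq_bytes = 32            # PQ32: one 32-byte code per vector
id_bytes = 8             # 64-bit vector ID stored alongside each code
print(n * (pq_bytes + id_bytes) / 1e9)   # 40.0 -> at least ~40 GB

budget_bytes = 4.5e9     # the 4.5 GB constraint
print(budget_bytes / n)  # 4.5 -> bytes available per vector
```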

Hi @mdouze Thank you so much for getting back to me on this!
Huh, so it sounds like whatever result I got doesn't actually make sense.
Thanks for breaking down the math; it sounds like it may not be possible for our use case.
Our dataset does have groups of vectors that are small variations of each other, but it isn't as though 10% of the data falls into one group. Rather, each grouping might be 0.001% of the data at most, and there are millions of these small groupings...

Is there any other advice for what we might try, if you can think of any? Thank you again for the help so far!