zilliztech / knowhere

Knowhere is an open-source vector search engine, integrating FAISS, HNSW, etc.

Sparse inverted index: support vector dim in full uint32 range

zhengbuqian opened this issue

In the sparse inverted index, each dimension has an inverted index lookup table (LUT) that tracks the ids of the vectors with non-zero values at that dimension. The current implementation uses a std::vector<LUT> inverted_lut_, where inverted_lut_[i] is the lookup table for dim i and inverted_lut_.size() is the max dim. I used a vector instead of an unordered_map because, even though both offer O(1) reads (average case for unordered_map), the higher constant factor of a map lookup would hurt performance considerably in such a read-heavy, performance-sensitive use case.
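
To make the layout concrete, here is a minimal sketch of a vector-indexed inverted index with inner-product scoring. The names Posting, SparseRow, VectorInvertedIndex, Add and Scores are hypothetical stand-ins for illustration, not Knowhere's actual types or API; only inverted_lut_ and LUT come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not Knowhere's actual definitions.
struct Posting {
    std::uint32_t id;  // id of a vector that is non-zero at this dim
    float val;         // its value at this dim
};
using LUT = std::vector<Posting>;
using SparseRow = std::vector<std::pair<std::uint32_t, float>>;  // (dim, value)

class VectorInvertedIndex {
public:
    // Record every non-zero entry of `row` in the LUT of its dimension.
    void Add(std::uint32_t id, const SparseRow& row) {
        for (const auto& [dim, val] : row) {
            if (val == 0.0f) continue;
            // Grow the table so that inverted_lut_[dim] exists.
            if (dim >= inverted_lut_.size()) {
                inverted_lut_.resize(static_cast<std::size_t>(dim) + 1);
            }
            inverted_lut_[dim].push_back({id, val});
        }
        if (id >= n_rows_) n_rows_ = static_cast<std::size_t>(id) + 1;
    }

    // Inner product of `query` with every stored vector, via the LUTs.
    std::vector<float> Scores(const SparseRow& query) const {
        std::vector<float> scores(n_rows_, 0.0f);
        for (const auto& [dim, qval] : query) {
            if (dim >= inverted_lut_.size()) continue;  // dim never seen
            for (const Posting& p : inverted_lut_[dim]) {
                assert(p.id < scores.size());
                scores[p.id] += qval * p.val;
            }
        }
        return scores;
    }

    // Number of LUT slots allocated (largest dim seen, plus one).
    std::size_t Slots() const { return inverted_lut_.size(); }

private:
    std::vector<LUT> inverted_lut_;  // inverted_lut_[i] is the LUT for dim i
    std::size_t n_rows_ = 0;
};
```

The read path is a bounds check followed by direct indexing, which is what makes the vector attractive for a read-heavy workload.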

The issue with std::vector is that its space complexity is O(max dim) rather than O(number of unique dims). For example, if the index holds exactly one sparse vector with exactly one non-zero value at dim=10,000,000, we still have to create a std::vector<LUT> of size 10,000,000, so we can't support arbitrary dims in the uint32 range.
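
The space difference can be illustrated with a small sketch that counts how many LUT slots each layout allocates for a given set of non-zero dims. VectorSlots and MapSlots are hypothetical helper names, and LUT is simplified here to a list of ids.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified, hypothetical LUT: ids of the vectors that are non-zero at a dim.
using LUT = std::vector<std::uint32_t>;

// Slots allocated by the vector-indexed layout: largest dim seen, plus one.
std::size_t VectorSlots(const std::vector<std::uint32_t>& dims) {
    std::vector<LUT> luts;
    for (std::uint32_t d : dims) {
        if (d >= luts.size()) luts.resize(static_cast<std::size_t>(d) + 1);
        assert(d < luts.size());
        luts[d].push_back(0);  // pretend vector id 0 is non-zero at dim d
    }
    return luts.size();
}

// Slots allocated by the map layout: one per distinct dim.
std::size_t MapSlots(const std::vector<std::uint32_t>& dims) {
    std::unordered_map<std::uint32_t, LUT> luts;
    for (std::uint32_t d : dims) luts[d].push_back(0);
    return luts.size();
}
```

For the single non-zero at dim=10,000,000, VectorSlots allocates slots for indices 0 through 10,000,000 while MapSlots allocates one. An empty std::vector is typically 24 bytes on 64-bit implementations, so the vector layout would spend roughly 240 MB on empty LUTs alone in that case.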

I ran a simple benchmark comparing std::vector with std::unordered_map, and it does show a performance difference.

Simple benchmark setup:

  • same machine, built in release mode, using a modified "Test Search" case from test_sparse.cc as the benchmark
  • randomly generated dataset: nb = 100,000, nq = 10,000, dim = 300,000, average number of non-zeros per vector = 30
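
For readers who want to reproduce the comparison in isolation, the following is a rough sketch of a lookup micro-benchmark. It is not the modified "Test Search" case from test_sparse.cc: it times only the LUT lookups, not a full search, and RunLookupBenchmark, BenchResult and the seed are hypothetical names and choices made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

using LUT = std::vector<std::uint32_t>;  // simplified: ids only

struct BenchResult {
    std::size_t hits_vec = 0;  // postings visited through the vector layout
    std::size_t hits_map = 0;  // postings visited through the map layout
    double ms_vec = 0.0;       // wall time of the vector lookups
    double ms_map = 0.0;       // wall time of the map lookups
};

BenchResult RunLookupBenchmark(std::uint32_t nb, std::uint32_t nq,
                               std::uint32_t dim, std::uint32_t nnz,
                               std::uint32_t seed = 42) {
    assert(dim > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::uint32_t> pick_dim(0, dim - 1);

    // Build both layouts from the same random data. A dim may repeat within
    // one vector; both layouts see the same duplicates, so the comparison holds.
    std::vector<LUT> vec_luts(dim);
    std::unordered_map<std::uint32_t, LUT> map_luts;
    for (std::uint32_t id = 0; id < nb; ++id) {
        for (std::uint32_t k = 0; k < nnz; ++k) {
            const std::uint32_t d = pick_dim(rng);
            vec_luts[d].push_back(id);
            map_luts[d].push_back(id);
        }
    }

    // Each query contributes nnz dims to look up.
    std::vector<std::uint32_t> query_dims(static_cast<std::size_t>(nq) * nnz);
    for (auto& d : query_dims) d = pick_dim(rng);

    using Clock = std::chrono::steady_clock;
    BenchResult r;
    const auto t0 = Clock::now();
    for (std::uint32_t d : query_dims) r.hits_vec += vec_luts[d].size();
    const auto t1 = Clock::now();
    for (std::uint32_t d : query_dims) {
        const auto it = map_luts.find(d);
        if (it != map_luts.end()) r.hits_map += it->second.size();
    }
    const auto t2 = Clock::now();

    r.ms_vec = std::chrono::duration<double, std::milli>(t1 - t0).count();
    r.ms_map = std::chrono::duration<double, std::milli>(t2 - t1).count();
    return r;
}
```

Calling RunLookupBenchmark(100'000, 10'000, 300'000, 30) mirrors the dataset parameters above. The two hit counts should match, which serves as a sanity check that both layouts hold the same data; the timings depend on the machine and the build flags.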

Results:

[image: benchmark results comparing std::vector and std::unordered_map]

The benchmark is not perfect, but I think it is sufficient.

To support arbitrary dims in the uint32 range, we have to use unordered_map in some form. Here are some ideas:

  1. allow the user to choose between vector and unordered_map (perhaps through a config option such as with_super_large_dim, to avoid being too technical)
  2. use unordered_map only for dims above a user-specified threshold: auto lut = dim > cfg.get("dim_threshold").value_or(1'000'000) ? map_luts_[dim] : vec_luts_[dim];
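
The second idea could look roughly like the sketch below. HybridLuts, GetOrCreate and Find are hypothetical names, LUT is simplified to a list of ids, and the threshold is passed to the constructor rather than read from a config object, since the config API is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using LUT = std::vector<std::uint32_t>;  // simplified: ids only

class HybridLuts {
public:
    explicit HybridLuts(std::uint32_t dim_threshold = 1'000'000)
        : dim_threshold_(dim_threshold) {}

    // Write path: returns the LUT for `dim`, creating it if missing.
    // References into the dense part are invalidated by a later resize.
    LUT& GetOrCreate(std::uint32_t dim) {
        if (dim > dim_threshold_) return map_luts_[dim];
        if (dim >= vec_luts_.size()) {
            vec_luts_.resize(static_cast<std::size_t>(dim) + 1);
        }
        assert(dim < vec_luts_.size());
        return vec_luts_[dim];
    }

    // Read path: nullptr when no vector is non-zero at `dim`.
    const LUT* Find(std::uint32_t dim) const {
        if (dim > dim_threshold_) {
            const auto it = map_luts_.find(dim);
            if (it == map_luts_.end() || it->second.empty()) return nullptr;
            return &it->second;
        }
        if (dim >= vec_luts_.size() || vec_luts_[dim].empty()) return nullptr;
        return &vec_luts_[dim];
    }

    std::size_t DenseSlots() const { return vec_luts_.size(); }
    std::size_t SparseSlots() const { return map_luts_.size(); }

private:
    std::uint32_t dim_threshold_;
    std::vector<LUT> vec_luts_;                       // dims <= threshold
    std::unordered_map<std::uint32_t, LUT> map_luts_; // dims > threshold
};
```

With this split, the dense part is bounded by the threshold, so a stray non-zero at a very large dim costs one map entry instead of a table covering everything below it. The price is one extra comparison on every lookup.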

We decided to use the unordered_map approach first for the recent beta release, to make sure the feature is available.

Closing for now. Revive if necessary.