Sparse inverted index: support vector dim in full uint32 range

Question

zhengbuqian opened this issue 3 months ago · comments

In the sparse inverted index, each dimension has an inverted index lookup table that tracks the ids of the vectors with non-zero values at that dimension. The current implementation uses a std::vector<LUT> inverted_lut_, where inverted_lut_[i] is the lookup table for dim i and inverted_lut_.size() is determined by the max dim. I used a vector instead of an unordered_map because, even though both offer O(1) reads (on average, for the map), the higher constant factor of a map lookup would hurt performance considerably in such a read-heavy, performance-sensitive use case.

The issue with using std::vector is that the space complexity is O(max dim) instead of O(number of unique dims). For example, if the index holds exactly 1 sparse vector with exactly 1 non-zero value at dim=10,000,000, we still have to create a std::vector<LUT> of size 10,000,000. A single non-zero value near the top of the uint32 range would require roughly 4.3 billion LUT slots, so we can't support arbitrary dims in the uint32 range.
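To make the space trade-off concrete, here is a minimal sketch of the two layouts. The names DenseLuts and MapLuts, and the LUT alias (a plain list of vector ids), are hypothetical stand-ins for illustration and are not the actual types in the code base.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the LUT type: the ids of the vectors that have a
// non-zero value at one dimension.
using LUT = std::vector<std::uint32_t>;

// Dense layout: one slot per dimension in [0, max dim], whether used or not.
struct DenseLuts {
    std::vector<LUT> luts;

    void add(std::uint32_t dim, std::uint32_t vec_id) {
        if (dim >= luts.size()) {
            // Grows to cover every dimension up to and including `dim`.
            luts.resize(static_cast<std::size_t>(dim) + 1);
        }
        luts[dim].push_back(vec_id);
    }

    std::size_t slots() const { return luts.size(); }
};

// Map layout: one entry per dimension that actually occurs.
struct MapLuts {
    std::unordered_map<std::uint32_t, LUT> luts;

    void add(std::uint32_t dim, std::uint32_t vec_id) {
        luts[dim].push_back(vec_id);
    }

    std::size_t slots() const { return luts.size(); }
};
```

With a single non-zero value at a large dim, DenseLuts allocates one (mostly empty) slot for every dimension below it, while MapLuts holds exactly one entry. The price of the map layout is a hash computation and an extra indirection on every lookup.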

I ran a simple benchmark comparing std::vector with std::unordered_map, and it does show a performance difference.

Simple benchmark setup:

Same machine, built in release mode, using a modified "Test Search" case in test_sparse.cc as the benchmark
Randomly generated dataset: nb = 100,000, nq = 10,000, dim = 300,000, average number of non-zeros per vector = 30
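For anyone who wants to reproduce a similar dataset, a generator along the following lines may be enough. This is only a sketch and not the generator used in test_sparse.cc: GenSparseDataset and SparseRow are hypothetical names, and it assumes a fixed number of non-zeros per vector (so the average equals nnz) with values drawn uniformly between 0.1 and 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <utility>
#include <vector>

// One sparse vector: (dimension, value) pairs sorted by dimension.
using SparseRow = std::vector<std::pair<std::uint32_t, float>>;

// Generates `rows` sparse vectors over dimensions [0, dim), each with `nnz`
// distinct non-zero dimensions (clamped to `dim`). Assumption: a fixed nnz
// per row rather than a varying count with the same average.
std::vector<SparseRow> GenSparseDataset(std::size_t rows, std::uint32_t dim,
                                        std::size_t nnz, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    nnz = std::min<std::size_t>(nnz, dim);
    std::vector<SparseRow> data(rows);
    if (nnz == 0) {
        return data;  // nothing to fill in; also avoids dim - 1 underflow
    }
    std::uniform_int_distribution<std::uint32_t> pick_dim(0, dim - 1);
    std::uniform_real_distribution<float> pick_val(0.1f, 1.0f);
    for (auto& row : data) {
        // Rejection-sample distinct dimensions; cheap while nnz << dim.
        std::unordered_set<std::uint32_t> used;
        while (used.size() < nnz) {
            used.insert(pick_dim(rng));
        }
        for (std::uint32_t d : used) {
            row.emplace_back(d, pick_val(rng));
        }
        std::sort(row.begin(), row.end());
    }
    return data;
}
```

Calling GenSparseDataset(100000, 300000, 30, seed) would produce a base set shaped like the one described above, and a second call with 10,000 rows and a different seed would produce the queries.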

Results:

The benchmark is not perfect, but I think it is sufficient.

To support arbitrary dims in the uint32 range, we have to utilize unordered_map somehow. Here are some ideas:

Allow the user to choose between vector and unordered_map (perhaps by providing a config option such as with_super_large_dim, to avoid being too technical)
Use unordered_map only for dims above a user-specified threshold: auto& lut = dim > cfg.get("dim_threshold").value_or(1'000'000) ? map_luts_[dim] : vec_luts_[dim];
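The threshold idea could look roughly like the sketch below. HybridLuts, Add, Find and the counters are hypothetical names chosen for illustration, LUT is again assumed to be a plain list of vector ids, and the threshold is passed to the constructor instead of being read from a config object.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the LUT type.
using LUT = std::vector<std::uint32_t>;

// Dims up to the threshold live in a dense vector (fast indexed reads);
// dims above it live in an unordered_map (space proportional to usage).
class HybridLuts {
 public:
    explicit HybridLuts(std::uint32_t dim_threshold = 1'000'000)
        : dim_threshold_(dim_threshold) {}

    void Add(std::uint32_t dim, std::uint32_t vec_id) {
        if (dim > dim_threshold_) {
            map_luts_[dim].push_back(vec_id);
            return;
        }
        if (dim >= vec_luts_.size()) {
            vec_luts_.resize(static_cast<std::size_t>(dim) + 1);
        }
        vec_luts_[dim].push_back(vec_id);
    }

    // Returns nullptr when no vector has a non-zero value at `dim`.
    const LUT* Find(std::uint32_t dim) const {
        if (dim > dim_threshold_) {
            auto it = map_luts_.find(dim);
            return it == map_luts_.end() ? nullptr : &it->second;
        }
        if (dim >= vec_luts_.size() || vec_luts_[dim].empty()) {
            return nullptr;
        }
        return &vec_luts_[dim];
    }

    std::size_t DenseSlots() const { return vec_luts_.size(); }
    std::size_t MapEntries() const { return map_luts_.size(); }

 private:
    std::uint32_t dim_threshold_;
    std::vector<LUT> vec_luts_;
    std::unordered_map<std::uint32_t, LUT> map_luts_;
};
```

The dense part is bounded by the threshold, so the worst-case overhead is known up front, while reads for the common low dims stay on the vector path. Every lookup pays one extra branch on the threshold comparison.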

Buqian Zheng · Answer 1 · Tue Mar 05 2024 15:08:05 GMT+0800 (China Standard Time)

We decided to use the unordered_map approach first for the recent beta release, to make sure the feature is available.

Buqian Zheng · Answer 2 · Wed Mar 06 2024 15:10:25 GMT+0800 (China Standard Time)

Closing for now. Revive if necessary.