Sparse inverted index: support vector dim in full uint32 range
zhengbuqian opened this issue · comments
In the sparse inverted index, each dimension has an inverted index lookup table that tracks the ids of vectors with non-zero values at that dimension. In the current implementation, a std::vector<LUT> inverted_lut_ is used: inverted_lut_[i] is the lookup table for dim i, and inverted_lut_.size() is the max dim. I used vector instead of unordered_map because, even though both offer O(1) reads (average case for unordered_map), the higher constant factor of a map lookup would noticeably hurt performance in such a read-heavy, performance-sensitive use case.
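To make the layout concrete, here is a minimal sketch of a vector-backed inverted index. The `Posting` struct, the `VectorInvertedIndex` class, and its method names are hypothetical and chosen only for illustration; only `LUT` and `inverted_lut_` come from the description above, and the real LUT layout may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical posting entry: which vector, and its value at this dim.
struct Posting {
    uint32_t id;
    float value;
};
using LUT = std::vector<Posting>;

class VectorInvertedIndex {
public:
    // Add one sparse vector given as (dim, value) pairs; zero values are skipped.
    void Add(uint32_t id, const std::vector<std::pair<uint32_t, float>>& row) {
        for (const auto& entry : row) {
            const uint32_t dim = entry.first;
            const float val = entry.second;
            if (val == 0.0f) continue;
            // Grow the table so that inverted_lut_[dim] exists.
            if (dim >= inverted_lut_.size()) {
                inverted_lut_.resize(static_cast<size_t>(dim) + 1);
            }
            inverted_lut_[dim].push_back(Posting{id, val});
        }
    }

    // Direct indexing: one bounds check and one array access.
    // Returns nullptr for dims beyond the end of the table.
    const LUT* Find(uint32_t dim) const {
        return dim < inverted_lut_.size() ? &inverted_lut_[dim] : nullptr;
    }

    // Number of slots allocated, including empty ones.
    size_t NumSlots() const { return inverted_lut_.size(); }

private:
    std::vector<LUT> inverted_lut_;
};
```

The lookup is a plain array access, which is where the low constant factor comes from. The cost is that every dim below the largest one gets a slot, whether or not any vector uses it.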
The issue with using std::vector is that the space complexity is O(max dim) instead of O(number of unique dims). For example, if the index has exactly 1 sparse vector with exactly 1 non-zero value at dim=10,000,000, we still have to create a std::vector<LUT> of size 10,000,000, so we can't support arbitrary dims in the uint32 range.
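The difference in slot counts can be sketched as follows. The helper names `SlotsWithVector` and `SlotsWithMap` are hypothetical, and the LUT is reduced to a plain list of ids for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using LUT = std::vector<uint32_t>;  // simplified: ids only

// Number of LUT slots a vector-backed table allocates for the given dims.
size_t SlotsWithVector(const std::vector<uint32_t>& dims) {
    std::vector<LUT> luts;
    for (uint32_t d : dims) {
        if (d >= luts.size()) luts.resize(static_cast<size_t>(d) + 1);
        luts[d].push_back(0);  // placeholder vector id
    }
    return luts.size();  // grows with the largest dim seen
}

// Number of LUT entries a map-backed table allocates for the same dims.
size_t SlotsWithMap(const std::vector<uint32_t>& dims) {
    std::unordered_map<uint32_t, LUT> luts;
    for (uint32_t d : dims) {
        luts[d].push_back(0);  // placeholder vector id
    }
    return luts.size();  // grows with the number of unique dims
}
```

For a single non-zero value at dim 10,000,000, `SlotsWithVector` reports 10,000,001 slots while `SlotsWithMap` reports 1. Assuming an empty `std::vector` takes 24 bytes, as is typical on 64-bit platforms, that is roughly 240 MB of empty slots, and a dim near the top of the uint32 range would need on the order of 100 GB.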
I did a simple benchmark comparing std::vector vs std::unordered_map, and it does show a performance difference.
Simple benchmark setup:
- same machine, built in release mode, used modified "Test Search" in test_sparse.cc as benchmark case
- randomly generated dataset: nb = 100,000, nq = 10,000, dim = 300,000, vector average number of non-zeros = 30
Results:
The benchmark is not perfect, but I think it is sufficient.
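For readers who want to reproduce the general effect, below is a rough microbenchmark sketch. It is not the benchmark described above (which ran a modified "Test Search" case); it only times raw lookups into a `std::vector` versus a `std::unordered_map` holding the same data. The function `RunLookupBenchmark` and the `BenchResult` struct are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

struct BenchResult {
    double vector_ms;
    double map_ms;
    uint64_t vector_checksum;  // checksums keep the loops from being optimized away
    uint64_t map_checksum;
};

// Times num_lookups random reads against both containers.
// Requires max_dim > 0.
BenchResult RunLookupBenchmark(uint32_t max_dim, size_t num_lookups, uint32_t seed) {
    // Fill both containers with identical (dim -> value) data.
    std::vector<uint32_t> vec_lut(max_dim);
    std::unordered_map<uint32_t, uint32_t> map_lut;
    map_lut.reserve(max_dim);
    for (uint32_t d = 0; d < max_dim; ++d) {
        vec_lut[d] = d * 2654435761u;
        map_lut[d] = vec_lut[d];
    }

    // Pre-generate the query dims so RNG cost is excluded from the timing.
    std::mt19937 rng(seed);
    std::uniform_int_distribution<uint32_t> pick(0, max_dim - 1);
    std::vector<uint32_t> queries(num_lookups);
    for (auto& q : queries) q = pick(rng);

    using Clock = std::chrono::steady_clock;
    BenchResult r{};

    const auto t0 = Clock::now();
    for (uint32_t q : queries) r.vector_checksum += vec_lut[q];
    const auto t1 = Clock::now();
    for (uint32_t q : queries) r.map_checksum += map_lut.find(q)->second;
    const auto t2 = Clock::now();

    r.vector_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    r.map_ms = std::chrono::duration<double, std::milli>(t2 - t1).count();
    return r;
}
```

Calling something like `RunLookupBenchmark(300000, 10000000, 42)` in a release build and comparing `vector_ms` with `map_ms` gives a rough sense of the per-lookup gap. Actual numbers depend on the machine, compiler, and hash implementation.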
In order to support arbitrary dims in the uint32 range, we have to somehow utilize unordered_map. Here are some ideas:
- allow the user to choose between vector and unordered_map (perhaps by providing a config with_super_large_dim to avoid being way too technical)
- use unordered_map only for dims above a user-specified threshold:
auto lut = dim > cfg.get("dim_threshold").value_or(1'000'000) ? map_luts_[dim] : vec_luts_[dim];
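The threshold idea could look roughly like the sketch below. The class name `HybridLuts` and its methods are hypothetical; `map_luts_`, `vec_luts_`, and the 1,000,000 default mirror the one-liner above, and the LUT is again simplified to a list of ids.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using LUT = std::vector<uint32_t>;  // simplified: ids only

class HybridLuts {
public:
    explicit HybridLuts(uint32_t dim_threshold = 1000000)
        : dim_threshold_(dim_threshold) {}

    // Returns the LUT for dim, creating it if needed.
    // Note: a reference into vec_luts_ is invalidated by a later resize.
    LUT& GetOrCreate(uint32_t dim) {
        if (dim > dim_threshold_) return map_luts_[dim];
        if (dim >= vec_luts_.size()) vec_luts_.resize(static_cast<size_t>(dim) + 1);
        return vec_luts_[dim];
    }

    // Read path: direct indexing for small dims, hash lookup for large ones.
    // Returns nullptr if no slot exists for dim.
    const LUT* Find(uint32_t dim) const {
        if (dim > dim_threshold_) {
            const auto it = map_luts_.find(dim);
            return it == map_luts_.end() ? nullptr : &it->second;
        }
        return dim < vec_luts_.size() ? &vec_luts_[dim] : nullptr;
    }

    size_t VectorSlots() const { return vec_luts_.size(); }
    size_t MapEntries() const { return map_luts_.size(); }

private:
    uint32_t dim_threshold_;
    std::vector<LUT> vec_luts_;                  // dims <= threshold
    std::unordered_map<uint32_t, LUT> map_luts_; // dims > threshold
};
```

With this split, the vector never grows past `dim_threshold_ + 1` slots, so memory for the dense part is bounded, while dims below the threshold keep the cheap array lookup. The extra branch on every lookup is the price of the hybrid.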
We decided to first use the unordered_map approach for the recent beta release to ensure the feature is available.
Closing for now. Revive if necessary.