Have max_codes consider only subset entries in IndexIVF search

Question

Have max_codes consider only subset entries in IndexIVF search

wro-ableton opened this issue a year ago · comments

Summary

Hey! Nice work with V1.7.3! I have a feature request.

Is there a way to have the max_codes stop criteria in IndexIVF searches only consider those entries that actually belong to a subset if one is specified? Wit the current implementation, to my understanding, when the number of scanned entries reaches max_codes, the search is stopped. However, for subset searches, this might happen before we actually scanned max_codes entries in the subset as even entries not in the subset count towards this limit.

Specifically, just as a proof of concept, all that would be necessary is to have scan_one_list not return list_size but instead return the number returned by scanner->scan_codes(list_size, codes, ids, simi, idxi, k); a few lines above. See here.

Obviously, that's just a quick hack only for the IVF index. 🙃 I assume in order to not break the current behavior, this would need to be controlled via an additional search parameter for all indices that have the same behavior currently.

Faiss version: 1.7.3 (19f7696)

Running on:

CPU
GPU

Interface:

C++
Python

Matthijs Douze · Answer 1 · Mon Jan 09 2023 15:56:13 GMT+0800 (China Standard Time)

max_codes is intended to limit the number of distance computations performed, not the number of results.

William Rodewald · Answer 2 · Mon Jan 09 2023 17:53:14 GMT+0800 (China Standard Time)

I get that but currently, max_codes even considers entries outside the subset for which no distances are computed. So when we try to keep with this intended use, shouldn't max_codes ignore entries for which we don't compute distances?

But I see why my suggestion wouldn't be sufficient. scan_codes doesn't return the number of visited entries but the number of updates made to the results. So even with my suggestion, we would need to have scan_codes report how many distances were actually computed.

Actually, now looking at the implementation, I think this behaviour is inconsistent between generic IDSelectors and IDSelectorRange.

To my understanding, for IDSelectorRange, max_codes would only consider entries that were visited if they actually belong to that range selector. In this line, we update the list_size to only include entries which belong to the range subset. However, in the same lambda, for any other selector, which would take effect in scan_codes, we do not update list_size to only include subset entries. So for a range selector, we would have max_codes consider only distance computations within the subset, for any other selector, we would have it consider all entries.

Regardless, I'm not arguing that there has to be a change to how max_codes work. But with the added subset search feature, the API basically doesn't provide a efficient way to ensure that all entries in the subset are scanned. The only way to achieve this currently, to my understanding, is to neither restrict max_codes nor nprobe, in which case we would run an exhaustive search that would not stop even after all subset entries were visited.

William Rodewald · Answer 3 · Thu Feb 02 2023 23:01:19 GMT+0800 (China Standard Time)

Hey @mdouze, could you give this ticket another look? I'd really appreciate it! If you think this is not a bug but expected behavior, maybe we could talk about extending the library to provide a way of limiting the number of visited entries in a subset search.