Clarification requested on number of centroids to use when training IVF index for a distributed index

Question

leothomas opened this issue a year ago

Summary

The "Guidelines to choosing an index" page suggests that for an IVF index, the number of centroids to use (K) should be between 4*sqrt(N) and 16*sqrt(N), where N is the size of the dataset.

If we're using a distributed index, should N be the size of the entire dataset, or the size of the subset hosted on each machine/shard?

Platform

N/A

Reproduction instructions

N/A

Matthijs Douze · Answer 1 · Mon Feb 13 2023 11:10:02 GMT+0800 (China Standard Time)

If you shard the dataset over several machines, then the shards act as independent datasets, so the relevant size is that of a single shard.
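As a rough illustration of the per-shard sizing described above, the sketch below computes the suggested range for K from the number of vectors in one shard. The helper name `ivf_centroid_range` and the dataset and shard counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def ivf_centroid_range(shard_size):
    """Return (low, high) for K = 4*sqrt(N) to 16*sqrt(N), with N = vectors in one shard."""
    root = math.sqrt(shard_size)
    return round(4 * root), round(16 * root)


# Hypothetical setup: 100M vectors split evenly over 4 machines.
total_vectors = 100_000_000
num_shards = 4
shard_size = total_vectors // num_shards  # 25,000,000 vectors per shard

low, high = ivf_centroid_range(shard_size)
print(low, high)  # 20000 80000
```

Using the full dataset size instead (N = 100M) would give a range of 40000 to 160000, so the choice of N changes the suggested K by a factor of sqrt(num_shards).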

Leo Thomas · Answer 2 · Wed Feb 15 2023 05:26:28 GMT+0800 (China Standard Time)

Awesome, thank you!

And just to confirm, the index should be trained once using a randomly selected subset of the total dataset and re-used for each shard, rather than training a different index for each shard. Is that correct?

Matthijs Douze · Answer 3 · Wed Feb 15 2023 18:36:01 GMT+0800 (China Standard Time)

Yes. See also https://github.com/facebookresearch/faiss/wiki/Indexes-that-do-not-fit-in-RAM
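A minimal sketch of the train-once, reuse-per-shard workflow confirmed above, assuming the `faiss` Python package and NumPy are installed. The function names, the `IVF{K},Flat` factory string, and the data sizes in `demo` are illustrative assumptions, not something prescribed in this thread.

```python
def train_shared_ivf(sample, n_centroids):
    """Train one IVF index on a random sample; the result is trained but empty."""
    import faiss  # imported lazily; assumes faiss-cpu or faiss-gpu is installed

    dim = sample.shape[1]
    index = faiss.index_factory(dim, f"IVF{n_centroids},Flat")
    index.train(sample)
    return index


def build_shard(trained_index, shard_vectors):
    """Copy the trained (empty) index and add one shard's vectors to the copy."""
    import faiss

    shard_index = faiss.clone_index(trained_index)
    shard_index.add(shard_vectors)
    return shard_index


def demo():
    """Hypothetical data: 40,000 random 16-d vectors split into 4 shards."""
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.random((40_000, 16), dtype=np.float32)

    # Train once, on a random subset drawn from the whole dataset.
    sample = data[rng.choice(len(data), 20_000, replace=False)]
    # K = 4 * sqrt(10,000) = 400, sized from a single shard (10,000 vectors).
    trained = train_shared_ivf(sample, n_centroids=400)

    # Every shard starts from the same trained centroids.
    return [build_shard(trained, part) for part in np.array_split(data, 4)]
```

In a real deployment each shard index would presumably be built on, or written out for, its own machine, for example with `faiss.write_index`. The point of the sketch is only that `train` runs once and each shard starts from a copy of that trained index.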

Leo Thomas · Answer 4 · Mon Feb 20 2023 00:52:36 GMT+0800 (China Standard Time)

Thank you for the clarification!