Clarification requested on number of centroids to use when training IVF index for a distributed index
leothomas opened this issue
Summary
The "Guidelines to choosing an index" page suggests that, for an IVF index, the number of centroids K should be between 4*sqrt(N) and 16*sqrt(N), where N is the size of the dataset.
If we're using a distributed index, should N be the size of the entire dataset, or the size of the subset hosted on each machine/shard?
Platform
N/A
Reproduction instructions
N/A
If you shard the dataset over several machines, the shards act as independent datasets, so the relevant size is that of a single shard.
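To make the arithmetic concrete, here is a minimal sketch that applies the 4*sqrt(N) to 16*sqrt(N) guideline to the size of a single shard. The helper name `ivf_centroid_range` and the dataset and shard counts are made up for illustration; they are not part of any library.

```python
import math


def ivf_centroid_range(shard_size):
    """Return (low, high) bounds for K using K = 4*sqrt(N) .. 16*sqrt(N).

    N is the number of vectors in ONE shard, not the whole dataset.
    Hypothetical helper for illustration only.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    root = math.sqrt(shard_size)
    return round(4 * root), round(16 * root)


# Assumed example numbers: 100M vectors split evenly over 10 shards.
total_vectors = 100_000_000
num_shards = 10
per_shard = total_vectors // num_shards

low, high = ivf_centroid_range(per_shard)
print(f"per-shard N = {per_shard}: K between {low} and {high}")
```

With these assumed numbers, the per-shard range is roughly 12.6k to 50.6k centroids, compared with 40k to 160k if N were taken as the full dataset.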
Awesome, thank you!
And just to confirm: the index should be trained once, using a randomly selected subset of the total dataset, and reused for each shard, rather than training a different index for each shard. Is that correct?
Thank you for the clarification!