facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.

Home Page: https://faiss.ai

Clarification requested on number of centroids to use when training IVF index for a distributed index

leothomas opened this issue

Summary

The "Guidelines to choosing an index" page suggests that, for an IVF index, the number of centroids to use (K) should be:

Where K is 4*sqrt(N) to 16*sqrt(N), with N the size of the dataset.

If we're using a distributed index, should N be the size of the entire dataset, or the size of the subset hosted on each machine/shard?

Platform

N/A

Reproduction instructions

N/A

If you shard the dataset over several machines, then the shards act as independent datasets, so the relevant size is that of a single shard.
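
To make the arithmetic concrete, here is a minimal sketch that applies the 4*sqrt(N) to 16*sqrt(N) guideline to the size of a single shard, as the reply above recommends. The helper name `ivf_centroid_range`, the dataset size, and the shard count are all hypothetical and chosen only for illustration. The sketch uses plain Python and is not part of the Faiss API.

```python
import math


def ivf_centroid_range(n_vectors: int) -> tuple:
    """Return (low, high) bounds for K from the 4*sqrt(N)..16*sqrt(N) guideline.

    Hypothetical helper for illustration; N is the number of vectors
    held by ONE shard, not the size of the whole dataset.
    """
    if n_vectors <= 0:
        raise ValueError("n_vectors must be positive")
    root = math.sqrt(n_vectors)
    return round(4 * root), round(16 * root)


# Assumed example numbers: 100M vectors split evenly over 10 shards.
total_vectors = 100_000_000
num_shards = 10
shard_size = total_vectors // num_shards  # 10M vectors per shard

low, high = ivf_centroid_range(shard_size)
print(f"per-shard N = {shard_size}, suggested K between {low} and {high}")
```

With these assumed numbers, each shard holds 10M vectors, so the guideline gives a K of roughly 12.6k to 50.6k per shard. Applying it to the full 100M vectors would instead give 40k to 160k.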

Awesome, thank you!

And just to confirm: the index should be trained once, using a randomly selected subset of the total dataset, and reused for each shard, rather than training a different index for each shard. Is that correct?

Thank you for the clarification!