rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.

Home Page: https://docs.rapids.ai/api/raft/stable/

[FEA] IVF-Flat optimize loading cluster data for large batch search

tfeher opened this issue · comments

Is your feature request related to a problem? Please describe.
During IVF-Flat search, a query vector is compared to all the vectors in each of its n_probes clusters, giving n_queries * n_probes query-probe pairs per batch. For large batch search, when n_queries * n_probes > n_clusters, some clusters are compared against more than one query vector.

The execution time of IVF-Flat is determined by the time needed to load the cluster data from memory. Currently the query-probe pairs are sorted by query index. To improve memory load time, we can instead sort the query-probe pairs by probe id (cluster label), so that all queries probing the same cluster process it together.

Describe the solution you'd like

Sort the query-probe pairs during fine search for better cache reuse. This is already implemented for IVF-PQ, and the same can be applied for IVF-Flat as well:

auto coresidency = expected_probe_coresidency(index.n_lists(), n_probes, n_queries);
if (coresidency > 1) {
  // Sorting index by cluster number (label).
  // The goal is to increase the L2 cache hit rate to read the vectors
  // of a cluster by processing the cluster at the same time as much as
  // possible.
  index_list_sorted_buf.resize(n_queries_probes, stream);
  auto index_list_buf =
    make_device_mdarray<uint32_t>(handle, mr, make_extents<uint32_t>(n_queries_probes));
  rmm::device_uvector<uint32_t> cluster_labels_out(n_queries_probes, stream, mr);
  auto index_list   = index_list_buf.data_handle();
  index_list_sorted = index_list_sorted_buf.data();

  linalg::map_offset(handle, index_list_buf.view(), identity_op{});

  int begin_bit             = 0;
  int end_bit               = sizeof(uint32_t) * 8;
  size_t cub_workspace_size = 0;
  // First call (nullptr workspace) only computes the required temporary
  // storage size ...
  cub::DeviceRadixSort::SortPairs(nullptr,
                                  cub_workspace_size,
                                  clusters_to_probe,
                                  cluster_labels_out.data(),
                                  index_list,
                                  index_list_sorted,
                                  n_queries_probes,
                                  begin_bit,
                                  end_bit,
                                  stream);
  rmm::device_buffer cub_workspace(cub_workspace_size, stream, mr);
  // ... second call performs the actual sort using the allocated workspace.
  cub::DeviceRadixSort::SortPairs(cub_workspace.data(),
                                  cub_workspace_size,
                                  clusters_to_probe,
                                  cluster_labels_out.data(),
                                  index_list,
                                  index_list_sorted,
                                  n_queries_probes,
                                  begin_bit,
                                  end_bit,
                                  stream);
}

Additional context
In IVF-Flat search we typically probe 0.1-1% of the clusters, so this optimization is expected to help once the batch size is correspondingly large (hundreds or thousands of query vectors). We have a helper utility to calculate the expected number of times a cluster is loaded; this can be used to decide whether to sort the input data or not.

constexpr inline auto expected_probe_coresidency(uint32_t n_clusters,

This issue should be implemented as a follow-up to #2169, because that PR changes a few details in the IVF-Flat fine search.