rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.

Home Page: https://docs.rapids.ai/api/raft/stable/

[FEA] IVF-Flat optimize loading cluster data for large batch search

tfeher opened this issue · comments

Is your feature request related to a problem? Please describe.
During IVF-Flat search, a query vector is compared to all the vectors in each of its n_probes clusters, giving n_queries * n_probes query-probe pairs per batch. For large batch search, when n_queries * n_probes > n_clusters, some clusters are compared against more than one query vector.

The execution time of IVF-Flat is determined by the time needed to load the cluster data from memory. Currently the query-probe pairs are sorted by query index. To improve memory load time, we can instead sort the query-probe pairs by probe id (cluster label), so that all queries probing the same cluster process it together.

Describe the solution you'd like

Sort the query-probe pairs during fine search for better cache reuse. This is already implemented for IVF-PQ, and the same can be applied for IVF-Flat as well:

auto coresidency = expected_probe_coresidency(index.n_lists(), n_probes, n_queries);
if (coresidency > 1) {
  // Sorting index by cluster number (label).
  // The goal is to increase the L2 cache hit rate to read the vectors
  // of a cluster by processing the cluster at the same time as much as
  // possible.
  index_list_sorted_buf.resize(n_queries_probes, stream);
  auto index_list_buf =
    make_device_mdarray<uint32_t>(handle, mr, make_extents<uint32_t>(n_queries_probes));
  rmm::device_uvector<uint32_t> cluster_labels_out(n_queries_probes, stream, mr);
  auto index_list   = index_list_buf.data_handle();
  index_list_sorted = index_list_sorted_buf.data();

  linalg::map_offset(handle, index_list_buf.view(), identity_op{});

  int begin_bit             = 0;
  int end_bit               = sizeof(uint32_t) * 8;
  size_t cub_workspace_size = 0;
  // First call (nullptr workspace) only computes the required temporary
  // storage size ...
  cub::DeviceRadixSort::SortPairs(nullptr,
                                  cub_workspace_size,
                                  clusters_to_probe,
                                  cluster_labels_out.data(),
                                  index_list,
                                  index_list_sorted,
                                  n_queries_probes,
                                  begin_bit,
                                  end_bit,
                                  stream);
  rmm::device_buffer cub_workspace(cub_workspace_size, stream, mr);
  // ... second call performs the actual sort using the allocated workspace.
  cub::DeviceRadixSort::SortPairs(cub_workspace.data(),
                                  cub_workspace_size,
                                  clusters_to_probe,
                                  cluster_labels_out.data(),
                                  index_list,
                                  index_list_sorted,
                                  n_queries_probes,
                                  begin_bit,
                                  end_bit,
                                  stream);
}

Additional context
In IVF-Flat search we typically probe 0.1-1% of the clusters, so this optimization is expected to help once the batch size is correspondingly large (hundreds or thousands of query vectors). We have a helper utility to calculate the expected number of times a cluster is loaded; this can be used to decide whether to sort the input data or not.

constexpr inline auto expected_probe_coresidency(uint32_t n_clusters,

This issue should be implemented as a follow-up to #2169, because that PR changes a few details in the IVF-Flat fine search.