NVIDIA / cuCollections

[ENHANCEMENT]: Cooperative group for insert_and_find

hanzhi713 opened this issue

Is your feature request related to a problem? Please describe.

The newly added device API insert_and_find doesn't use a cooperative group. Is it unsafe to use a cooperative group for insert_and_find, or is it just not yet implemented?

Describe the solution you'd like

Implement a version of insert_and_find that uses a cooperative group.

Describe alternatives you've considered

Just use the current version of insert_and_find.

Additional context

It may be good for performance, especially once the host API for insert_and_find is available.

Thanks for catching this!

This would be nice to have for the sake of API consistency.
I don't think this would introduce any unforeseen synchronization problems as long as the cooperative group is synchronized properly when returning the result.
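
For context, the existing device APIs of the legacy static_map already take a cooperative group as their first argument (e.g. view.insert(g, pair) and view.find(g, key)), so a consistent overload might look like the sketch below. This is only a hypothetical illustration of the requested API; the cg_size parameter, the tile, and the insert_and_find overload shown here are assumptions, not part of cuco today.

#include <cooperative_groups.h>
#include <cuco/static_map.cuh>
namespace cg = cooperative_groups;

template <uint32_t cg_size, typename viewT, typename pairT>
__device__ void cg_insert_and_find_example(viewT view, pairT insert_pair)
{
  auto const tile = cg::tiled_partition<cg_size>(cg::this_thread_block());

  // Existing CG overloads in the device view look like:
  //   view.insert(tile, insert_pair);
  //   view.find(tile, insert_pair.first);
  // A consistent (hypothetical) insert_and_find overload might be:
  auto const result = view.insert_and_find(tile, insert_pair);

  // As long as every thread of the tile receives the same
  // (iterator, inserted) pair before the group diverges, no extra
  // synchronization beyond the group itself should be needed.
}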

Is this functionality critical for your application?

I tested the current version of insert_and_find with a host bulk API I implemented myself, and it works well for my application. The coop group version is definitely something nice to have, given the potential performance boost at higher load factors.

I'm curious, what's your use case for insert_and_find in a bulk API? Are the iterators returned from insert_and_find in a bulk API useful?

Sorry for the lack of clarity about the host-side API.

What I'm doing is not returning the iterators to the host. Instead, I need to modify the value right after inserting the key into the map, and then return the final list of values. More specifically, I need to allocate a unique, incrementing index for each newly inserted key, or return the existing index if the key is already present. I wrote the kernel below, which is wrapped in a host function for this purpose.

#include <cuco/static_map.cuh>
#include <cuda/std/atomic>
#include <thrust/functional.h>

template <std::size_t block_size,
          typename Key,
          typename Value,
          typename InputIt,
          typename OutputIt,
          typename atomicT,
          typename viewT,
          typename Hash     = cuco::detail::MurmurHash3_32<Key>,
          typename KeyEqual = thrust::equal_to<Key>>
__global__ void insert_and_find(InputIt first,
                                InputIt last,
                                OutputIt output_first,
                                atomicT* start_index,
                                viewT view,
                                Hash hash          = Hash{},
                                KeyEqual key_equal = KeyEqual{})
{
  auto tid    = block_size * blockIdx.x + threadIdx.x;
  auto it     = first + tid;
  auto out_it = output_first + tid;
  // The empty-key sentinel doubles as a "value not yet written" marker;
  // this assumes Value can represent the sentinel and no real index equals it.
  auto empty_key = view.get_empty_key_sentinel();

  while (it < last) {
    // TODO(maybe?): a slightly more efficient approach would increment the atomic
    // counter inside the device function insert_and_find itself, so we don't waste
    // the effort of first inserting a sentinel value.
    typename viewT::value_type const insert_pair{*it, empty_key};
    auto result = view.insert_and_find(insert_pair, hash, key_equal);
    auto slot   = result.first;
    Value idx;
    if (result.second) {
      // This thread inserted the key: claim the next index and publish it.
      idx = start_index->fetch_add(1, cuda::std::memory_order_relaxed);
      slot->second.store(idx, cuda::std::memory_order_relaxed);
    } else {
      // The key was inserted by another thread, which may not yet have
      // published the real index, so spin until the sentinel is overwritten.
      // This wait is necessary: insert_and_find can return before the
      // inserting thread's store to slot->second has completed.
      while (cuco::detail::bitwise_compare(slot->second.load(cuda::std::memory_order_relaxed),
                                           empty_key))
        ;
      idx = slot->second.load(cuda::std::memory_order_relaxed);
    }

    // Write directly to the output. The original shared-memory staging buffer
    // was a no-op here (each thread read back its own element), and its
    // __syncthreads() inside this grid-stride loop could deadlock when threads
    // of the same block make different numbers of iterations.
    *out_it = idx;

    it     += gridDim.x * block_size;
    out_it += gridDim.x * block_size;
  }
}
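
The host-side wrapper is then just a bulk launch over the input. Below is a minimal sketch under a few assumptions: the map is a legacy cuco::static_map<Key, Value>, the keys and output indices live in thrust::device_vectors, and the function name, grid sizing, and device-scoped counter setup are all illustrative rather than part of any cuco API.

#include <cuco/static_map.cuh>
#include <cuda/atomic>
#include <thrust/device_vector.h>

template <typename Key, typename Value>
void bulk_insert_and_find(cuco::static_map<Key, Value>& map,
                          thrust::device_vector<Key> const& keys,
                          thrust::device_vector<Value>& indices)
{
  constexpr std::size_t block_size = 128;
  auto const num_keys  = keys.size();
  auto const grid_size = (num_keys + block_size - 1) / block_size;

  // Device-scoped counter handing out unique, incrementing indices.
  // Zero-initialization via cudaMemset is fine for an integral atomic.
  using atomic_t = cuda::atomic<Value, cuda::thread_scope_device>;
  atomic_t* start_index;
  cudaMalloc(&start_index, sizeof(atomic_t));
  cudaMemset(start_index, 0, sizeof(atomic_t));

  insert_and_find<block_size, Key, Value>
    <<<grid_size, block_size>>>(thrust::raw_pointer_cast(keys.data()),
                                thrust::raw_pointer_cast(keys.data()) + num_keys,
                                thrust::raw_pointer_cast(indices.data()),
                                start_index,
                                map.get_device_mutable_view());
  cudaDeviceSynchronize();
  cudaFree(start_index);
}

Raw pointers are passed instead of device_vector iterators because the latter can't be dereferenced inside a kernel; everything else deduces from the kernel's template parameters.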