NVIDIA / cuCollections

[ENHANCEMENT]: Add a callback to the bulk_insert functions

esoha-nvidia opened this issue · comments

Is your feature request related to a problem? Please describe.

I would like to be able to run a function on every key that is inserted by the bulk insert functions.

Describe the solution you'd like

struct MyFunctor {
  // called once per input key with the target slot and whether the key was new
  __device__ void operator()(slot_type* slot, bool is_unique) {
    ...
  }
};

my_set.insert(input.begin(), input.end(), MyFunctor{});

Each insert would hand the callback the slot and whether the key was newly inserted or already present (in which case the call was effectively a lookup). I could then do whatever I want with that information. For example, I could use an atomic counter to fill an array with the unique elements, performing a "retrieve_all" during the insert without reading the table twice.

Describe alternatives you've considered

I could copy and paste all of the bulk insert code into my own project and use that. The problem is that I trust cuco to get the grid shape and the other launch details right, and if I reimplement it myself I might get them wrong. Also, if cuco later improves the bulk insert, I would pick up those improvements automatically.

Additional context

No response

Hm, I'm not against this, but this kind of custom operation seems like what the device-side API is intended for.

I'd like to avoid duplicating the work that cuco has already done. So if I can use the bulk operations, all the better!

I totally get your point about avoiding writing a kernel that is already 90% existent in cuco.

It's really about API flexibility:

  • Our host bulk APIs aren't very flexible, but they solve a simple job fast in a single line of code.
  • The device APIs give you full control (they even allow kernel fusion, as we did in our GROUPBY work), but they require a considerable amount of effort on the developer's side to get right.

The solution you are proposing is somewhere in between:
You get the benefit of the easy-to-use host API, plus some additional bells and whistles which otherwise only the device API could give you.

So, when a developer plans to use cuco for a particularly complex task, their thought process looks like this:

  1. Does the host API suffice to solve my problem? -> No.
  2. Does a custom functor work? -> Maybe; requires some investigation.
  3. Does the device API solve my problem? -> Most likely yes, but it also takes some time to get right.

I'm not convinced 2. is helpful when we can just move to 3. directly and be able to optimize the hell out of the custom kernel.

However, we could solve your particular problem with a different approach:
Let's say we provide a bulk API for insert_and_find, returning the tuples {slot_reference, is_new}.
The output tuples will be written to some output iterator.
Instead of naively writing the output to some memory location directly, you could employ one of Thrust's fancy iterators, namely thrust::transform_[input_]output_iterator, which takes your custom functor and applies it to each output pair before it is stored.

So you can achieve the same thing that you asked for by using the bulk API with Thrust fancy iterators.

@PointKernel @jrhemstad I have one minor concern about my proposal:
We use a shmem bounce buffer to stage the result tuples and then flush them to the output iterator using memcpy_async. Does this work with Thrust fancy iterators?

The output iterator is a nice idea, but there's no way it works with memcpy_async, right? Surely memcpy_async requires the destination to be a real pointer. Worse still, it might compile and appear to work on A100 and earlier architectures but fail on Hopper.

Correction: we use the memcpy_async approach only in the multimap. For static_map/static_set we store the output directly:

*(output_begin + idx) = found == ref.end() ? ref.empty_value_sentinel() : (*found).second;

So using a fancy iterator should work in this scenario.