NVIDIA / cuCollections

Is your feature request related to a problem? Please describe.

cuCollections exposes a set of knobs that allow optimizing a hashing data structure for a specific use case.

For example:

Which probing scheme should I use?
What's the best cooperative group (CG) size?
How does the input data type affect performance?
Can I use particular operations concurrently? How does that impact performance?

The interaction between those choices is also non-trivial.
Finding out which combination works best for an application is a time-consuming task.

Describe the solution you'd like

Write a perf guide. Could be as simple as a Markdown file.

Describe alternatives you've considered

No response

Additional context

No response

We do provide performance guidance in the probing sequence doc, e.g.:

cuCollections/include/cuco/probe_sequences.cuh

Lines 26 to 28 in 4bdf606

    
    * Linear probing is efficient when few collisions are present. Performance hints:
    * - Use linear probing when collisions are rare, e.g. low occupancy or low multiplicity.
    * - `CGSize` = 1 or 2 when hash map is small (10'000'000 or less), 4 or 8 otherwise.
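
The `CGSize` hint quoted above can be written down as a small helper. This is only a sketch of that rule of thumb: `suggest_cg_sizes`, `CgSizeHint`, and `small_map_threshold` are hypothetical names and not part of cuCollections. The quoted comment also does not say whether "small" counts slots or keys, so the helper just takes a size and leaves that interpretation to the caller.

```cpp
#include <cstddef>

// Hypothetical helper type: the two cooperative-group sizes worth trying first.
struct CgSizeHint {
  unsigned smaller;
  unsigned larger;
};

// Threshold taken from the quoted comment ("10'000'000 or less" counts as small).
constexpr std::size_t small_map_threshold = 10'000'000;

// Returns {1, 2} for small maps and {4, 8} otherwise, per the quoted hint.
// Assumption: `map_size` is whatever notion of size the caller uses for the map.
constexpr CgSizeHint suggest_cg_sizes(std::size_t map_size) {
  return map_size <= small_map_threshold ? CgSizeHint{1, 2} : CgSizeHint{4, 8};
}
```

The result is only a starting point for benchmarking; both candidates still need to be measured for the actual workload.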

cuCollections/include/cuco/probe_sequences.cuh

Lines 52 to 55 in 4bdf606

    
    * Default probe sequence for `cuco::static_multimap`. Double hashing shows superior
    * performance when dealing with high multiplicity and/or high occupancy use cases. Performance
    * hints:
    * - `CGSize` = 1 or 2 when hash map is small (10'000'000 or less), 4 or 8 otherwise.
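
To make the occupancy argument in these comments concrete, here is a rough host-side model that counts how many slots an insert inspects under each scheme. It is a scalar sketch, not cuCollections code: it ignores cooperative groups and memory access patterns (which is where linear probing tends to gain when collisions are rare), and `Scheme`, `mix`, and `mean_probes` are hypothetical names. It assumes a prime `capacity` so that every double-hashing step visits all slots.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Scheme { Linear, DoubleHashing };

// 64-bit mixer (splitmix64 finalizer) used here as a stand-in hash function.
inline std::uint64_t mix(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Inserts the distinct keys 0..num_keys-1 into an open-addressing table with
// `capacity` slots and returns the mean number of slots inspected per insert.
// Assumption: `capacity` is prime, otherwise double hashing may cycle forever.
inline double mean_probes(Scheme scheme, std::size_t capacity, std::size_t num_keys) {
  assert(capacity > 1 && num_keys > 0 && num_keys < capacity);
  std::vector<bool> occupied(capacity, false);
  std::uint64_t total_probes = 0;

  for (std::uint64_t key = 0; key < num_keys; ++key) {
    std::size_t slot = mix(key) % capacity;

    // Linear probing always steps by 1; double hashing derives a per-key
    // step in [1, capacity - 1] from a second hash of the key.
    std::size_t step = 1;
    if (scheme == Scheme::DoubleHashing) {
      step = 1 + mix(key ^ 0xd6e8feb86659fd93ULL) % (capacity - 1);
    }

    std::uint64_t probes = 1;
    while (occupied[slot]) {
      slot = (slot + step) % capacity;
      ++probes;
    }
    occupied[slot] = true;
    total_probes += probes;
  }
  return static_cast<double>(total_probes) / static_cast<double>(num_keys);
}
```

Comparing `mean_probes(Scheme::Linear, 10007, 9000)` with `mean_probes(Scheme::DoubleHashing, 10007, 9000)` (about 90% occupancy) should show linear probing inspecting noticeably more slots, while at about 10% occupancy (1000 keys) the two should be nearly identical. The model says nothing about multiplicity or GPU throughput, so it only illustrates the trend, not actual cuCollections performance.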

Having a performance tuning section in the README doesn't seem right.

Right. This would be too much information for a README. I would put it in a separate file and link to it from the README.

[ENHANCEMENT]: Perf guide
