NVIDIA / cuCollections

[BUG]: static_multimap<> insert() not working as expected

PramodShenoy opened this issue · comments

Is this a duplicate?

Type of Bug

Runtime Error

Describe the bug

I have a kernel that tries to insert unique keys and values into the map. When querying the size of the map on the host using get_size(), I observe it is always equal to the block size divided by the cooperative group size.
It seems only one entry is being inserted into the map per cooperative group. The behavior seems to be the same even when there are collisions.

Here is the host code -

#include <cuco/static_multimap.cuh>

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#include <cmath>
#include <cstddef>
#include <iostream>

int main(int argc, char const *argv[])
{
    using Key   = int;
    using Value = int;

    // Empty slots are represented by reserved "sentinel" values. These values should be selected such
    // that they never occur in your input data.
    Key constexpr empty_key_sentinel     = -1;
    Value constexpr empty_value_sentinel = -1;

    // Number of key/value pairs to be inserted
    std::size_t constexpr num_keys = 50000;

    // Compute capacity based on a 50% load factor
    auto constexpr load_factor = 0.5;
    std::size_t const capacity = std::ceil(num_keys / load_factor);


    // Constructs a map with "capacity" slots using -1 and -1 as the empty key/value sentinels.
    cuco::static_multimap<Key, Value> map{capacity,
                                   cuco::sentinel::empty_key{empty_key_sentinel},
                                   cuco::sentinel::empty_value{empty_value_sentinel}};

    auto device_mutable_view = map.get_device_mutable_view();
    insertMap<<<dim3(1,1,1), dim3(128,1,1)>>>(device_mutable_view);
    cudaDeviceSynchronize();
    std::cout << "Map size = " << map.get_size() << std::endl;
    // Query all matches for key 1
    thrust::device_vector<int> inputs(1, 1);
    size_t size = map.count(inputs.begin(),inputs.end());
    std::cout << size << std::endl;
    thrust::device_vector<cuco::pair<int, int>> res(size);
    map.retrieve(inputs.begin(),inputs.end(), res.begin());
    thrust::host_vector<cuco::pair<int, int>> h_res(res);

    for (std::size_t i = 0; i < size; ++i)
        std::cout<< h_res[i].first << " " << h_res[i].second << "\n";

    return 0;
}

How to Reproduce

Example kernel 1 - all unique keys and values

__global__ void insertMap(cuco::static_multimap<int, int>::device_mutable_view dv) {
    unsigned int tid = threadIdx.x;
    auto block_size = cooperative_groups::this_thread_block().size();
    auto g = cooperative_groups::tiled_partition<8U>(cooperative_groups::this_thread_block());
    cuco::pair<int,int> p;
    p.first = tid;
    p.second = tid;
    //printf("inserting key %d value = %d\n", p.first, p.second);
    dv.insert(g,p);
}

Example Kernel 2 - Single key with unique values

__global__ void insertMap(cuco::static_multimap<int, int>::device_mutable_view dv) {
    unsigned int tid = threadIdx.x;
    auto block_size = cooperative_groups::this_thread_block().size();
    auto g = cooperative_groups::tiled_partition<8U>(cooperative_groups::this_thread_block());
    cuco::pair<int,int> p;
    p.first = 1;
    p.second = tid;
    //printf("inserting key %d value = %d\n", p.first, p.second);
    dv.insert(g,p);
}

Example Kernel 3 - Grouped keys with unique values

__global__ void insertMap(cuco::static_multimap<int, int>::device_mutable_view dv) {
    unsigned int tid = threadIdx.x;
    auto block_size = cooperative_groups::this_thread_block().size();
    auto g = cooperative_groups::tiled_partition<8U>(cooperative_groups::this_thread_block());
    cuco::pair<int,int> p;
    p.first = tid%10;
    p.second = tid;
    //printf("inserting key %d value = %d\n", p.first, p.second);
    dv.insert(g,p);
}

In all the above cases, map.get_size() on the host returns 16.

Expected behavior

In all the above cases, the map size is expected to be equal to the number of threads, i.e. 128.

Reproduction link

No response

Operating System

No response

nvidia-smi output

Mon Nov 7 20:01:22 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 6000 Off | 00000000:3B:00.0 Off | Off |
| 31% 24C P0 27W / 260W | 0MiB / 24220MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0

@PramodShenoy Thanks for reporting this.

get_size will not work properly if the device-view insert is invoked directly by users. Currently, it's the user's responsibility to update the size counter when they use the device-view insert.
Sorry, you are talking about the multimap. Let me look into this and respond to you shortly.

@PramodShenoy To insert n keys with the CG algorithm, we need n * CGSize threads. The implementation is designed so that, with CG size = 8, threads 0 through 7 all return at the same time once any thread in that group successfully finishes an insertion. The same applies to {thread 8, ..., thread 15}, {thread 16, ..., thread 23}, etc. A simple way to remember this: 8 threads per hash map entry. As a result, when the total number of threads is 128 and the CG size is 8, you are always inserting 128 / 8 = 16 keys. That's also the reason the loop stride here is (gridDim.x * block_size) / tile_size as opposed to gridDim.x * block_size.

Make sense?
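
For illustration, a minimal sketch (not the library's own code; d_pairs and num_pairs are hypothetical names) of a grid-stride insert kernel that follows this rule, i.e. one pair per CG of 8 threads and a stride equal to the number of CGs in the grid:

// Sketch: each CG of 8 threads cooperates on one insertion per loop iteration.
__global__ void insert_all(cuco::static_multimap<int, int>::device_mutable_view dv,
                           cuco::pair<int, int> const* d_pairs,
                           std::size_t num_pairs)
{
    constexpr unsigned cg_size = 8u;  // must match the map's CG size (8 by default)
    auto g = cooperative_groups::tiled_partition<cg_size>(cooperative_groups::this_thread_block());

    // Global index of this CG and the total number of CGs in the grid.
    std::size_t idx          = (blockIdx.x * blockDim.x + threadIdx.x) / cg_size;
    std::size_t const stride = (std::size_t{gridDim.x} * blockDim.x) / cg_size;

    while (idx < num_pairs) {
        dv.insert(g, d_pairs[idx]);  // all 8 threads of the CG work on this single insertion
        idx += stride;
    }
}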

Thanks @PointKernel for the explanation! This makes sense now. In our case, if we want each thread block to insert 128 keys, we would need 128 * CGSize threads per thread block. So for this case, when should we use a 128-thread block with CGSize = 1, and when should we use a 1024-thread block with CGSize = 8? Is there any workload pattern that you have observed before?

The optimal CG size really depends on your use case. The general guideline is to use a large CG (size 4 or 8) if occupancy (> 50%) or multiplicity is high, and a small one otherwise. CG size = 1 can be beneficial if the occupancy is really low (< 10%).

Yours to discover.

Can you explain what occupancy and multiplicity mean here?

Thanks @PointKernel !
Can you please confirm if this line is logically correct if we need to override the Cooperative Group size -

cuco::static_multimap<Key, Value, cuda::thread_scope::thread_scope_device,
                      cuco::cuda_allocator<char>,
                      cuco::double_hashing<4, cuco::detail::MurmurHash3_32<Key>, cuco::detail::MurmurHash3_32<Key>>>
  map{capacity,
      cuco::sentinel::empty_key{empty_key_sentinel},
      cuco::sentinel::empty_value{empty_value_sentinel}};

Can you explain what occupancy and multiplicity mean here?

Oh, if the hash map capacity is 100 and you insert 60 keys, the occupancy is 60%. If the 60 keys are a sequence from 0 to 59 (all unique), the multiplicity is 1. But if they are all the same key, the multiplicity is 60.

Thanks @PointKernel ! Can you please confirm if this line is logically correct if we need to override the Cooperative Group size -

cuco::static_multimap<Key, Value, cuda::thread_scope::thread_scope_device,
                      cuco::cuda_allocator<char>,
                      cuco::double_hashing<4, cuco::detail::MurmurHash3_32<Key>, cuco::detail::MurmurHash3_32<Key>>>
  map{capacity,
      cuco::sentinel::empty_key{empty_key_sentinel},
      cuco::sentinel::empty_value{empty_value_sentinel}};

This should work just fine. Note that we need to use the same CG size consistently for all operations of a hash map. Using cg_size = 2 for insert and cg_size = 4 for retrieve will definitely crash at runtime.
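
For reference, a minimal sketch (with cg_size = 4 as in the declaration above) of keeping the CG size consistent by fixing it once in a single map type alias and using that type for every operation:

// Fix the probing CG size once in the probe sequence; reuse map_t for insert, count and retrieve.
using probe_t = cuco::double_hashing<4,  // CG size = 4
                                     cuco::detail::MurmurHash3_32<Key>,
                                     cuco::detail::MurmurHash3_32<Key>>;
using map_t   = cuco::static_multimap<Key, Value, cuda::thread_scope_device,
                                      cuco::cuda_allocator<char>, probe_t>;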

Understood.
Thanks a lot!

Hello @PointKernel
Sorry to continue on this closed thread but I think it is related to the use case.
What is the expected behavior of the multimap if static_multimap::device_mutable_view::insert() is called in a loop inside the kernel, as is done for block-wide functions?

My observation is that duplicate values are being inserted. Is there a defined behavior for this case?

My observation is that duplicate values are being inserted. Is there a defined behavior for this case?

That's correct. All key-value pairs will be inserted into the multimap as long as it still has empty slots. If you expect only unique key-value pairs to be present in the map, cuco::static_map is what you are looking for.

However, the same key-value pair should not be inserted twice, right?
I am seeing the same key-value pair inserted twice, and they are the only entries in the cuco::static_multimap<>.

However, the same key-value pair should not be inserted twice, right? I am seeing the same key-value pair inserted twice, and they are the only entries in the cuco::static_multimap<>.

If you call device_mutable_view::insert twice with the same key/value, then the key/value pair will appear twice in the multimap.

This is the important difference between the static_map and static_multimap.

https://godbolt.org/z/esqYPK98T

Take std::unordered_multimap for example: multimaps in general don't guarantee uniqueness.
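
For illustration, a minimal host-side sketch of that difference, assuming the same sentinel-style constructors used earlier in this thread (the capacity of 1000 is arbitrary):

// Inserting the same key/value pair twice: one entry ends up in a static_map,
// two entries end up in a static_multimap.
thrust::device_vector<cuco::pair<int, int>> pairs(2, cuco::pair<int, int>{13, 42});

cuco::static_map<int, int> unique_map{
    1000, cuco::sentinel::empty_key{-1}, cuco::sentinel::empty_value{-1}};
cuco::static_multimap<int, int> multi_map{
    1000, cuco::sentinel::empty_key{-1}, cuco::sentinel::empty_value{-1}};

unique_map.insert(pairs.begin(), pairs.end());
multi_map.insert(pairs.begin(), pairs.end());

std::cout << unique_map.get_size() << std::endl;  // expected: 1
std::cout << multi_map.get_size()  << std::endl;  // expected: 2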

Thank you @PointKernel and @jrhemstad, I just realized that. I was under the impression std::multimap<> ensures Key-Value pair uniqueness.

Do you have any recommendations to tackle this problem -

We have some block wide functions where each thread in the block handles some distinct values. How can we ensure that ALL of these values are inserted into the map?
For example, if we have a block of size 128 with each thread handling 4 items, we would want all 512 items to be inserted in the map. Do you have any recommendations for this kind of problem?

we would want all 512 items to be inserted in the map

Are all 512 keys unique or not?

The keys may not be unique, but the key-value pairs that we insert will always be unique. We would not want the same key-value pair to be inserted multiple times either, as that would affect a lot of other things.

It sounds like you want to treat the entire key/value pair as if it were the key. In other words, it sounds like you want a set instead of a map.

In an ideal scenario, we would want to store a key and a list of values associated with it. If we use a set, then we would have to store the key-value pair as a single entity, right?

Hm, I'm confused.

If you want to store multiple values associated with the same key, then you'd want a multimap. But you also want to avoid inserting identical key/value pairs (but not identical keys)?

So if you were to insert the following pairs:

{13, 42}
{13, 1}
{9, 18}
{13, 42}

What would you want the contents of the map to be?

Hi Jake,

This is Bobbi.
Basically, we want to implement a hash join and are trying to use cuCollections for the hashing part. What we have is a pair of join key and row ID of the table. The join key is not guaranteed to be unique; the row ID is guaranteed to be unique. So for example, if we have:

{Join Key, Row ID}: {13, 42}, {13, 1}, {9, 18}, {13, 3}

We would want the contents of the hash table to be:

9  -> 18
13 -> 1, 3, 42

Can we achieve this with cuCollections? If not, what is the recommended approach?

Thanks

Hm, I'm confused.

If you want to store multiple values associated with the same key, then you'd want a multimap. But you also want to avoid inserting identical key/value pairs (but not identical keys)?

So if you were to insert the following pairs:

{13, 42}
{13, 1}
{9, 18}
{13, 42}

What would you want the contents of the map to be?

In this case, we would want only one entry of {13, 42}; otherwise it would mess up other operations.
Also, in the case of Bobbi's example, I think we would be fine if we could insert repeating keys separately, like this:

9  -> 18
13 -> 1
13 -> 3
13 -> 42

Basically, we want to implement a hash join

That would indeed require a multimap.
A simple hash join (unpartitioned inner equi-join) A⋈B with cuco works as follows:

    1. Build a cuco::static_multimap using table A: map.insert(A.begin(), A.end())
    2. Use the keys of table B and retrieve all of their matching values from A: map.retrieve(B.keys.begin(), B.keys.end(), result)
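
A minimal sketch of those two steps (illustrative only; build_pairs and probe_keys are hypothetical device vectors holding table A's {join key, row ID} pairs and table B's join keys, and map is a cuco::static_multimap<int, int>):

// 1. Build the multimap from table A.
map.insert(build_pairs.begin(), build_pairs.end());

// 2. Probe with table B's keys and gather every matching {join key, row ID} pair.
auto const num_matches = map.count(probe_keys.begin(), probe_keys.end());
thrust::device_vector<cuco::pair<int, int>> matches(num_matches);
map.retrieve(probe_keys.begin(), probe_keys.end(), matches.begin());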

I think we would be fine if we could insert repeating keys separately, like this:

9  -> 18
13 -> 1
13 -> 3
13 -> 42

Since cuco doesn't support sets yet, you can use a cuco::static_map, take the pair of join key and row ID as the hash map key, and discard the value part. A custom key-equality functor is required then.
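
A rough sketch of the kind of custom hasher and key-equality functors that suggestion implies (the names and the hash-combining scheme are illustrative assumptions, not the exact cuco API):

// Treat the whole {join key, row ID} pair as the map key.
struct pair_key_equal {
  __device__ bool operator()(cuco::pair<int, int> const& lhs,
                             cuco::pair<int, int> const& rhs) const
  {
    return lhs.first == rhs.first && lhs.second == rhs.second;
  }
};

struct pair_hasher {
  __device__ uint32_t operator()(cuco::pair<int, int> const& p) const
  {
    // Combine the member hashes; any reasonable combiner works here.
    cuco::detail::MurmurHash3_32<int> h{};
    return h(p.first) ^ (h(p.second) * 0x9e3779b9u);
  }
};

These would be passed wherever the static_map operations accept a hasher and a key-equality functor.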

Thanks a lot for the suggestions!

Our system basically uses CUB-like block-wide functions with 128 threads and each thread handling 4 items. This is used for functions like loading a partition of data, applying predicates, etc. In terms of cooperative groups of size 8, we would have 16 groups, meaning 16 insertions into our hash table.
Do you think there's a way we could do 512 insertions into the map without loading the selected portion of the dataset separately into memory?

The current implementation (without cuco) does something like this: each block loads a tile of the dataset into a shared array and applies predicates (eq, neq, gt, etc.). If a row is selected, we can insert it into the hash table for the subsequent join. In case all 512 items per block are selected, this block will perform 512 insertions, which would be only 16/32 via cuco::static_multimap<> if using CG size 8/4 respectively.

What we have is a pair of join key and row ID of the table. The join key is not guaranteed to be unique; the row ID is guaranteed to be unique.

Ah, okay. Now this makes more sense. So this is very similar to how cuDF's hash join works.

We build a hash table where the key is the hash of the row, and the value is the row index. We insert every {row_hash, row_index} pair. There's no need to worry about duplicates because like you said, every row_index is guaranteed to be unique.

Later, in the probing step, we use a special function static_multimap::pair_retrieve that considers the entire pair for equality. It uses a custom equality function that first compares the row_hash and, only if those match, does a full lexicographic comparison of the rows referenced by the row_index values.
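
A rough sketch (purely illustrative; hash_then_row_equal and RowEqual are hypothetical names, with RowEqual standing for whatever full-row comparator the caller provides) of the kind of pair-equality functor described above:

// Probe-side equality: cheap hash comparison first, full row comparison only on a hash match.
template <typename RowEqual>
struct hash_then_row_equal {
  RowEqual row_equal;  // compares two rows given their row indices

  __device__ bool operator()(cuco::pair<uint32_t, int> const& probe,
                             cuco::pair<uint32_t, int> const& contained) const
  {
    if (probe.first != contained.first) { return false; }  // row_hash mismatch
    return row_equal(probe.second, contained.second);      // full lexicographic row comparison
  }
};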

Hi @jrhemstad @PointKernel ,

So it seems that it's possible to achieve this with cuCollection. I have a few questions as I try to understand the design.

  1. So in the multimap, it seems that we have (by default) 8 threads in a group hashing the same key into the hash table, and once one of these 8 threads manages to insert, the other 7 threads return. Yunsong mentioned that this improves performance compared to having each thread hash a different key into the hash table. Can someone explain to me why this improves performance?
  2. I'm trying to understand Jake's example here. What is row_hash in this case? Is it the join key or something else? And why do we need to match the entire pair for equality in a hash join? Shouldn't we only match the join key?
  1. Check out @sleeepyjack 's paper on cooperative insertion/probing: https://arxiv.org/pdf/2009.07914.pdf
  2. row_hash is the hash value of the row. The strategy I described is necessary to support a join criteria on multiple columns. If you're joining on multiple columns, then you can't fit the entire row as the key of the hash table. If you're just joining on a single column of ints, then you can avoid this complexity.

Hi @PointKernel @jrhemstad

I was trying to use the device_view<> version of retrieve() and need some help. Is there an example usage of the API that I can refer to? It seems we need to use two cooperative groups, one for probing and another for flushing to the output buffer; is this the right understanding?
Some more questions from the declaration in the header file -

  • What is the flushing_cg_counter and what type are we expecting?
  • Should num_matches be of std::atomic<> type?
  • There are 2 output parameters - output_buffer and output_begin. How are they different and is it necessary for them to be declared in shared memory?

Also, will the retrieve() populate 1 item per CG similar to insert?

Thanks a lot for your help!

Is there an example usage of the API that I can refer to?

You could refer to the retrieve kernel for the use of device_view::retrieve. The retrieve kernel is invoked here.

It seems we need to use two cooperative groups, one for probing and another for flushing to the output buffer

Depending on the size of the key/value pairs:

  • If they are no bigger than 8 bytes, the implementation uses a per-warp shared memory buffer to write the output, i.e. the flushing CG (a full warp, size 32) is always larger than the probing CG (which could have size 2, 4, 8, etc.), so multiple probing CGs write into the same warp-wide buffer.
  • Otherwise, the probing CG size and the flushing CG size are the same, and each CG writes into its own buffer.

We introduced this complexity mainly for performance reasons, since performance for small pairs is more likely bound by the flushing operation (writing the buffer, updating atomics, etc.). Thus we do a per-warp flush as opposed to a per-CG one to reduce the number of flushing operations and improve performance.

  • What is the flushing_cg_counter and what type are we expecting?
  • Should num_matches be of std::atomic<> type?
  • There are 2 output parameters - output_buffer and output_begin. How are they different and is it necessary for them to be declared in shared memory?

Please check the retrieve kernel implementation

Also, will the retrieve() populate 1 item per CG similar to insert?

Yes

Hi @PointKernel

I am able to call the retrieve kernel externally using the appropriate parameters, similar to the retrieve() function that you pointed to. However, I want to call it from within a device function, which I believe is the device_view::retrieve() function. It seems the appropriate device function is here, which is being called in this kernel.

Here is how I am trying to use it, which gives a compiler error.

__global__ void check(cuco::static_multimap<int, int, cuda::thread_scope_device, cuco::cuda_allocator<char>, cuco::double_hashing<4U, cuco::detail::MurmurHash3_32<int>, cuco::detail::MurmurHash3_32<int>>>::device_view dv, 
    thrust::detail::normal_iterator<thrust::device_ptr<int>> first,
    thrust::detail::normal_iterator<thrust::device_ptr<int>> last,
    thrust::detail::normal_iterator<thrust::device_ptr<int>> output,
    std::unique_ptr<atomic_ctr_type, counter_deleter>& counter
    ) {
    auto flushing_cg = tiled_partition<32U>(this_thread_block());
    auto probing_cg  = tiled_partition<4U>(this_thread_block());
    __shared__ int output_buffer[4][100];
    __shared__ uint32_t flushing_cg_counter[4];

   dv.retrieve<100>(flushing_cg, probing_cg, 10, &flushing_cg_counter[0], output_buffer[0], counter.get(), output, thrust::equal_to<int>());
}

The compiler error is -

error: no instance of function template "cuco::static_multimap<Key, Value, Scope, Allocator, ProbeSequence>::device_view::retrieve [with Key=int, Value=int, Scope=cuda::std::__4::__detail::thread_scope_device, Allocator=cuco::cuda_allocator<char>, ProbeSequence=cuco::double_hashing<4U, cuco::detail::MurmurHash3_32<int>, cuco::detail::MurmurHash3_32<int>>]" matches the argument list
            argument types are: (cooperative_groups::__v1::thread_block_tile<32U, cooperative_groups::__v1::thread_block>, cooperative_groups::__v1::thread_block_tile<4U, cooperative_groups::__v1::thread_block>, int, uint32_t *, int [100], atomic_ctr_type *, thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::equal_to<int>)
            object type is: cuco::static_multimap<int, int, cuda::std::__4::__detail::thread_scope_device, cuco::cuda_allocator<char>, cuco::double_hashing<4U, cuco::detail::MurmurHash3_32<int>, cuco::detail::MurmurHash3_32<int>>>::device_view

Maybe unrelated: functionalities in the detail namespace are meant to be hidden from public access, so any time you start using them in your code, e.g. thrust::detail::normal_iterator ..., something may go wrong.

Can you reproduce the error in godbolt (https://godbolt.org/z/7Ybhjsqxq) and reshare the link here?

I have reproduced the error in godbolt here without using the thrust::detail iterators.

I have reproduced the error in godbolt here without using the thrust::detail iterators.

line 59: output buffer of view::value_type as opposed to int.

line 59: output buffer of view::value_type as opposed to int.

I made some changes and it compiles correctly now. However, I still don't understand why the shared memory type should be cuco::pair<int,int> instead of int only. Isn't the view::value_type actually int?

Also, does a device function that intends to use device_view::retrieve() need to do something similar to the retrieve kernel like calling flush, etc.?

In this Godbolt link, the counter value is not updated after the kernel execution is done. The output value is also not what is expected.

Is there a way to estimate the buffer size that needs to be passed/reserved for retrieve()?

I still don't understand why the shared memory type should be cuco::pair<int,int> instead of int only.

Per https://github.com/NVIDIA/cuCollections/blob/dev/include/cuco/static_multimap.cuh#L417-L418, multimap retrieve returns the key and all its matched values as pairs {key, match1}, {key, match2}, ... for a given key, so a single int is not sufficient. For example, retrieving key 13 when it has matches 1 and 42 writes the pairs {13, 1} and {13, 42} to the output, which is why the buffer element type must be the pair type.

Isn't the view::value_type actually int?

https://github.com/NVIDIA/cuCollections/blob/dev/include/cuco/static_multimap.cuh#L152

Also, does a device function that intends to use device_view::retrieve() need to do something similar to the retrieve kernel like calling flush, etc.?

Right: since you are writing the kernel on your own, it's your responsibility to flush the buffers.

Closing this since the problem seems to be solved.