NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Problem with cub::BlockExchange

jjxyai opened this issue · comments

I am trying to do a matrix transpose with cub::BlockExchange and save it in place, but only some of the items come out exchanged. The code and output are as follows:

```cpp
#include <cub/cub.cuh>

template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void BlockExchangeKernel(T *d_in)
{
    typedef cub::BlockExchange<T, BLOCK_THREADS, ITEMS_PER_THREAD> BlockExchange;
    __shared__ typename BlockExchange::TempStorage temp_storage;
    T thread_data[ITEMS_PER_THREAD];

    cub::LoadDirectStriped<BLOCK_THREADS>(threadIdx.x, d_in, thread_data);
    BlockExchange(temp_storage).StripedToBlocked(thread_data, thread_data);
    cub::StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_in, thread_data);

    //cub::LoadDirectBlocked(threadIdx.x, d_in, thread_data);
    //BlockExchange(temp_storage).BlockedToStriped(thread_data, thread_data);
    //cub::StoreDirectBlocked(threadIdx.x, d_in, thread_data);
}
```

```text
matrix before:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

matrix exchanged:
0 32 64 96
1 33 65 97
2 34 66 98
3 35 67 99
4 36 68 100
5 37 69 101
6 38 70 102
7 39 71 103 // ok for the first 32 items. but not expected for the rest.
32 33 34 35
36 37 38 39
40 41 42 43
44 45 46 47
48 49 50 51
52 53 54 55
56 57 58 59
60 61 62 63
64 65 66 67
68 69 70 71
72 73 74 75
76 77 78 79
80 81 82 83
84 85 86 87
88 89 90 91
92 93 94 95
96 97 98 99
100 101 102 103
104 105 106 107
108 109 110 111
112 113 114 115
116 117 118 119
120 121 122 123
124 125 126 127
```
I tried to find the cause but failed. Could anyone help me? Thanks!!

Hello @jjxyai!

I believe that block exchange behaves as expected here. Let's take a look at the first thread for instance:

[0, 32, 64, 96] // after striped load
[0,  1,  2,  3]    // after exchange

The issue should be related to the way you put that data back into memory. StoreDirectStriped writes to threadIdx.x + ITEM * BLOCK_THREADS, which is [0, 32, 64, 96] for BLOCK_THREADS = 32. Instead, you'd like the stride to be ITEMS_PER_THREAD. You can try something along the following lines to achieve that:

cub::StoreDirectStriped<ITEMS_PER_THREAD>(threadIdx.x * ITEMS_PER_THREAD * ITEMS_PER_THREAD, d_in, thread_data);
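
For reference, and as a rough sketch rather than a quote of the library's implementation, cub::StoreDirectStriped<STRIDE>(linear_tid, ptr, items) boils down to the loop below, so the template parameter is really just the stride between a thread's consecutive writes:

```cpp
// Roughly what cub::StoreDirectStriped<STRIDE>(linear_tid, ptr, items) does:
// start at ptr[linear_tid], then advance by STRIDE between a thread's items.
#pragma unroll
for (int item = 0; item < ITEMS_PER_THREAD; ++item)
    ptr[linear_tid + item * STRIDE] = items[item];
```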

I tried changing the stride and got a different but still unexpected result, with the same problem: only the first 32 items are modified. Actually, I'm a little confused by
cub::StoreDirectStriped<ITEMS_PER_THREAD>(threadIdx.x * ITEMS_PER_THREAD * ITEMS_PER_THREAD, d_in, thread_data).
In my understanding,

cub::LoadDirectBlocked(threadIdx.x, d_in, thread_data);
__syncthreads();
cub::StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_in, thread_data);

should work without the use of BlockExchange, but still, it returns the same result as

  cub::LoadDirectStriped<BLOCK_THREADS>(threadIdx.x, d_in, thread_data);
  BlockExchange(temp_storage).StripedToBlocked(thread_data, thread_data);
  cub::StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_in, thread_data);

I found "An Efficient Matrix Transpose in CUDA C/C++"; it provides a different approach and works fine, but I still haven't figured out the problem with cub.

The issue is not related to loading/exchanging the data:

cub::LoadDirectBlocked(threadIdx.x, d_in, thread_data);
__syncthreads();

is functionally equivalent to

cub::LoadDirectStriped<BLOCK_THREADS>(threadIdx.x, d_in, thread_data);
BlockExchange(temp_storage).StripedToBlocked(thread_data, thread_data);
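
Both variants leave thread t holding d_in[t * ITEMS_PER_THREAD + i] in item i. Sketched roughly (names as in the kernel above):

```cpp
// What LoadDirectBlocked does, per thread:
for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    thread_data[i] = d_in[threadIdx.x * ITEMS_PER_THREAD + i];

// LoadDirectStriped<BLOCK_THREADS> instead reads d_in[threadIdx.x + i * BLOCK_THREADS]
// into item i, and StripedToBlocked then shuffles through shared memory so that item i
// again ends up holding d_in[threadIdx.x * ITEMS_PER_THREAD + i].
```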

As I noted above, the issue is in the way you write results. cub::StoreDirectStriped<BLOCK_THREADS>(threadIdx.x) means that each thread will first offset to threadIdx.x and then write each element with a step that's equal to BLOCK_THREADS.

In other words, for BLOCK_THREADS=32 and ITEMS_PER_THREAD=4, we have:

  • first thread writing to 0 + 0 * 32 = 0, 0 + 1 * 32 = 32, ..., 0 + 3 * 32 = 96
  • second thread writing to 1 + 0 * 32 = 1, 1 + 1 * 32 = 33, ..., 1 + 3 * 32 = 97

Instead, you want the result to be written as:

  • first thread: 0 * 4 * 4 + 0 * 4 = 0, 0 * 4 * 4 + 1 * 4 = 4, ..., 0 * 4 * 4 + 3 * 4 = 12
  • second thread: 1 * 4 * 4 + 0 * 4 = 16, 1 * 4 * 4 + 1 * 4 = 20, ..., 1 * 4 * 4 + 3 * 4 = 28

In other words, the initial offset is threadIdx.x * ITEMS_PER_THREAD * ITEMS_PER_THREAD and stride is ITEMS_PER_THREAD.
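
Spelled out, that write pattern corresponds to roughly the following loop (this is just the indexing above transcribed, not tested code):

```cpp
// Write pattern described above, for BLOCK_THREADS = 32 and ITEMS_PER_THREAD = 4:
// thread 0 -> 0, 4, 8, 12; thread 1 -> 16, 20, 24, 28; and so on.
#pragma unroll
for (int item = 0; item < ITEMS_PER_THREAD; ++item)
    d_in[threadIdx.x * ITEMS_PER_THREAD * ITEMS_PER_THREAD + item * ITEMS_PER_THREAD] = thread_data[item];
```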

Note that this will only give you a functionally correct result: since the write is not coalesced, performance is not as good as it could be. To achieve better performance, you'd actually need the write to look like cub::StoreDirectStriped<BLOCK_THREADS>(threadIdx.x).

Also, your original version works if you swap items per thread and block threads:

constexpr int block_threads = 4;
constexpr int items_per_thread = 32;

cub::LoadDirectStriped<block_threads>(threadIdx.x, values, thread_data);
block_exchange(temp_storage).StripedToBlocked(thread_data, thread_data);
cub::StoreDirectStriped<block_threads>(threadIdx.x, values, thread_data);
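
For completeness, here is a minimal host-side sketch of that swapped configuration. It assumes the BlockExchangeKernel template from the first comment is defined in the same file; the driver itself is only meant to illustrate the launch:

```cpp
// Minimal host driver (sketch): one block of 4 threads, 32 items per thread,
// transposing the 4x32 matrix from the original post in place.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    constexpr int block_threads    = 4;
    constexpr int items_per_thread = 32;
    constexpr int n = block_threads * items_per_thread;  // 128 elements

    int h[n];
    for (int i = 0; i < n; ++i)
        h[i] = i;

    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // Launch configuration must match the kernel's BLOCK_THREADS template parameter.
    BlockExchangeKernel<int, block_threads, items_per_thread><<<1, block_threads>>>(d);

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    // Print the result as the transposed 32x4 matrix.
    for (int row = 0; row < 32; ++row)
    {
        for (int col = 0; col < 4; ++col)
            printf("%d ", h[row * 4 + col]);
        printf("\n");
    }
    return 0;
}
```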