NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Register-only based `WarpExchange`

pb-dseifert opened this issue · comments

Hi CUB maintainers!

In my GPU work, I have found that many algorithms require switching between blocked and striped arrangements, either to maximise ILP (blocked arrangement) or to maximise memory bandwidth (striped arrangement). The current CUB algorithm for this is `WarpExchange`, which to my knowledge always goes through shared memory. Using some unrolling tricks and careful recursive decomposition, I have managed to write a generic implementation that accomplishes the same using only registers and warp shuffles. This might be useful in situations where a kernel already consumes a lot of shared memory but register pressure is still low enough to afford warp shuffles.

Would you be willing to entertain a PR for my implementation?

Hello @pb-dseifert, and thank you for reaching out and showing interest in contributing to our project!

We are definitely interested in this direction. We were going to research the applicability of an in-register implementation of warp exchange described in "A Decomposition for In-place Matrix Transposition" by Bryan Catanzaro, Alexander Keller, and Michael Garland. The corresponding implementation can be found here. It'd be interesting to compare your approach with the one described above. I also think it would make sense to add a template parameter to `WarpExchange` that lets users opt in to the new implementation.

> Would you be willing to entertain a PR for my implementation?

We would certainly be open to reviewing your PR. Please go ahead and submit it. Also, please add a brief description illustrating the key differences from the trove implementation.

@senior-zero thanks!
Essentially, the implementations are the same. All of these ideas are built around the same building blocks:

  • recursively subpartitioning the warp-matrix
  • computing the offsets to be exchanged, by cajoling the compiler into emitting SEL instructions
  • doing the shuffle
  • writing the result back, again by cajoling the compiler into emitting a SEL instruction

Trove does the rotate separately, whereas in my implementation it is intertwined with the shuffles. I doubt this distinction matters in practice, since the compiler's dataflow analysis should see through all of this and rearrange instructions for maximum ILP anyway.

To reach peak performance, the indices need to be computable at compile time. Unless the array/warp-matrix size is a perfect power of two, these indices cannot be computed at compile time, which leads to local memory use. In my solution I require power-of-two arrays, since local memory use is unacceptable to me (it often leads to a 2x slowdown, besides consuming memory bandwidth from other concurrent kernels).

It would be disingenuous of me to claim to have invented all of this myself. I have built my implementation using ideas from

@senior-zero would it be acceptable to focus on power-of-two arrays only for the time being?

> @senior-zero would it be acceptable to focus on power-of-two arrays only for the time being?

We can definitely start with this and have a static assert for now:

static_assert(PowerOfTwo<ITEMS_PER_THREAD>::VALUE);

Some of our facilities already require power-of-two warps or have a different specialization for this case. If there's a request to extend support to other problem sizes, we can cover that later. Looking forward to your contribution!