NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Register-only based `WarpExchange`

pb-dseifert opened this issue · comments

Hi CUB maintainers!

In my GPU work, I have found that many algorithms require switching between blocked and striped arrangements, either to maximise ILP (blocked arrangement) or to maximise memory bandwidth (striped arrangement). The current CUB algorithm for this is `WarpExchange`, which to my knowledge always goes through shared memory. Using some unrolling tricks and careful recursive decomposition, I have managed to write a generic implementation that accomplishes the same using only registers and warp shuffles. This might be useful in situations where a kernel already consumes a lot of shared memory but register pressure is still low enough to afford warp shuffles.

Would you be willing to entertain a PR for my implementation?

Hello @pb-dseifert, and thank you for reaching out and showing interest in contributing to our project!

We are definitely interested in this direction. We were going to research the applicability of an in-register implementation of warp exchange described in "A Decomposition for In-place Matrix Transposition" by Bryan Catanzaro, Alexander Keller, and Michael Garland. The corresponding implementation can be found here. It'd be interesting to compare your approach with the one described above. I also think it would make sense to add a template parameter to `WarpExchange` that lets users opt in to the new implementation.

> Would you be willing to entertain a PR for my implementation?

We would certainly be open to reviewing your PR. Please go ahead and submit it. Also, please add a brief description illustrating the key differences from the trove implementation.

@senior-zero thanks!
Essentially, the implementations are the same. All of these ideas are built around the same building blocks:

  • recursively subpartitioning the warp-matrix
  • computing the offsets to be exchanged, by cajoling the compiler into emitting SEL instructions
  • doing the shuffle
  • writing the result back, again by cajoling the compiler into emitting a SEL instruction

Trove does the rotate separately, whereas in my implementation it is intertwined with the shuffles. I doubt this distinction matters in practice, since the compiler's dataflow analysis should see through all of this and rearrange instructions for maximum ILP anyway.

To reach peak performance, the indices need to be computable at compile time. Unless the array/warp-matrix size is a perfect power of two, these indices cannot be computed at compile time, which leads to local memory use. In my solution I require power-of-two arrays, since local memory use is unacceptable to me (it often leads to a 2x slowdown, besides consuming memory bandwidth from other concurrent kernels).

It would be disingenuous of me to claim to have invented all of this myself. I have built my implementation using ideas from

@senior-zero would it be acceptable to focus on power-of-two arrays only for the time being?

> @senior-zero would it be acceptable to focus on power-of-two arrays only for the time being?

We can definitely start with this and have a static assert for now:

static_assert(PowerOfTwo<ITEMS_PER_THREAD>::VALUE);

Some of our facilities already require power-of-two warps or have a different specialization for this case. If there's a request to extend support to other problem sizes, we can cover that later. Looking forward to your contribution!