select_if kernel needs grid boundary or reprogramming tile_idx
zhaolianshuizls opened this issue · comments
As far as I understand, the `select_if` kernel implements an inter-block barrier to ensure that the previous tile has finished before the prefix sum is read from it. Since there is no guarantee that thread blocks are scheduled in `blockIdx` order (https://forums.developer.nvidia.com/t/performance-cost-of-too-many-blocks/67982/13), a block with a lower `blockIdx` may not be issued in the first wave of blocks across the device. The blocks that are already resident would then wait on a tile whose block cannot be scheduled, which results in a hang or other failures. So I think one of the following might be helpful.
- As the linked discussion points out, use an atomic global variable to redefine `tile_idx` (cub/cub/agent/agent_select_if.cuh, line 686 in b2e8bcc)
- Or the grid launch should guarantee that the number of blocks is small enough for all of them to be resident on the device at the same time ( )
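
To make the two suggestions concrete, here is a minimal sketch of what I have in mind. It is not CUB's code: `chained_prefix_kernel`, `tile_counter`, `tile_ready`, `tile_prefix`, `run_dynamic_tile_demo` and `max_resident_blocks` are hypothetical names, and the "scan" is reduced to one value per tile. Each block claims its tile index with `atomicAdd` instead of using `blockIdx.x`, so the block that owns tile k-1 has always started before the block that owns tile k begins to wait on it. The second helper estimates how many blocks can be resident at once, assuming the kernel has the device to itself.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical demo kernel (not CUB's implementation). Tile indices are handed
// out in the order blocks actually start running, not in blockIdx order.
__global__ void chained_prefix_kernel(const int* in, int* out, int num_tiles,
                                      unsigned int* tile_counter,
                                      volatile int* tile_ready,
                                      volatile int* tile_prefix)
{
    __shared__ unsigned int tile_idx;

    // One thread claims the next tile and broadcasts it through shared memory.
    if (threadIdx.x == 0) {
        tile_idx = atomicAdd(tile_counter, 1u);
    }
    __syncthreads();
    if (tile_idx >= static_cast<unsigned int>(num_tiles)) return;

    if (threadIdx.x == 0) {
        int prefix = 0;
        if (tile_idx > 0) {
            // Wait for the predecessor tile. Its block claimed its index
            // earlier, so it is already running or has already finished.
            while (tile_ready[tile_idx - 1] == 0) { }
            __threadfence();
            prefix = tile_prefix[tile_idx - 1];
        }
        int inclusive = prefix + in[tile_idx];
        out[tile_idx] = inclusive;
        tile_prefix[tile_idx] = inclusive;
        __threadfence();            // publish the prefix before the flag
        tile_ready[tile_idx] = 1;
    }
}

// Launches one block per tile and checks the running sum of an all-ones input.
bool run_dynamic_tile_demo(int num_tiles)
{
    std::vector<int> h_in(num_tiles, 1), h_out(num_tiles, 0);
    const size_t bytes = num_tiles * sizeof(int);
    int *d_in, *d_out, *d_ready, *d_prefix;
    unsigned int* d_counter;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_ready, bytes);
    cudaMalloc(&d_prefix, bytes);
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_ready, 0, bytes);
    cudaMemset(d_prefix, 0, bytes);
    cudaMemset(d_counter, 0, sizeof(unsigned int));  // must be reset per launch

    chained_prefix_kernel<<<num_tiles, 32>>>(d_in, d_out, num_tiles,
                                             d_counter, d_ready, d_prefix);
    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_ready);
    cudaFree(d_prefix); cudaFree(d_counter);

    if (err != cudaSuccess) return false;
    for (int i = 0; i < num_tiles; ++i) {
        if (h_out[i] != i + 1) return false;
    }
    return true;
}

// Second suggestion: an upper bound on the grid size such that every block
// can be resident at once (assumes no other kernel shares the device).
int max_resident_blocks(int block_size)
{
    int device = 0, num_sms = 0, blocks_per_sm = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  chained_prefix_kernel,
                                                  block_size, 0);
    return num_sms * blocks_per_sm;
}
```

Calling `run_dynamic_tile_demo(4096)` from a `main` exercises the first option with more blocks than are usually resident at once. The price is one extra atomic per block and a counter that has to be zeroed before every launch. With the second option the kernel would instead loop over tiles inside a grid of at most `max_resident_blocks(block_size)` blocks.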
This is a duplicate of #245