NVIDIA / cub

As far as I understand, the select_if kernel implements an inter-block barrier to ensure that the previous tile has finished before its prefix sum is read. Since there is no guarantee that thread blocks are scheduled in blockIdx order (https://forums.developer.nvidia.com/t/performance-cost-of-too-many-blocks/67982/13), a block with a lower blockIdx may not be issued in the first wave of blocks across the device. The resident blocks would then wait on a tile whose block cannot be scheduled, which results in a hang or other failures. So I think one of the following might be helpful.

As the linked discussion pointed out, one option is to use a global atomic counter to assign tile_idx instead of deriving it from blockIdx (cub/cub/agent/agent_select_if.cuh, line 686 in b2e8bcc):

int tile_idx = (blockIdx.x * gridDim.y) + blockIdx.y; // Current tile index
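To make the atomic tile index idea concrete, here is a minimal sketch. It is not the CUB implementation: the kernel claim_tiles, the helper every_tile_claimed_once, and the buffers tile_counter and tile_claims are hypothetical names used only for illustration. The point is that a block receiving tile k from the counter knows that tiles 0..k-1 were handed to blocks that have already started running, so waiting on a predecessor tile cannot depend on a block that has not been scheduled yet.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each block takes its tile index from a global counter
// in the order in which blocks actually start, not from blockIdx.
__global__ void claim_tiles(int *tile_counter, int *tile_claims, int num_tiles)
{
    __shared__ int s_tile_idx;

    // One thread per block claims the next unprocessed tile.
    if (threadIdx.x == 0)
    {
        s_tile_idx = atomicAdd(tile_counter, 1);
    }
    __syncthreads();

    int tile_idx = s_tile_idx; // Replaces the blockIdx-derived tile index
    if (tile_idx >= num_tiles)
    {
        return;
    }

    // Placeholder for the real per-tile work: record that the tile was claimed.
    if (threadIdx.x == 0)
    {
        tile_claims[tile_idx] += 1;
    }
}

// Hypothetical host helper: launches one block per tile and checks that
// every tile index was handed out exactly once.
bool every_tile_claimed_once(int num_tiles)
{
    int *d_counter = nullptr;
    int *d_claims  = nullptr;
    if (cudaMalloc(&d_counter, sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_claims, num_tiles * sizeof(int)) != cudaSuccess)
    {
        cudaFree(d_counter);
        return false;
    }

    // The counter must be reset to zero before every launch.
    cudaMemset(d_counter, 0, sizeof(int));
    cudaMemset(d_claims, 0, num_tiles * sizeof(int));

    claim_tiles<<<num_tiles, 128>>>(d_counter, d_claims, num_tiles);

    std::vector<int> h_claims(num_tiles);
    cudaError_t err = cudaMemcpy(h_claims.data(), d_claims,
                                 num_tiles * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    cudaFree(d_claims);
    if (err != cudaSuccess) return false;

    for (int claims : h_claims)
    {
        if (claims != 1) return false;
    }
    return true;
}
```

The cost of this approach is one extra atomic per block plus a counter that has to be zeroed before each launch, so it would need temporary storage and an extra memset (or an init kernel) in the dispatch layer.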
Alternatively, the grid launch should guarantee that the number of blocks is small enough for all of them to be resident on the device at the same time (cub/cub/device/dispatch/dispatch_select_if.cuh, line 422 in b2e8bcc):

scan_grid_size.y = cub::DivideAndRoundUp(num_tiles, max_dim_x);
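For the grid size idea, a rough sketch of how the launch could be capped using the CUDA occupancy API is shown below. The names placeholder_kernel, max_resident_blocks, and capped_grid_size are hypothetical, and the kernel stands in for the real select_if kernel. The sketch assumes that nothing else is running on the device and that the kernel uses no dynamic shared memory. With a capped grid, each block would also have to loop over several tiles, which this sketch does not show.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical stand-in for the kernel whose residency we want to bound.
__global__ void placeholder_kernel(int *out)
{
    if (out != nullptr && threadIdx.x == 0)
    {
        out[blockIdx.x] = 1;
    }
}

// Upper bound on the number of blocks of placeholder_kernel that can be
// resident on the current device at once. Returns 0 on any CUDA error.
int max_resident_blocks(int block_threads)
{
    int device        = 0;
    int sm_count      = 0;
    int blocks_per_sm = 0;

    if (cudaGetDevice(&device) != cudaSuccess) return 0;
    if (cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount,
                               device) != cudaSuccess) return 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, placeholder_kernel, block_threads, 0) != cudaSuccess)
    {
        return 0;
    }
    return sm_count * blocks_per_sm;
}

// Grid size that never exceeds the resident-block bound, so every launched
// block can be running while others wait on its tile status.
int capped_grid_size(int num_tiles, int block_threads)
{
    int resident = max_resident_blocks(block_threads);
    return resident > 0 ? std::min(num_tiles, resident) : 0;
}
```

Compared with the atomic counter, this keeps tile_idx tied to blockIdx but makes the launch configuration depend on the device and on the kernel's resource usage.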

This is a duplicate of #245

select_if kernel needs grid boundary or reprogramming tile_idx