NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

select_if kernel needs grid boundary or reprogramming tile_idx

zhaolianshuizls opened this issue

As far as I understand, the select_if kernel implements an inter-block barrier to ensure that the previous tile has finished before its prefix sum is read. Since there is no guarantee that thread blocks are scheduled to execute in blockIdx order (https://forums.developer.nvidia.com/t/performance-cost-of-too-many-blocks/67982/13), a block with a lower blockIdx may not be issued in the very first round of blocks across the device. The resident blocks would then wait on a tile that has not started running, which can result in a hang or other failures. So I think the following might be helpful.

  1. As the linked discussion points out, use a global atomic counter to assign tile_idx in the order in which blocks actually start running, instead of deriving it from blockIdx (
    int tile_idx = (blockIdx.x * gridDim.y) + blockIdx.y; // Current tile index
    )
  2. Or have the grid launch guarantee that the number of blocks is small enough for all of them to be resident on the device at the same time (
    scan_grid_size.y = cub::DivideAndRoundUp(num_tiles, max_dim_x);
    )
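To make option 1 concrete, here is a minimal sketch of the atomic-counter idea. It is not CUB's implementation: the kernel name `chained_tile_sum`, the buffers `tile_prefix`, `tile_ready`, and `tile_counter`, and the simple "wait for the previous tile" chaining are all assumptions chosen for illustration. Each block takes a ticket with `atomicAdd` when it starts running, so any tile it waits on already belongs to a block that has started.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: computes the inclusive prefix of per-tile sums by
// chaining each tile to its predecessor. Tile indices are handed out by an
// atomic counter, so they follow the order in which blocks start running.
__global__ void chained_tile_sum(const int* in, int* tile_prefix, int* tile_ready,
                                 int* tile_counter, int items_per_tile, int num_items)
{
    __shared__ int s_tile_idx;
    __shared__ int s_sum;

    if (threadIdx.x == 0) {
        s_tile_idx = atomicAdd(tile_counter, 1);  // ticket replaces blockIdx-based index
        s_sum = 0;
    }
    __syncthreads();
    const int tile_idx = s_tile_idx;

    // Reduce this tile's items into shared memory.
    int local = 0;
    for (int i = threadIdx.x; i < items_per_tile; i += blockDim.x) {
        int idx = tile_idx * items_per_tile + i;
        if (idx < num_items) local += in[idx];
    }
    atomicAdd(&s_sum, local);
    __syncthreads();

    if (threadIdx.x == 0) {
        volatile int* ready  = tile_ready;
        volatile int* prefix = tile_prefix;
        int exclusive = 0;
        if (tile_idx > 0) {
            // The predecessor's ticket was taken earlier, so its block is
            // already running (or finished) and will publish its result.
            while (ready[tile_idx - 1] == 0) { }
            __threadfence();
            exclusive = prefix[tile_idx - 1];
        }
        prefix[tile_idx] = exclusive + s_sum;
        __threadfence();          // make the prefix visible before the flag
        ready[tile_idx] = 1;
    }
}

int main()
{
    const int items_per_tile = 256;
    const int num_items = (1 << 22) + 100;  // not a multiple of the tile size
    const int num_tiles = (num_items + items_per_tile - 1) / items_per_tile;

    std::vector<int> h_in(num_items, 1);
    int *d_in, *d_prefix, *d_ready, *d_counter;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMalloc(&d_prefix, num_tiles * sizeof(int));
    cudaMalloc(&d_ready, num_tiles * sizeof(int));
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), num_items * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_ready, 0, num_tiles * sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));  // must be reset before every launch

    chained_tile_sum<<<num_tiles, 128>>>(d_in, d_prefix, d_ready, d_counter,
                                         items_per_tile, num_items);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<int> h_prefix(num_tiles);
    cudaMemcpy(h_prefix.data(), d_prefix, num_tiles * sizeof(int), cudaMemcpyDeviceToHost);
    for (int t = 0; t < num_tiles; ++t) {
        // With an all-ones input, the inclusive prefix is the item count so far.
        ok = ok && (h_prefix[t] == std::min((t + 1) * items_per_tile, num_items));
    }
    std::printf(ok ? "ok\n" : "mismatch\n");

    cudaFree(d_in); cudaFree(d_prefix); cudaFree(d_ready); cudaFree(d_counter);
    return ok ? 0 : 1;
}
```

The sketch launches far more blocks than can be resident at once, which is the situation described above. For option 2, one possible way to bound the grid is to multiply the result of `cudaOccupancyMaxActiveBlocksPerMultiprocessor` by the device's multiprocessor count and have each block loop over several tiles; that variant is not shown here.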

This is a duplicate of #245