NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Question regarding block launch order in CUDA

Snektron opened this issue · comments

The CUDA C programming guide mentions on page 13:

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 3, and only the runtime system needs to know the physical multiprocessor count.
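In other words, a grid must not assume that any particular block runs before another. A minimal illustration of why (hypothetical kernel, not from CUB): if a higher-indexed block spin-waits on a flag that only a lower-indexed block sets, and the scheduler happens to place the waiting block on the only free multiprocessor first, the kernel can deadlock.

// Hypothetical example, not CUB code: block 1 waits on a flag that only
// block 0 sets. Nothing guarantees block 0 is scheduled first, so if the
// spinning block occupies the only free multiprocessor, this never finishes.
__global__ void unsafe_ordering(volatile int *flag)
{
    if (blockIdx.x == 0)
    {
        *flag = 1;                 // "signal" from block 0
    }
    else
    {
        while (*flag == 0) { }     // block 1 spins -- may never see the signal
    }
}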

However, the code for agent scan contains this comment:

// Blocks are launched in increasing order, so just assign one tile per
// block
// Current tile index
int tile_idx = start_tile + blockIdx.x;

Does this mean that the scan relies on undefined behavior here, or have I missed some part of the CUDA specification?
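For reference, one common way to make tile assignment independent of launch order is to draw the tile index from a global atomic counter, so each block claims a tile in the order it actually begins executing. A minimal sketch under that assumption (hypothetical names, not CUB's implementation):

// Hypothetical sketch, not CUB's code: tile_counter is assumed to be
// zero-initialized (e.g. with cudaMemset) before the kernel launch.
__global__ void scan_tiles(int *tile_counter /*, ...scan state... */)
{
    __shared__ int tile_idx;

    // One thread per block claims the next tile; the whole block then uses it.
    if (threadIdx.x == 0)
    {
        tile_idx = atomicAdd(tile_counter, 1);
    }
    __syncthreads();

    // ... process tile `tile_idx` ...
}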