NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Question regarding block launch order in CUDA

Snektron opened this issue · comments

The CUDA C programming guide mentions on page 13:

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 3, and only the runtime system needs to know the physical multiprocessor count.
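In other words, a grid must not assume that any particular block runs before another. A minimal illustration of why (hypothetical kernel, not from CUB): if a higher-indexed block spin-waits on a flag that only a lower-indexed block sets, and the scheduler happens to place the waiting block on the only free multiprocessor first, the kernel can deadlock.

// Hypothetical example, not CUB code: block 1 waits on a flag that only
// block 0 sets. Nothing guarantees block 0 is scheduled first, so if the
// spinning block occupies the only free multiprocessor, this never finishes.
__global__ void unsafe_ordering(volatile int *flag)
{
    if (blockIdx.x == 0)
    {
        *flag = 1;                 // "signal" from block 0
    }
    else
    {
        while (*flag == 0) { }     // block 1 spins -- may never see the signal
    }
}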

However, the code for agent scan contains this comment:

// Blocks are launched in increasing order, so just assign one tile per
// block
// Current tile index
int tile_idx = start_tile + blockIdx.x;

Does this mean that the scan relies on undefined behavior here, or have I missed some part of the CUDA specification?
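For reference, one common way to make tile assignment independent of launch order is to draw the tile index from a global atomic counter, so each block claims a tile in the order it actually begins executing. A minimal sketch under that assumption (hypothetical names, not CUB's implementation):

// Hypothetical sketch, not CUB's code: tile_counter is assumed to be
// zero-initialized (e.g. with cudaMemset) before the kernel launch.
__global__ void scan_tiles(int *tile_counter /*, ...scan state... */)
{
    __shared__ int tile_idx;

    // One thread per block claims the next tile; the whole block then uses it.
    if (threadIdx.x == 0)
    {
        tile_idx = atomicAdd(tile_counter, 1);
    }
    __syncthreads();

    // ... process tile `tile_idx` ...
}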