NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Make decoupled look-back delay part of tuning

gevtushenko opened this issue

Preliminary tuning of the rle, select, partition, scan, and *by_key algorithms shows that the optimal delays in decoupled look-back spin-loops depend on the data type and the algorithm. Before tuning these algorithms, we should make CUB_DETAIL_L2_BACKOFF_NS and CUB_DETAIL_L2_WRITE_LATENCY_NS part of the algorithm-dependent tuning policy.
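A rough sketch of what that could look like; the names below are illustrative only, not CUB's actual tuning API:

```cuda
// Illustrative sketch: folding the two delay constants into an
// algorithm-specific tuning policy instead of global macros.
// All names here are hypothetical.
template <int BlockThreads,
          int ItemsPerThread,
          int L2BackoffNs,        // was the global CUB_DETAIL_L2_BACKOFF_NS
          int L2WriteLatencyNs>   // was the global CUB_DETAIL_L2_WRITE_LATENCY_NS
struct agent_policy_t
{
  static constexpr int BLOCK_THREADS        = BlockThreads;
  static constexpr int ITEMS_PER_THREAD     = ItemsPerThread;
  static constexpr int L2_BACKOFF_NS        = L2BackoffNs;
  static constexpr int L2_WRITE_LATENCY_NS  = L2WriteLatencyNs;
};

// A per-algorithm, per-type tuning could then pick its own delays, e.g.:
// using scan_policy_i32 = agent_policy_t<128, 15, 450, 350>;
```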

You could also try exponential back-off. In practice, I found that it works reasonably well on my Turing and Ada Lovelace GPUs.

https://github.com/IlyaGrebnov/libcubwt/blob/main/libcubwt.cu#L820
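For illustration, here is a minimal sketch of the exponential back-off pattern in a look-back spin loop. It assumes sm_70+ for __nanosleep, and the flag encoding and function name are made up; see the linked code for the real implementation:

```cuda
// Sketch of exponential back-off while polling a predecessor's status flag.
// Assumes sm_70+ (__nanosleep); the flag encoding is hypothetical.
__device__ unsigned int wait_for_predecessor(volatile unsigned int *flag)
{
  unsigned int delay_ns = 8; // initial delay
  unsigned int status;
  while ((status = *flag) == 0u) // 0 == predecessor result not ready yet
  {
    __nanosleep(delay_ns); // back off instead of hammering L2
    if (delay_ns < 256)
    {
      delay_ns *= 2; // double the delay up to a cap
    }
  }
  return status;
}
```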

One downside is that this approach slightly increases register pressure; in my case that was enough to cause spills, but I circumvented it by rearranging some blocks.

@IlyaGrebnov thank you for your comment!

I remember trying exponential back-off, and it was always worse than a static delay. I think it might be related to the effect you mentioned, so I'll try it again as part of the tuning infrastructure to figure out if there's a different CTA size / items per thread configuration that allows exponential back-off to outperform the static delay.

Regardless of the results, exponential back-off is not applicable to CUB_DETAIL_L2_WRITE_LATENCY_NS, so that constant would still have to be extracted into the tuning policy.
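One possible shape for that extraction, sketched with hypothetical names: the spin-loop back-off strategy becomes a pluggable policy type (static or exponential), while the write latency stays a plain per-policy constant, since it is a one-shot wait rather than a loop:

```cuda
// Hypothetical sketch: the spin-loop delay is a policy-selected strategy,
// the write-latency delay is a fixed per-policy constant.
// Assumes sm_70+ for __nanosleep.
struct static_backoff_t
{
  __device__ void operator()() { __nanosleep(450); } // tuned fixed delay
};

struct exponential_backoff_t
{
  unsigned int delay_ns = 8;
  __device__ void operator()()
  {
    __nanosleep(delay_ns);
    delay_ns = delay_ns < 256 ? delay_ns * 2 : delay_ns; // cap the growth
  }
};

template <class BackoffT, int WriteLatencyNs>
struct look_back_delays_t
{
  using backoff_t = BackoffT;                             // used in the spin loop
  static constexpr int WRITE_LATENCY_NS = WriteLatencyNs; // one-shot delay
};
```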