Make decoupled look-back delay part of tuning
gevtushenko opened this issue
Preliminary tuning of the rle, select, partition, scan, and *by_key algorithms shows that the delays in the decoupled look-back spin loops depend on the data type and the algorithm. Before tuning these algorithms, we should make CUB_DETAIL_L2_BACKOFF_NS and CUB_DETAIL_L2_WRITE_LATENCY_NS part of the algorithm-dependent tuning policy.
You could also try exponential back-off. In practice I found that it works reasonably well on my Turing and Ada Lovelace GPUs.
https://github.com/IlyaGrebnov/libcubwt/blob/main/libcubwt.cu#L820
One downside is that this approach slightly increases register pressure; in my case that was enough to cause spills, but I circumvented it by rearranging some blocks.
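To make the doubling schedule concrete, here is a host-side C++ sketch of an exponential back-off delay sequence. On the device the wait would be something like `__nanosleep(delay)` inside the look-back spin loop; here only the delay progression is modeled, and the initial/cap values are illustrative, not taken from libcubwt or CUB:

```cpp
#include <cstdint>
#include <vector>

// Exponential back-off schedule for a spin loop: start small, double the
// delay after each failed poll, and clamp at a cap. The extra per-thread
// state (the current delay) is what can raise register pressure on device.
std::vector<uint32_t> backoff_schedule(uint32_t initial_ns, uint32_t max_ns, int spins)
{
    std::vector<uint32_t> delays;
    uint32_t delay = initial_ns;
    for (int i = 0; i < spins; ++i)
    {
        delays.push_back(delay);                           // wait this long, then re-poll
        delay = (delay * 2 < max_ns) ? delay * 2 : max_ns; // double, clamped to the cap
    }
    return delays;
}
```

For example, `backoff_schedule(8, 64, 6)` yields the delays 8, 16, 32, 64, 64, 64 ns: short waits while the predecessor's result is likely imminent, bounded waits once it is clearly not.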
@IlyaGrebnov thank you for your comment!
I remember trying exponential back-off, and it was always worse than a static delay. I think that might be related to the effect you mentioned, so I'll try it again as part of the tuning infrastructure to figure out if there's a different CTA size / items-per-thread configuration that allows exponential back-off to outperform a static delay.
Regardless of the results, exponential back-off is not applicable to CUB_DETAIL_L2_WRITE_LATENCY_NS, so that value would still have to be extracted into the tuning policy.