NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Make decoupled look-back delay part of tuning

gevtushenko opened this issue

Preliminary tuning of the rle, select, partition, scan, and *by_key algorithms shows that the optimal delays in decoupled look-back spin-loops depend on the data type and the algorithm. Before tuning these algorithms, we should make CUB_DETAIL_L2_BACKOFF_NS and CUB_DETAIL_L2_WRITE_LATENCY_NS part of the algorithm-dependent tuning policy.
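A rough sketch of what that could look like; the names below are illustrative only, not CUB's actual tuning API:

```cuda
// Illustrative sketch: folding the two delay constants into an
// algorithm-specific tuning policy instead of global macros.
// All names here are hypothetical.
template <int BlockThreads,
          int ItemsPerThread,
          int L2BackoffNs,        // was the global CUB_DETAIL_L2_BACKOFF_NS
          int L2WriteLatencyNs>   // was the global CUB_DETAIL_L2_WRITE_LATENCY_NS
struct agent_policy_t
{
  static constexpr int BLOCK_THREADS        = BlockThreads;
  static constexpr int ITEMS_PER_THREAD     = ItemsPerThread;
  static constexpr int L2_BACKOFF_NS        = L2BackoffNs;
  static constexpr int L2_WRITE_LATENCY_NS  = L2WriteLatencyNs;
};

// A per-algorithm, per-type tuning could then pick its own delays, e.g.:
// using scan_policy_i32 = agent_policy_t<128, 15, 450, 350>;
```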

You could also try exponential back-off. In practice, I found that it works reasonably well on my Turing and Ada Lovelace GPUs.

https://github.com/IlyaGrebnov/libcubwt/blob/main/libcubwt.cu#L820
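For illustration, here is a minimal sketch of the exponential back-off pattern in a look-back spin loop. It assumes sm_70+ for __nanosleep, and the flag encoding and function name are made up; see the linked code for the real implementation:

```cuda
// Sketch of exponential back-off while polling a predecessor's status flag.
// Assumes sm_70+ (__nanosleep); the flag encoding is hypothetical.
__device__ unsigned int wait_for_predecessor(volatile unsigned int *flag)
{
  unsigned int delay_ns = 8; // initial delay
  unsigned int status;
  while ((status = *flag) == 0u) // 0 == predecessor result not ready yet
  {
    __nanosleep(delay_ns); // back off instead of hammering L2
    if (delay_ns < 256)
    {
      delay_ns *= 2; // double the delay up to a cap
    }
  }
  return status;
}
```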

One downside is that this approach slightly increases register pressure; in my case that was enough to cause spills, but I circumvented it by rearranging some blocks.

@IlyaGrebnov thank you for your comment!

I remember trying exponential back-off, and it was always worse than a static delay. I think it might be related to the effect you mentioned, so I'll try it again as part of the tuning infrastructure to figure out if there's a different CTA size / items per thread configuration that allows exponential back-off to outperform the static delay.

Regardless of the results, exponential back-off is not applicable to CUB_DETAIL_L2_WRITE_LATENCY_NS, so that constant would still have to be extracted into the tuning policy.
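One possible shape for that extraction, sketched with hypothetical names: the spin-loop back-off strategy becomes a pluggable policy type (static or exponential), while the write latency stays a plain per-policy constant, since it is a one-shot wait rather than a loop:

```cuda
// Hypothetical sketch: the spin-loop delay is a policy-selected strategy,
// the write-latency delay is a fixed per-policy constant.
// Assumes sm_70+ for __nanosleep.
struct static_backoff_t
{
  __device__ void operator()() { __nanosleep(450); } // tuned fixed delay
};

struct exponential_backoff_t
{
  unsigned int delay_ns = 8;
  __device__ void operator()()
  {
    __nanosleep(delay_ns);
    delay_ns = delay_ns < 256 ? delay_ns * 2 : delay_ns; // cap the growth
  }
};

template <class BackoffT, int WriteLatencyNs>
struct look_back_delays_t
{
  using backoff_t = BackoffT;                             // used in the spin loop
  static constexpr int WRITE_LATENCY_NS = WriteLatencyNs; // one-shot delay
};
```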