ggerganov / llama.cpp

LLM inference in C/C++

Possible performance boost with 2-pass online softmax

zixuanweeei opened this issue

Per the discussion in https://arxiv.org/abs/1805.02867, I am wondering whether there is still a potential performance boost from a 2-pass online softmax. Flash attention, which is already available in this project, already fuses the softmax using the online normalizer, but for the cases where the standalone softmax op is still used there might be some benefit.
Whether it pays off ultimately depends on the model architecture and on how this project implements the op. I hope someone can help with the analysis.
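To make the idea concrete, here is a scalar sketch of the difference between the usual three-pass softmax and the two-pass online-normalizer variant from the paper. This is only an illustration, not ggml's actual implementation; the function names are made up for this example.

```cpp
// Reference (non-vectorized) comparison of the usual 3-pass softmax and the
// 2-pass "online normalizer" variant from https://arxiv.org/abs/1805.02867.
#include <float.h>
#include <math.h>
#include <stddef.h>

// 3-pass: max, then sum of exponentials, then normalize.
void softmax_3pass(const float * x, float * y, size_t n) {
    float m = -FLT_MAX;
    for (size_t i = 0; i < n; ++i) m = fmaxf(m, x[i]);          // pass 1: row max
    float d = 0.0f;
    for (size_t i = 0; i < n; ++i) d += expf(x[i] - m);         // pass 2: denominator
    for (size_t i = 0; i < n; ++i) y[i] = expf(x[i] - m) / d;   // pass 3: normalize
}

// 2-pass: the max and the denominator are computed together in one pass by
// rescaling the running sum whenever a new maximum is encountered.
void softmax_2pass(const float * x, float * y, size_t n) {
    float m = -FLT_MAX;
    float d = 0.0f;
    for (size_t i = 0; i < n; ++i) {                            // pass 1: fused max + sum
        const float m_new = fmaxf(m, x[i]);
        d = d * expf(m - m_new) + expf(x[i] - m_new);
        m = m_new;
    }
    for (size_t i = 0; i < n; ++i) y[i] = expf(x[i] - m) / d;   // pass 2: normalize
}
```

The saving is one full read of the row, which matters most when the row does not fit in fast on-chip memory and each pass has to go through global memory.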

Another paper that presents a similar 2-pass softmax algorithm is https://arxiv.org/abs/2001.04438, though its focus is on CPUs.

I tried implementing it for CUDA/HIP to see what the performance would look like. On an RX 5700 XT, test-backend-ops for softmax showed roughly a 0.93-1.02x speedup relative to master, depending on the case. On a GTX 1050, performance was much worse, generally around a 0.80x speedup, with some outliers in both directions.

For very large tensors that need global memory instead of shared memory, which aren't covered by test-backend-ops by default, performance was around 20% to 40% faster than master on both GPUs.

You can look at this branch if you're interested: https://github.com/Engininja2/llama.cpp/tree/2pass-softmax
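For readers who don't want to dig through the branch, a warp-per-row kernel in this spirit can look roughly like the sketch below. This is not the code from the branch above: the kernel name, the one-warp-per-block launch, and the omission of the extra inputs the real soft_max op supports (mask, scaling) are all simplifications for illustration, and HIP would need the usual portability tweaks.

```cuda
// Sketch of a warp-per-row two-pass online softmax kernel.
// Each block is a single warp and handles one row of ncols floats, read
// directly from global memory on both passes (no shared-memory caching).
#include <cuda_runtime.h>
#include <float.h>
#include <math.h>

#define WARP_SIZE 32

__global__ void softmax_online_f32(const float * x, float * dst, const int ncols) {
    const int row  = blockIdx.x;
    const int lane = threadIdx.x; // 0..WARP_SIZE-1

    const float * src = x   + (size_t) row*ncols;
    float       * out = dst + (size_t) row*ncols;

    // pass 1: each lane keeps a running max m and a running denominator d,
    // rescaling d whenever its local max increases
    float m = -FLT_MAX;
    float d = 0.0f;
    for (int col = lane; col < ncols; col += WARP_SIZE) {
        const float v     = src[col];
        const float m_new = fmaxf(m, v);
        d = d*expf(m - m_new) + expf(v - m_new);
        m = m_new;
    }

    // combine the per-lane (m, d) pairs across the warp
    #pragma unroll
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        const float m_other = __shfl_xor_sync(0xffffffff, m, offset, WARP_SIZE);
        const float d_other = __shfl_xor_sync(0xffffffff, d, offset, WARP_SIZE);
        const float m_new   = fmaxf(m, m_other);
        d = d*expf(m - m_new) + d_other*expf(m_other - m_new);
        m = m_new;
    }

    // pass 2: re-read the row and write the normalized values
    const float inv_d = 1.0f/d;
    for (int col = lane; col < ncols; col += WARP_SIZE) {
        out[col] = expf(src[col] - m)*inv_d;
    }
}

// launch, one warp per row:
//   softmax_online_f32<<<nrows, WARP_SIZE>>>(d_x, d_dst, ncols);
```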

Hi @Engininja2. Thanks for the comments. I tried an initial implementation based on the online normalizer from https://arxiv.org/abs/1805.02867, which performs better than the approach from https://arxiv.org/abs/2001.04438 on almost all of the default cases in test-backend-ops. You can give it a try if you're interested: https://github.com/zixuanweeei/llama.cpp/tree/zx/two-pass-softmax