ggerganov / llama.cpp

LLM inference in C/C++

Possible performance boost with 2-pass online softmax

zixuanweeei opened this issue

Per the discussion in https://arxiv.org/abs/1805.02867, I am wondering whether there is still a potential performance boost from a 2-pass online softmax. Flash attention, which is already available in this project, already fuses the softmax using the online normalizer, but for the cases where the standalone softmax op is still used there might be some benefit.
Whether it pays off ultimately depends on the model architecture and on how this project implements the op. I hope someone can help with the analysis.
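To make the idea concrete, here is a scalar sketch of the difference between the usual three-pass softmax and the two-pass online-normalizer variant from the paper. This is only an illustration, not ggml's actual implementation; the function names are made up for this example.

```cpp
// Reference (non-vectorized) comparison of the usual 3-pass softmax and the
// 2-pass "online normalizer" variant from https://arxiv.org/abs/1805.02867.
#include <float.h>
#include <math.h>
#include <stddef.h>

// 3-pass: max, then sum of exponentials, then normalize.
void softmax_3pass(const float * x, float * y, size_t n) {
    float m = -FLT_MAX;
    for (size_t i = 0; i < n; ++i) m = fmaxf(m, x[i]);          // pass 1: row max
    float d = 0.0f;
    for (size_t i = 0; i < n; ++i) d += expf(x[i] - m);         // pass 2: denominator
    for (size_t i = 0; i < n; ++i) y[i] = expf(x[i] - m) / d;   // pass 3: normalize
}

// 2-pass: the max and the denominator are computed together in one pass by
// rescaling the running sum whenever a new maximum is encountered.
void softmax_2pass(const float * x, float * y, size_t n) {
    float m = -FLT_MAX;
    float d = 0.0f;
    for (size_t i = 0; i < n; ++i) {                            // pass 1: fused max + sum
        const float m_new = fmaxf(m, x[i]);
        d = d * expf(m - m_new) + expf(x[i] - m_new);
        m = m_new;
    }
    for (size_t i = 0; i < n; ++i) y[i] = expf(x[i] - m) / d;   // pass 2: normalize
}
```

The saving is one full read of the row, which matters most when the row does not fit in fast on-chip memory and each pass has to go through global memory.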

Another paper that presents a similar 2-pass softmax algorithm is https://arxiv.org/abs/2001.04438, though its focus is on CPUs.

I tried implementing it for CUDA/HIP to see what the performance would look like. On an RX 5700 XT, test-backend-ops for softmax showed roughly a 0.93-1.02x speedup relative to master, depending on the case. On a GTX 1050, performance was much worse, generally around a 0.80x speedup, with some outliers in both directions.

For very large tensors that need global memory instead of shared memory, which aren't covered by test-backend-ops by default, performance was around 20% to 40% faster than master on both GPUs.

You can look at this branch if you're interested: https://github.com/Engininja2/llama.cpp/tree/2pass-softmax
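For readers who don't want to dig through the branch, a warp-per-row kernel in this spirit can look roughly like the sketch below. This is not the code from the branch above: the kernel name, the one-warp-per-block launch, and the omission of the extra inputs the real soft_max op supports (mask, scaling) are all simplifications for illustration, and HIP would need the usual portability tweaks.

```cuda
// Sketch of a warp-per-row two-pass online softmax kernel.
// Each block is a single warp and handles one row of ncols floats, read
// directly from global memory on both passes (no shared-memory caching).
#include <cuda_runtime.h>
#include <float.h>
#include <math.h>

#define WARP_SIZE 32

__global__ void softmax_online_f32(const float * x, float * dst, const int ncols) {
    const int row  = blockIdx.x;
    const int lane = threadIdx.x; // 0..WARP_SIZE-1

    const float * src = x   + (size_t) row*ncols;
    float       * out = dst + (size_t) row*ncols;

    // pass 1: each lane keeps a running max m and a running denominator d,
    // rescaling d whenever its local max increases
    float m = -FLT_MAX;
    float d = 0.0f;
    for (int col = lane; col < ncols; col += WARP_SIZE) {
        const float v     = src[col];
        const float m_new = fmaxf(m, v);
        d = d*expf(m - m_new) + expf(v - m_new);
        m = m_new;
    }

    // combine the per-lane (m, d) pairs across the warp
    #pragma unroll
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        const float m_other = __shfl_xor_sync(0xffffffff, m, offset, WARP_SIZE);
        const float d_other = __shfl_xor_sync(0xffffffff, d, offset, WARP_SIZE);
        const float m_new   = fmaxf(m, m_other);
        d = d*expf(m - m_new) + d_other*expf(m_other - m_new);
        m = m_new;
    }

    // pass 2: re-read the row and write the normalized values
    const float inv_d = 1.0f/d;
    for (int col = lane; col < ncols; col += WARP_SIZE) {
        out[col] = expf(src[col] - m)*inv_d;
    }
}

// launch, one warp per row:
//   softmax_online_f32<<<nrows, WARP_SIZE>>>(d_x, d_dst, ncols);
```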

Hi @Engininja2. Thanks for the comments. I tried an initial implementation based on the online normalizer from https://arxiv.org/abs/1805.02867, which performs better than the approach from https://arxiv.org/abs/2001.04438 on almost all of the default cases in test-backend-ops. You can give it a try if you're interested: https://github.com/zixuanweeei/llama.cpp/tree/zx/two-pass-softmax