BGMV performs better than SGMV?
I benchmarked various kernels on an A100 using the benchmark script, and it seems that the BGMV kernel outperforms the SGMV kernels for individual requests (the bgmv scenario). Is this expected?
![Screenshot 2024-01-31 at 4 28 27 PM](https://private-user-images.githubusercontent.com/84881952/301348310-cabe1c29-1dfc-4b9a-830f-471d94f9fd34.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIxMTU3MjUsIm5iZiI6MTcyMjExNTQyNSwicGF0aCI6Ii84NDg4MTk1Mi8zMDEzNDgzMTAtY2FiZTFjMjktMWRmYy00YjlhLTgzMGYtNDcxZDk0ZjlmZDM0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MjclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzI3VDIxMjM0NVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWY3NTczYTI1YjAwZThhZDhlODEzMGE2NmE5NjVkNzk5MmY4OTMwNzgzMTMzZTU1OGZiY2U0MjE0YmVmMjI5NGQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.fI4DwiY9qkzQA_rdTYl9NGnpv52W8F9zh-ohSeqVT-4)
Hi @jsheng-jian, thanks for running the benchmark. Yes, this is somewhat expected, since the current SGMV implementation is not optimized for individual requests. A better implementation of SGMV (which we are integrating into flashinfer) may reach performance similar to BGMV, but I don't expect SGMV to be faster in this case.
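
For readers unfamiliar with the two operators, here is a rough pure-Python sketch of what each one computes. It only illustrates the semantics and says nothing about performance. The helper names (`matvec`, `bgmv_ref`, `sgmv_ref`) and the argument layout are assumptions made for illustration, not the actual kernel API. As I understand it, the real kernels also accumulate into an existing output buffer, which is omitted here. The point is that when every request uses a different adapter, each SGMV segment has length one, so SGMV degenerates to the same per-row work that BGMV does directly.

```python
# Reference semantics only: hypothetical helpers, not the real CUDA kernels.

def matvec(x, w):
    """Return x @ w for a vector x (length h_in) and a matrix w (h_in x h_out)."""
    h_out = len(w[0])
    return [sum(x[k] * w[k][j] for k in range(len(x))) for j in range(h_out)]

def bgmv_ref(xs, weights, indices):
    """BGMV: row i is multiplied by its own adapter, weights[indices[i]]."""
    return [matvec(x, weights[idx]) for x, idx in zip(xs, indices)]

def sgmv_ref(xs, weights, seg_starts, seg_adapters):
    """SGMV: rows seg_starts[j]:seg_starts[j+1] all share weights[seg_adapters[j]]."""
    out = []
    for j, adapter in enumerate(seg_adapters):
        for i in range(seg_starts[j], seg_starts[j + 1]):
            out.append(matvec(xs[i], weights[adapter]))
    return out

# Three individual requests, each using a different adapter.
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # adapter 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # adapter 1: scale by 2
    [[0.0, 1.0], [1.0, 0.0]],  # adapter 2: swap the two components
]
indices = [0, 1, 2]        # BGMV: one adapter index per row
seg_starts = [0, 1, 2, 3]  # SGMV: three segments, each of length one
seg_adapters = [0, 1, 2]

bgmv_out = bgmv_ref(xs, weights, indices)
sgmv_out = sgmv_ref(xs, weights, seg_starts, seg_adapters)
assert bgmv_out == sgmv_out  # same result; only the batching strategy differs
```

With longer segments (many rows sharing one adapter), the inner loop in `sgmv_ref` corresponds to work a kernel can batch into a single matrix multiply, which is the case SGMV is designed for.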