Why does using flash attention at the inference stage lead to slower performance?
xyfZzz opened this issue · comments
Hi, I've seen the eval scripts mention that using flash attention will be slower. I'm wondering why using flash attention at the inference stage leads to slower performance, since my impression is that flash attention is supposed to speed things up.
Line 62 in a824bda
@xyfZzz Flash attention by itself does not support a KV cache, so our naive implementation recomputes the keys and values for the entire prefix again and again at every decoding step. We have a member on the vLLM team working on better support.
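To illustrate the cost difference described above, here is a minimal NumPy sketch (not the repo's actual code) comparing naive decoding, which re-projects K/V for the whole prefix at every step, against decoding with a KV cache, which projects only the newest token. All weights and dimensions here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # head dimension (illustrative)
n = 6   # number of decoding steps

X = rng.standard_normal((n, d))       # token hidden states
Wq = rng.standard_normal((d, d))      # query/key/value projections
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# --- No KV cache: re-project K/V for the whole prefix at every step ---
no_cache_projections = 0
outs_no_cache = []
for t in range(1, n + 1):
    K = X[:t] @ Wk                 # recomputed from scratch each step
    V = X[:t] @ Wv
    no_cache_projections += t      # t positions projected this step
    q = X[t - 1] @ Wq
    outs_no_cache.append(attend(q, K, V))

# --- With KV cache: project only the new token, append to the cache ---
cache_projections = 0
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outs_cached = []
for t in range(1, n + 1):
    K_cache = np.vstack([K_cache, X[t - 1] @ Wk])  # one new key
    V_cache = np.vstack([V_cache, X[t - 1] @ Wv])  # one new value
    cache_projections += 1
    q = X[t - 1] @ Wq
    outs_cached.append(attend(q, K_cache, V_cache))

# Identical outputs, very different work: O(n^2) vs O(n) projections.
assert np.allclose(outs_no_cache, outs_cached)
print(no_cache_projections, cache_projections)  # → 21 6
```

Without the cache the projection work grows quadratically with sequence length (1 + 2 + … + n positions), which is why a flash attention kernel that cannot read from a KV cache can end up slower than a plain attention implementation that can.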