DachengLi1 / LongChat

Official repository for LongChat and LongEval


Why does using flash attention during inference lead to slower performance?

xyfZzz opened this issue · comments

commented

Hi, I've seen the eval scripts mention that using flash attention will be slower. I am wondering why using flash attention during inference leads to slower performance, since my impression is that flash attention speeds things up.

parser.add_argument("--longchat_flash_attn", action='store_true', help="Only apply to longchat models. Whether to enable flash attention to save memory, but slower.")

@xyfZzz Flash attention by itself does not support a KV cache, so our naive implementation recomputes the cache again and again. We have a member on the vLLM team working on better support.
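
To make the point concrete, here is a toy NumPy sketch (not the repository's code; the shapes, weights, and function names are illustrative assumptions) contrasting decoding with a KV cache against a naive loop that rebuilds the prefix's keys and values at every step, which is what happens when the attention kernel cannot consume a cache:

```python
# Toy sketch, not LongChat's implementation: single-head attention with
# illustrative weight matrices, showing why decoding without a KV cache
# redoes work that a cache would reuse.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    # q: (1, d), k/v: (t, d) -> (1, d)
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decode_no_cache(x, wq, wk, wv):
    # Kernel takes no cache: at step t the K/V projections for all t prefix
    # tokens are rebuilt from scratch, so total work grows quadratically.
    outs = []
    for t in range(1, x.shape[0] + 1):
        k = x[:t] @ wk                  # recomputed every step
        v = x[:t] @ wv                  # recomputed every step
        q = x[t - 1:t] @ wq
        outs.append(attend(q, k, v))
    return np.concatenate(outs)

def decode_kv_cache(x, wq, wk, wv):
    # KV cache: only the newest token is projected; older rows are reused.
    k_cache = np.empty((0, wk.shape[1]))
    v_cache = np.empty((0, wv.shape[1]))
    outs = []
    for t in range(x.shape[0]):
        k_cache = np.vstack([k_cache, x[t:t + 1] @ wk])
        v_cache = np.vstack([v_cache, x[t:t + 1] @ wv])
        q = x[t:t + 1] @ wq
        outs.append(attend(q, k_cache, v_cache))
    return np.concatenate(outs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d = 16, 8
    x = rng.standard_normal((seq_len, d))
    wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
    print(np.allclose(decode_no_cache(x, wq, wk, wv),
                      decode_kv_cache(x, wq, wk, wv)))  # True
```

Both paths produce identical outputs; only the amount of per-step recomputation differs, which is why the naive flash-attention path saves memory but gets slower as the context grows.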

commented


I understand. Thank you for your explanation!