DachengLi1 / LongChat

Official repository for LongChat and LongEval


Why does using flash attention during inference lead to slower performance?

xyfZzz opened this issue · comments

commented

Hi, I've seen the eval scripts mention that using flash attention will be slower. I am wondering why using flash attention during inference leads to slower performance, since my impression is that flash attention speeds things up.

parser.add_argument("--longchat_flash_attn", action='store_true', help="Only apply to longchat models. Whether to enable flash attention to save memory, but slower.")

@xyfZzz Flash attention by itself does not support a KV cache, so our naive implementation recomputes the cache again and again. We have a member on the vLLM team working on better support.
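
To make the point concrete, here is a toy NumPy sketch (not the repository's code; the shapes, weights, and function names are illustrative assumptions) contrasting decoding with a KV cache against a naive loop that rebuilds the prefix's keys and values at every step, which is what happens when the attention kernel cannot consume a cache:

```python
# Toy sketch, not LongChat's implementation: single-head attention with
# illustrative weight matrices, showing why decoding without a KV cache
# redoes work that a cache would reuse.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    # q: (1, d), k/v: (t, d) -> (1, d)
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decode_no_cache(x, wq, wk, wv):
    # Kernel takes no cache: at step t the K/V projections for all t prefix
    # tokens are rebuilt from scratch, so total work grows quadratically.
    outs = []
    for t in range(1, x.shape[0] + 1):
        k = x[:t] @ wk                  # recomputed every step
        v = x[:t] @ wv                  # recomputed every step
        q = x[t - 1:t] @ wq
        outs.append(attend(q, k, v))
    return np.concatenate(outs)

def decode_kv_cache(x, wq, wk, wv):
    # KV cache: only the newest token is projected; older rows are reused.
    k_cache = np.empty((0, wk.shape[1]))
    v_cache = np.empty((0, wv.shape[1]))
    outs = []
    for t in range(x.shape[0]):
        k_cache = np.vstack([k_cache, x[t:t + 1] @ wk])
        v_cache = np.vstack([v_cache, x[t:t + 1] @ wv])
        q = x[t:t + 1] @ wq
        outs.append(attend(q, k_cache, v_cache))
    return np.concatenate(outs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d = 16, 8
    x = rng.standard_normal((seq_len, d))
    wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
    print(np.allclose(decode_no_cache(x, wq, wk, wv),
                      decode_kv_cache(x, wq, wk, wv)))  # True
```

Both paths produce identical outputs; only the amount of per-step recomputation differs, which is why the naive flash-attention path saves memory but gets slower as the context grows.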

commented


I understand. Thank you for your explanation!