vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://docs.vllm.ai

[Performance]: Will memcpy happen with non-contiguous KV caches while decoding?

GodHforever opened this issue

Proposal to improve performance

In the prefill stage, vLLM computes the full key/value tensors and then copies them into the KV cache for use in the later decode phase.
I am curious about the decode stage: after many tokens have been generated, if the tokens in the key cache are not stored contiguously, does the attention computation first make a complete copy of them into a contiguous buffer, or is it always computed token by token directly from the cache? If it is the former, wouldn't the memory copy add latency? If it is the latter, wouldn't it reduce computational efficiency?
I would appreciate it if someone familiar with this could explain. I have not found the relevant part while debugging the source code, and have not yet had time to explore it in depth.
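
To make the second alternative concrete, here is a minimal, illustrative sketch (plain NumPy, not vLLM's actual PagedAttention CUDA kernel) of how decode-time attention can walk a block table and read key/value blocks in place, so that no contiguous copy of the scattered cache is ever built. The function name, block size, shapes, and block-table layout below are all hypothetical and chosen only to illustrate the question.

```python
# Illustrative sketch only: attention for one decode step over a paged KV cache,
# reading each physical block in place via a block table (no gather/memcpy into
# one contiguous key/value tensor). Names and sizes are hypothetical.
import numpy as np

BLOCK_SIZE = 4   # tokens per physical cache block (hypothetical)
HEAD_DIM = 8

def paged_decode_attention(q, key_blocks, value_blocks, block_table, seq_len):
    """q: (HEAD_DIM,) query of the current decode step.
    key_blocks/value_blocks: (num_blocks, BLOCK_SIZE, HEAD_DIM) physical pool.
    block_table: physical block ids in logical order for this sequence.
    seq_len: number of tokens already cached for this sequence."""
    scores = np.empty(seq_len)
    # Walk the block table: each logical block is read from wherever it lives
    # in the physical pool, block by block, without copying it out first.
    for logical_idx, phys_id in enumerate(block_table):
        start = logical_idx * BLOCK_SIZE
        if start >= seq_len:
            break
        n = min(BLOCK_SIZE, seq_len - start)
        scores[start:start + n] = key_blocks[phys_id, :n] @ q

    # Softmax over all cached positions, then weight the values block by block.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    out = np.zeros(HEAD_DIM)
    for logical_idx, phys_id in enumerate(block_table):
        start = logical_idx * BLOCK_SIZE
        if start >= seq_len:
            break
        n = min(BLOCK_SIZE, seq_len - start)
        out += weights[start:start + n] @ value_blocks[phys_id, :n]
    return out

# Tiny usage example: 6 cached tokens spread across two non-adjacent blocks.
rng = np.random.default_rng(0)
key_blocks = rng.standard_normal((10, BLOCK_SIZE, HEAD_DIM))
value_blocks = rng.standard_normal((10, BLOCK_SIZE, HEAD_DIM))
q = rng.standard_normal(HEAD_DIM)
print(paged_decode_attention(q, key_blocks, value_blocks, block_table=[7, 2], seq_len=6))
```

The trade-off being asked about is visible here: the loop never materializes a contiguous copy of the cache, at the cost of indexing block by block inside the attention computation.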

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`