deepseek-ai / DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Clarifications Needed on KV Cache Compression and Matrix Operations in MLA

hxer7963 opened this issue

In MLA, the KV cache compresses $h_t$ into a latent vector $c_t^{KV} = W^{DKV}h_t \in \mathbb{R}^{d_c}$, and, to circumvent the incompatibility of low-rank KV-cache compression with RoPE, concatenates a decoupled key $k_t^R = \mathrm{RoPE}(W^{KR}h_t) \in \mathbb{R}^{d_h^R}$.

However, according to equation (17), $k_{t,i} = [k_{t,i}^C; k_t^R]$, so during the attention computation the up-projected key $k_t^C = W^{UK}c_t^{KV} \in \mathbb{R}^{d_h n_h}$ is used rather than $c_t^{KV}$ itself.
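Since the question hinges on exactly which tensors are cached versus recomputed, here is a minimal PyTorch sketch of the path described above; all dimensions and the `rope` helper are toy stand-ins for illustration, not DeepSeek-V2's actual configuration or code.

```python
# Minimal sketch of the MLA cache path, with toy dimensions.
import torch

d, d_c, n_h, d_h, d_h_R = 64, 16, 4, 8, 4

W_DKV = torch.randn(d_c, d)          # down-projection: h_t -> c_t^{KV}
W_UK  = torch.randn(n_h * d_h, d_c)  # up-projection:   c_t^{KV} -> k_t^C
W_KR  = torch.randn(d_h_R, d)        # decoupled RoPE key projection

def rope(x: torch.Tensor, pos: int) -> torch.Tensor:
    """Toy rotary embedding: rotate paired halves by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = pos * 10000.0 ** (-torch.arange(half, dtype=x.dtype) / half)
    cos, sin = freqs.cos(), freqs.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

h_t = torch.randn(d)                 # hidden state of token t

# Only these two vectors are cached: d_c + d_h^R numbers per token.
c_kv = W_DKV @ h_t                   # c_t^{KV} in R^{d_c}
k_R  = rope(W_KR @ h_t, pos=0)       # k_t^R   in R^{d_h^R}

# At attention time, the per-head content keys are reconstructed from the
# cached latent, and the shared RoPE key is concatenated onto each head
# as in equation (17): k_{t,i} = [k_{t,i}^C; k_t^R].
k_C = (W_UK @ c_kv).view(n_h, d_h)                       # k_{t,i}^C, one row per head
k   = torch.cat([k_C, k_R.expand(n_h, d_h_R)], dim=-1)   # n_h heads of size d_h + d_h^R
print(k.shape)  # torch.Size([4, 12])
```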

Appendix B mentions that, by the associative law of matrix multiplication, $W^{UK}$ can be absorbed into $W^Q$: $W^Q\big[W^{UK}(W^{DKV}h_t)\big] = (W^QW^{UK})(W^{DKV}h_t) = (W^{UQ})\,c_t^{KV}$.
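To make the associativity step concrete, here is a toy single-head check of the non-RoPE part of the attention score; the per-head shapes for `W_Q` and `W_UK` are assumptions for illustration (the question below uses the stacked $d_h n_h \times d$ shapes), and the decoupled $k_t^R$ is left out because its position-dependent RoPE rotation blocks this merge.

```python
# Toy check: the naive score and the "absorbed" score are identical,
# so k_s^C never needs to be materialized at decode time.
import torch

d, d_c, d_h = 64, 16, 8
W_Q   = torch.randn(d_h, d)    # per-head query projection (illustrative shape)
W_UK  = torch.randn(d_h, d_c)  # per-head key up-projection (illustrative shape)
W_DKV = torch.randn(d_c, d)

h_t, h_s = torch.randn(d), torch.randn(d)   # query-side and key-side tokens
c_kv = W_DKV @ h_s                          # cached latent for token s

# Naive score: materialize k_s^C = W^UK c_s^{KV}, then dot with q_t.
naive = (W_Q @ h_t) @ (W_UK @ c_kv)

# Absorbed score: fold W^UK into the query side and dot with the latent directly.
absorbed = ((W_Q.T @ W_UK).T @ h_t) @ c_kv

print(torch.allclose(naive, absorbed, rtol=1e-4))  # True
```

Note that the product that type-checks here is $(W^Q)^\top W^{UK} \in \mathbb{R}^{d \times d_c}$: the absorption happens inside the dot product $q_t^\top k_t^C$, not as a direct product $W^Q W^{UK}$, which is the dimension bookkeeping Question 1 below turns on.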

Questions:

  1. Given that $W^Q \in \mathbb{R}^{d_h n_h \times d}$ and $W^{UK} \in \mathbb{R}^{d_h n_h \times d_c}$, how are these matrices multiplied to derive $W^{UQ}$? Their inner dimensions do not align for a direct product.
  2. How are the matrices $W^{DKV}$, $W^{UK}$, and $W^{KR}$ obtained? Appendix B seems to suggest they are computed offline once, rather than learned during training as part of the low-rank factorization.

Any insights or detailed explanations regarding these points would be highly appreciated.

@hxer7963 Here's a recommended blog post on this: https://spaces.ac.cn/archives/10091