glassroom / heinsen_attention

Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024)

Home Page: http://arxiv.org/abs/2404.05843


Linear attention

shaochenze opened this issue

Hi Franz, I still feel that the method proposed can be classified as a variant of linear attention. To elaborate, the softmax-log function can be simplified to a form of linear normalization: $$\mathrm{softmax}(\log(X))_{ij}=\frac{X_{ij}}{\sum_{j}X_{ij}}.$$ Consequently, Equation 1 aligns with the instance where $$X=\exp(Q)\exp(K)^T.$$
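For concreteness, here is a quick numerical check of that identity, a minimal PyTorch sketch (not code from this repository), using an arbitrary positive matrix $X$:

```python
import torch

# Any strictly positive matrix X (positivity is needed so log(X) is finite).
X = torch.rand(4, 6) + 0.1

# Left-hand side: softmax applied row-wise to log(X).
lhs = torch.softmax(torch.log(X), dim=-1)

# Right-hand side: plain row-wise normalization of X.
rhs = X / X.sum(dim=-1, keepdim=True)

print(torch.allclose(lhs, rhs, atol=1e-6))  # expected: True
```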

Assuming that the output of Equation 1 is $V'$, we can then express it as: $$V'_i = \frac{\sum_j \exp(Q_i)^T \exp(K_j)\, V_j}{\sum_j \exp(Q_i)^T \exp(K_j)},$$

which is Equation 4 in [1] with $\phi = \exp$.

[1] Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (2020). https://arxiv.org/pdf/2006.16236.pdf
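And here is a small end-to-end sketch of that equivalence. It only checks the algebraic identity, ignoring the causal masking and log-space bookkeeping the actual implementation uses for numerical stability; the tensor names and shapes below are illustrative, not the repository's API:

```python
import torch

n, m, d, d_v = 5, 7, 8, 3
Q = torch.randn(n, d)
K = torch.randn(m, d)
V = torch.randn(m, d_v)

# Form (a): Equation 1 with X = exp(Q) exp(K)^T, i.e. softmax over log(X).
A = torch.exp(Q) @ torch.exp(K).T                      # (n, m), strictly positive
out_softmax = torch.softmax(torch.log(A), dim=-1) @ V  # (n, d_v)

# Form (b): linear attention with feature map phi = exp (Equation 4 in [1]).
phi_q, phi_k = torch.exp(Q), torch.exp(K)
numer = phi_q @ (phi_k.T @ V)                          # (n, d_v)
denom = phi_q @ phi_k.sum(dim=0, keepdim=True).T       # (n, 1)
out_linear = numer / denom

print(torch.allclose(out_softmax, out_linear, atol=1e-4))  # expected: True (up to float error)
```

Form (b) is the usual linear-attention factorization, in which the summations over keys and values can be accumulated incrementally, which is what yields constant cost per additional token.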

Thank you for posting your comment here! That way everyone can benefit from our discussion.

> the softmax-log function can be simplified to a form of linear normalization

Ah, I see what you mean. Yes, that's correct! Applying the composition $\text{Softmax} \circ \log$ is in fact what makes this work. Once you see it, it's obvious in hindsight.

So yes, you're right: eq. (1) in my preprint is expressible as a variant of linear attention. I will update the FAQs shortly to reflect as much. Thank you again!

PS. @shaochenze , I added a link to your comment in the README. Thank you again!