glassroom / heinsen_attention

Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024)

Home Page: http://arxiv.org/abs/2404.05843


Linear attention

shaochenze opened this issue

Hi Franz, I still feel that the method proposed can be classified as a variant of linear attention. To elaborate, the softmax-log function can be simplified to a form of linear normalization: $$\mathrm{softmax}(\log(X))_{ij}=\frac{X_{ij}}{\sum_{j}X_{ij}}.$$ Consequently, Equation 1 aligns with the instance where $$X=\exp(Q)\exp(K)^T.$$
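For concreteness, here is a quick numerical check of that identity, a minimal PyTorch sketch (not code from this repository), using an arbitrary positive matrix $X$:

```python
import torch

# Any strictly positive matrix X (positivity is needed so log(X) is finite).
X = torch.rand(4, 6) + 0.1

# Left-hand side: softmax applied row-wise to log(X).
lhs = torch.softmax(torch.log(X), dim=-1)

# Right-hand side: plain row-wise normalization of X.
rhs = X / X.sum(dim=-1, keepdim=True)

print(torch.allclose(lhs, rhs, atol=1e-6))  # expected: True
```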

Assuming that the output of Equation 1 is $V'$, we can then express it as: $$V'_i = \frac{\sum_j \exp(Q_i)^T \exp(K_j)\, V_j}{\sum_j \exp(Q_i)^T \exp(K_j)},$$

which is Equation 4 in [1] with $\phi = \exp$.

[1] Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (2020). https://arxiv.org/pdf/2006.16236.pdf
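And here is a small end-to-end sketch of that equivalence. It only checks the algebraic identity, ignoring the causal masking and log-space bookkeeping the actual implementation uses for numerical stability; the tensor names and shapes below are illustrative, not the repository's API:

```python
import torch

n, m, d, d_v = 5, 7, 8, 3
Q = torch.randn(n, d)
K = torch.randn(m, d)
V = torch.randn(m, d_v)

# Form (a): Equation 1 with X = exp(Q) exp(K)^T, i.e. softmax over log(X).
A = torch.exp(Q) @ torch.exp(K).T                      # (n, m), strictly positive
out_softmax = torch.softmax(torch.log(A), dim=-1) @ V  # (n, d_v)

# Form (b): linear attention with feature map phi = exp (Equation 4 in [1]).
phi_q, phi_k = torch.exp(Q), torch.exp(K)
numer = phi_q @ (phi_k.T @ V)                          # (n, d_v)
denom = phi_q @ phi_k.sum(dim=0, keepdim=True).T       # (n, 1)
out_linear = numer / denom

print(torch.allclose(out_softmax, out_linear, atol=1e-4))  # expected: True (up to float error)
```

Form (b) is the usual linear-attention factorization, in which the summations over keys and values can be accumulated incrementally, which is what yields constant cost per additional token.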

Thank you for posting your comment here! That way everyone can benefit from our discussion.

> the softmax-log function can be simplified to a form of linear normalization

Ah, I see what you mean. Yes, that's correct! Applying the composition $\text{Softmax} \circ \log$ is in fact what makes this work. Once you see it, it's obvious in hindsight.

So yes, you're right: eq. (1) in my preprint is expressible as a variant of linear attention. I will update the FAQs shortly to reflect as much. Thank you again!

PS. @shaochenze , I added a link to your comment in the README. Thank you again!