HazyResearch / m2

Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture"


Some minor inconsistency in the paper

radarFudan opened this issue · comments

According to your arXiv paper (https://arxiv.org/pdf/2310.12109.pdf), there is no activation in the sequence mixing in formula (2). However, the pseudocode for MonarchMixerLayer in the appendix includes a ReLU in the sequence mixing layer.

Plus, equation (3) seems to be an MLP operation (I might be wrong). I don't fully understand why it is said to be MLP-free in "The resulting architecture is entirely attention- and MLP-free."

Thanks for your questions!

it includes the ReLU in the sequence mixing layer

This is a typo - we used to say that the sequence mixing had an "optional" activation function that we would set to identity for the sequence mixer. We updated the equation but not the pseudocode -- will fix it the next time we update the arXiv!
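To make the corrected placement of activations concrete, here is a minimal numpy sketch (the name `seq_mix` and the shapes are hypothetical, not the repo's API): the mixer takes an optional activation that defaults to identity, which is what the sequence mixer uses; a ReLU would only be passed in elsewhere.

```python
import numpy as np

def seq_mix(M_seq, X, activation=lambda z: z):
    """Mix along the sequence axis with mixing matrix M_seq.
    The activation defaults to identity, matching the corrected
    equation (no ReLU in sequence mixing)."""
    return activation(M_seq @ X)  # (seq, seq) @ (seq, d) -> (seq, d)

rng = np.random.default_rng(0)
M_seq = rng.standard_normal((8, 8))
X = rng.standard_normal((8, 16))

out_identity = seq_mix(M_seq, X)                          # sequence mixer: identity
out_relu = seq_mix(M_seq, X, lambda z: np.maximum(z, 0))  # optional ReLU variant
```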

equation (3) seems to be an MLP operation

Ah, this is helpful feedback! The distinction we intend is that an MLP is quadratic in $d$, while the M2 version is sub-quadratic.

Specifically -- in an MLP, the linear layers take quadratic compute ($O(d^2)$ for dimension $d$). In M2, we replace these linear layers with Monarch matrices, which can be computed in sub-quadratic time. So it has a similar structure to an MLP, but is sub-quadratic in $d$.
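To illustrate the complexity claim, here is a minimal numpy sketch of a Monarch matrix-vector product (hypothetical names, not the repo's implementation), assuming the factorization $M = P L P R$ with $L$, $R$ block-diagonal ($\sqrt{d}$ blocks of size $\sqrt{d} \times \sqrt{d}$) and $P$ the reshape-transpose permutation:

```python
import numpy as np

def monarch_matvec(L_blocks, R_blocks, x):
    """Compute M @ x for a Monarch matrix M = P @ Ld @ P @ Rd, where
    Ld, Rd are block-diagonal with b blocks of size b x b (d = b*b) and
    P is the reshape-transpose permutation. Each blockwise product costs
    O(b^3) = O(d^1.5), vs O(d^2) for a dense linear layer."""
    b = L_blocks.shape[0]
    y = np.einsum('ijk,ik->ij', R_blocks, x.reshape(b, b))  # Rd, blockwise
    y = y.T                                                  # permutation P
    y = np.einsum('ijk,ik->ij', L_blocks, y)                 # Ld, blockwise
    return y.T.reshape(-1)                                   # permutation P

b = 4  # so d = b*b = 16
rng = np.random.default_rng(0)
L_blocks = rng.standard_normal((b, b, b))
R_blocks = rng.standard_normal((b, b, b))
x = rng.standard_normal(b * b)
out = monarch_matvec(L_blocks, R_blocks, x)  # length-d output
```

The einsums only ever touch $b$ blocks of size $b \times b$, which is where the sub-quadratic cost comes from; a dense equivalent can be built by materializing the block-diagonal and permutation matrices, which is useful for checking correctness at small sizes.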

We'll clarify the language in the next arXiv update, thank you!