Some minor inconsistency in the paper
radarFudan opened this issue
According to your arXiv paper (https://arxiv.org/pdf/2310.12109.pdf), there is no activation in the sequence mixing in equation (2). However, the appendix code for MonarchMixerLayer includes a ReLU in the sequence mixing layer.
Also, equation (3) looks like an MLP operation to me (I might be wrong), so I don't fully understand the claim that "The resulting architecture is entirely attention- and MLP-free."
Thanks for your questions!
> it includes the ReLU in the sequence mixing layer
This is a typo: an earlier draft described an "optional" activation function that we set to the identity in the sequence mixer. We updated the equation but not the pseudocode -- we'll fix it the next time we update the arXiv!
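For concreteness, here is a minimal PyTorch sketch of what the corrected pseudocode describes: the identity activation in the sequence mixer, and a ReLU only in the dimension mixer. The class and attribute names are hypothetical, and dense `nn.Linear` layers stand in for the Monarch multiplies; this is an illustration, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class MonarchMixerLayerSketch(nn.Module):
    """Illustrative sketch only: sequence mixing has no activation (identity);
    the ReLU appears only in the dimension mixer. Dense linear layers stand
    in for Monarch matrix multiplies."""
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        self.mix_seq = nn.Linear(seq_len, seq_len, bias=False)  # placeholder for a Monarch matrix
        self.mix_dim = nn.Linear(dim, dim, bias=False)          # placeholder for a Monarch matrix

    def forward(self, x):  # x: (batch, seq_len, dim)
        # Sequence mixing along the length axis -- identity activation
        x = self.mix_seq(x.transpose(1, 2)).transpose(1, 2)
        # Dimension mixing along the feature axis -- ReLU activation
        return torch.relu(self.mix_dim(x))
```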
> equation (3) seems to be an MLP operation
Ah, this is helpful feedback! The distinction we intend is that an MLP is quadratic in the model dimension, while Monarch matrices are subquadratic.
Specifically -- in an MLP, the linear layers take quadratic compute, O(d^2) for model dimension d, whereas an order-2 Monarch matrix takes O(d^{3/2}) compute.
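To make the compute gap concrete, here is a hedged sketch of an order-2 Monarch multiply as two block-diagonal factors around a transpose, assuming n = b^2 with b blocks of size b; the function and tensor names are illustrative, not the repo's API. Each factor costs b * b^2 = n^{3/2} multiply-accumulates, versus n^2 for a dense matrix.

```python
import torch

def monarch_matmul(x, L, R):
    """Sketch: multiply x of shape (batch, n) by an order-2 Monarch matrix
    given as two block-diagonal factors L, R, each holding b blocks of shape
    (b, b), with n = b * b. Total cost ~2*b^3 = 2*n^(3/2) MACs, vs n^2 dense."""
    b = L.shape[0]
    x = x.reshape(-1, b, b)                  # split the n features into b blocks of size b
    x = torch.einsum('kij,bkj->bki', R, x)   # apply R's k-th block to the k-th input block
    x = x.transpose(1, 2)                    # the permutation between the two factors
    x = torch.einsum('kij,bkj->bki', L, x)   # apply L's k-th block
    return x.reshape(-1, b * b)

n, b = 1024, 32
L, R = torch.randn(b, b, b), torch.randn(b, b, b)
y = monarch_matmul(torch.randn(4, n), L, R)  # (4, 1024)
```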
We'll clarify the language in the next arXiv update, thank you!