Some minor inconsistency in the paper
radarFudan opened this issue
According to your arXiv paper (https://arxiv.org/pdf/2310.12109.pdf), there is no activation in the sequence mixing in equation (2). However, the appendix code for MonarchMixerLayer includes a ReLU in the sequence mixing layer.
Also, equation (3) looks like an MLP operation to me (I might be wrong), so I don't fully understand the claim that "The resulting architecture is entirely attention- and MLP-free."
Thanks for your questions!
> it includes the ReLU in the sequence mixing layer
This is a typo: an earlier draft described an "optional" activation function that we set to the identity in the sequence mixer. We updated the equation but not the pseudocode -- we'll fix it the next time we update the arXiv!
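For concreteness, here is a minimal PyTorch sketch of what the corrected pseudocode describes: the identity activation in the sequence mixer, and a ReLU only in the dimension mixer. The class and attribute names are hypothetical, and dense `nn.Linear` layers stand in for the Monarch multiplies; this is an illustration, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class MonarchMixerLayerSketch(nn.Module):
    """Illustrative sketch only: sequence mixing has no activation (identity);
    the ReLU appears only in the dimension mixer. Dense linear layers stand
    in for Monarch matrix multiplies."""
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        self.mix_seq = nn.Linear(seq_len, seq_len, bias=False)  # placeholder for a Monarch matrix
        self.mix_dim = nn.Linear(dim, dim, bias=False)          # placeholder for a Monarch matrix

    def forward(self, x):  # x: (batch, seq_len, dim)
        # Sequence mixing along the length axis -- identity activation
        x = self.mix_seq(x.transpose(1, 2)).transpose(1, 2)
        # Dimension mixing along the feature axis -- ReLU activation
        return torch.relu(self.mix_dim(x))
```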
> equation (3) seems to be an MLP operation
Ah, this is helpful feedback! The distinction we intend is that an MLP is quadratic in the model dimension, while Monarch matrices are subquadratic.
Specifically -- in an MLP, the linear layers take quadratic compute, O(d^2) for model dimension d, whereas an order-2 Monarch matrix takes O(d^{3/2}) compute.
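To make the compute gap concrete, here is a hedged sketch of an order-2 Monarch multiply as two block-diagonal factors around a transpose, assuming n = b^2 with b blocks of size b; the function and tensor names are illustrative, not the repo's API. Each factor costs b * b^2 = n^{3/2} multiply-accumulates, versus n^2 for a dense matrix.

```python
import torch

def monarch_matmul(x, L, R):
    """Sketch: multiply x of shape (batch, n) by an order-2 Monarch matrix
    given as two block-diagonal factors L, R, each holding b blocks of shape
    (b, b), with n = b * b. Total cost ~2*b^3 = 2*n^(3/2) MACs, vs n^2 dense."""
    b = L.shape[0]
    x = x.reshape(-1, b, b)                  # split the n features into b blocks of size b
    x = torch.einsum('kij,bkj->bki', R, x)   # apply R's k-th block to the k-th input block
    x = x.transpose(1, 2)                    # the permutation between the two factors
    x = torch.einsum('kij,bkj->bki', L, x)   # apply L's k-th block
    return x.reshape(-1, b * b)

n, b = 1024, 32
L, R = torch.randn(b, b, b), torch.randn(b, b, b)
y = monarch_matmul(torch.randn(4, n), L, R)  # (4, 1024)
```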
We'll clarify the language in the next arXiv update, thank you!