HazyResearch / m2

Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture"

MonarchMixerLayer

jeohalves opened this issue

Hello,

I've come across an algorithm in the paper that appears to describe the M2 layer, intended to replace both the Attention and MLP layers (specifically the nn.Linear part of the latter).

However, upon examining the monarch_mixer_sequence_mixer.py script, I noticed that it uses Hyena filters, and I couldn't find any implementation of this M2 layer algorithm in the code.

I might be missing something, but I wanted to ask whether it's necessary to replace the Hyena filters with the M2 layer.

Thank you for your assistance with this project!

P.S.: I'm currently working with image data.

Great question!

If you look at Section 5.1, you'll see that we use Monarch matrices to implement long convolutions in conjunction with gating for a lot of the backbones (also see this image from the blog).
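
In case it helps, here's a minimal sketch of that idea: a per-channel long convolution wrapped in elementwise gating. The FFT below stands in for the Monarch transform (Monarch matrices can implement the DFT, so the structure of the computation is the same), and the module and argument names (GatedLongConvMixer, d_model, l_max) are illustrative, not the repo's API.

```python
# A minimal sketch (not the repo's code) of a gated long-convolution sequence
# mixer in the spirit of Section 5.1. The FFT stands in for the Monarch
# transform; all names here are illustrative only.
import torch
import torch.nn as nn


class GatedLongConvMixer(nn.Module):
    def __init__(self, d_model: int, l_max: int):
        super().__init__()
        # One long convolution kernel per channel, as long as the sequence.
        self.kernel = nn.Parameter(torch.randn(d_model, l_max) * 0.02)
        # Projections for the input branch, the gate, and the output.
        self.in_proj = nn.Linear(d_model, d_model)
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, seq_len, d_model)
        b, l, d = u.shape
        x = self.in_proj(u)
        gate = self.gate_proj(u)

        # Long convolution via FFT, zero-padded to 2*l so the circular
        # convolution becomes a linear one. In M2, this transform is where
        # the Monarch matrices come in.
        k_f = torch.fft.rfft(self.kernel[:, :l], n=2 * l)    # (d, l + 1)
        x_f = torch.fft.rfft(x.transpose(1, 2), n=2 * l)     # (b, d, l + 1)
        y = torch.fft.irfft(x_f * k_f, n=2 * l)[..., :l]     # (b, d, l)
        y = y.transpose(1, 2)                                 # (b, l, d)

        # Gating: elementwise multiply by an input-dependent gate.
        return self.out_proj(y * gate)
```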

I think with image data you might actually be fine without the Hyena stuff - it's more important for language. In older experiments we did see higher performance on ImageNet with the gating and the Hyena kernels; on CIFAR, the performance is about the same with and without the gating.
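
For image data, the usual recipe is to flatten patches into a sequence (ViT-style) and mix along the patch dimension. A hypothetical usage of the sketch above, with made-up shapes:

```python
# Hypothetical usage on image data: 14x14 = 196 patches, 192-dim embeddings.
mixer = GatedLongConvMixer(d_model=192, l_max=196)
patches = torch.randn(8, 196, 192)   # (batch, num_patches, d_model)
out = mixer(patches)                 # (8, 196, 192)

# Dropping the gating (as discussed above for CIFAR) just means skipping the
# elementwise multiply by `gate` in the forward pass.
```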