HazyResearch / m2

Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture"

Can I use Monarch Mixer to replace cross attention layers?

autumn-2-net opened this issue · comments

The sequence mixer in the paper doesn't seem to be able to mix sequences of unequal lengths the way cross attention does, because it uses elementwise multiplication. Is this a misunderstanding on my part, or is Monarch Mixer not a replacement for cross attention?
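
To illustrate the concern, here is a minimal PyTorch sketch (my own illustration, not code from this repo) of why an elementwise product ties both operands to the same sequence length, while attention's score matrix bridges unequal lengths:

```python
import torch

L_q, L_kv, d = 16, 32, 8   # query length, key/value length, model dim
q = torch.randn(L_q, d)
kv = torch.randn(L_kv, d)

# Cross attention: the (L_q x L_kv) score matrix bridges the two lengths.
scores = (q @ kv.T).softmax(dim=-1)   # shape (L_q, L_kv)
out = scores @ kv                     # shape (L_q, d)

# An elementwise gate, by contrast, needs matching shapes:
# q * kv  # would raise a shape error, since L_q != L_kv
```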

This is something we're very interested in and still working on! We don't have a formula for it quite yet.

That doesn't sound like good news. It looks like I'll just have to combine Monarch Mixer with cross attention. Is there a performance loss compared to plain attention?

We've seen that we can match self-attention in quality with some gated convolutions (see the paper for details). Cross attention is still an open problem, which we'll be working on!
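
For context, here is a minimal sketch of what a gated convolution sequence mixer can look like. The module below is an illustrative assumption (a depthwise long convolution with a sigmoid gate), not the actual M2 implementation:

```python
import torch
import torch.nn as nn

class GatedConvMixer(nn.Module):
    """Illustrative gated convolution sequence mixer: a depthwise
    convolution over the sequence, gated elementwise by a projection
    of the input (hypothetical, not the M2 code)."""
    def __init__(self, d_model: int, kernel_size: int = 128):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        L = x.size(1)
        # Depthwise conv mixes along the sequence; trim padding back to length L.
        u = self.conv(x.transpose(1, 2))[..., :L].transpose(1, 2)
        # Elementwise gate: both operands share the same (batch, L, d_model) shape.
        return self.out(u * torch.sigmoid(self.gate(x)))

# Usage: y = GatedConvMixer(d_model=64)(torch.randn(2, 256, 64))  # -> (2, 256, 64)
```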

If I use M2, can I drop positional encodings? M2 looks a bit similar to a convolution, which would let the model pick up positional information on its own.