HazyResearch / m2

Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture"

Code for projecting pre-trained BERT weights into Monarch matrices

sinamps opened this issue

Hello, I would like to know if you have published the code to project the pre-trained weights of the BERT model into Monarch matrices. I cannot locate the code for this (I have also looked in the fly repo).
I can see the projection functions here, but I am interested in knowing how you use them specifically for BERT (or other transformers for NLP) to go from pre-trained weights to Monarch matrices. Thank you very much.

Ah, we don't actually use those in our work - that file was just copy-pasted from the fly repo. In M2 we're training everything from scratch, since the gated convolutional layers are quite different in function from an attention layer. It would be interesting to figure out how to distill an attention layer into a gated convolution!

Thank you for your prompt response @DanFu09. Would you happen to have any pointers on how that was done in the fly work? I am already working with those projection functions from the fly repo, but I want to make sure I correctly reproduce the results.
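For anyone else reading this thread: the projection in question reduces to a set of independent rank-1 SVDs. Writing the Monarch factorization as M = P^T L P R, with L and R block-diagonal and P the stride-m perfect-shuffle permutation, each small slice of M (with the right indexing) is a rank-1 outer product of a column of one L block and a row of one R block. Below is a minimal PyTorch sketch of that idea, assuming the simplest case of a square n x n matrix with n = m^2; it is an illustration of the math, with hypothetical helper names, not the fly repo's `blockdiag_butterfly_project`:

```python
# Minimal sketch of the Monarch projection (an illustration of the math,
# not the fly repo's code). Assumes the simplest case: a square (n x n)
# matrix with n = m * m.
import torch


def monarch_project(M: torch.Tensor, m: int):
    """Project dense M onto M ~ P^T L P R, where L = diag(L_1..L_m) and
    R = diag(R_1..R_m) are block-diagonal with (m x m) blocks and P is
    the stride-m perfect shuffle. Returns the blocks as (m, m, m) tensors."""
    n = m * m
    assert M.shape == (n, n)
    # Index rows as i*m + k and columns as j*m + l. For an exact Monarch
    # matrix, M[i*m+k, j*m+l] = L_k[i, j] * R_j[k, l], so each (i, l) slice
    # with (k, j) fixed is rank 1; the projection takes the best rank-1
    # approximation of every such slice.
    M4 = M.reshape(m, m, m, m).permute(1, 2, 0, 3)  # index order (k, j, i, l)
    L = torch.zeros(m, m, m, dtype=M.dtype)
    R = torch.zeros(m, m, m, dtype=M.dtype)
    for k in range(m):
        for j in range(m):
            U, S, Vh = torch.linalg.svd(M4[k, j])
            s = S[0].sqrt()            # split the top singular value
            L[k, :, j] = s * U[:, 0]   # column j of block L_k
            R[j, k, :] = s * Vh[0, :]  # row k of block R_j
    return L, R


def monarch_dense(L: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """Rebuild the dense matrix from the blocks, to check the error."""
    m = L.shape[0]
    # Mhat[i*m+k, j*m+l] = L_k[i, j] * R_j[k, l]
    return torch.einsum('kij,jkl->ikjl', L, R).reshape(m * m, m * m)


torch.manual_seed(0)
m = 8
W = torch.randn(m * m, m * m)  # stand-in for a pre-trained dense weight
L, R = monarch_project(W, m)
rel_err = (monarch_dense(L, R) - W).norm() / W.norm()
print(f"relative projection error: {rel_err.item():.3f}")
```

With something like this, projecting pre-trained BERT weights would mean running `monarch_project` over each dense weight matrix (reshaped or padded to a compatible size) and loading the resulting blocks into the corresponding block-diagonal layers.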