HazyResearch / m2

Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture"


BERT-like implementation

lhallee opened this issue

Hello,

Amazing work!!!

I have a couple of questions regarding the bidirectional implementation of the model.

  1. Does the MonarchMixerSequenceMixing layer have, by default, all the recommended settings used in the training of the BERT-like models (obviously with bidirectional = True)? If not, is it possible to share the settings used for M2 large?
  2. It seems like the input u is after a token embedding layer. Do you add positional embeddings?
  3. Is any sort of attention mask required?
  4. Is it really okay to say M2 outperforms BERT when trained on different data? I think C4 improves BERT base considerably if I remember correctly.

Best,
Logan

Thanks for your questions and your interest!

Does the MonarchMixerSequenceMixing layer have, by default, all the recommended settings used in the training of the BERT-like models (obviously with bidirectional = True)? If not, is it possible to share the settings used for M2 large?

You can find all the settings for M2-BERT-large here (260m) and here (341m).

There are also detailed instructions to run everything in this README.
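If it's useful, here's a minimal sketch of inspecting one of those YAML configs from Python to see which mixer settings it sets. The path and the `model` key below are illustrative assumptions; use the actual files linked above and their real layout.

```python
import yaml  # pip install pyyaml

# Hypothetical path for illustration; substitute the M2-BERT-large YAML linked above.
CONFIG_PATH = "yamls/pretrain/m2-bert-large.yaml"

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

# Dump the model-related section to see the sequence-mixer hyperparameters
# (the exact key names depend on the repo's config format).
for key, value in cfg.get("model", {}).items():
    print(f"{key}: {value}")
```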

It seems like the input u is after a token embedding layer. Do you add positional embeddings?

We add positional embeddings at the very beginning of the architecture; see here.
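For intuition, here's a minimal sketch of that front end, assuming learned absolute positional embeddings summed with the token embeddings before the mixer blocks. The module below is illustrative only, not the repo's actual code:

```python
import torch
import torch.nn as nn

class EmbeddingFrontEnd(nn.Module):
    """Token + learned absolute positional embeddings, applied before the mixer blocks.

    Illustrative sketch; see the linked code for the actual M2-BERT embedding layer.
    """

    def __init__(self, vocab_size: int, d_model: int, max_seq_len: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_seq_len, d_model)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len) integer token ids
        seq_len = input_ids.shape[1]
        positions = torch.arange(seq_len, device=input_ids.device)
        # u = token embedding + positional embedding, then fed to the sequence mixer
        return self.tok(input_ids) + self.pos(positions)

u = EmbeddingFrontEnd(vocab_size=30522, d_model=768, max_seq_len=128)(
    torch.randint(0, 30522, (2, 128))
)
print(u.shape)  # torch.Size([2, 128, 768])
```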

Is any sort of attention mask required?

Nope!

M2 comparisons on the same data

We found that C4 was a more reliable training source - we have head-to-head comparisons, trained from scratch on Wiki and Books with an older version of the architecture (no gating, and minus the residual conv), in Appendix B.9 of the paper. We match BERT pretrained with that recipe, but we found it wasn't as good for downstream fine-tuning.

Really interested in diving more into these questions - and I also suspect that the optimal training recipe for an M2-style model will be pretty different from the Transformer one (where the recipe has been fine-tuned for ~6 years now).

Thanks so much for the response. I am planning on building the model and a custom PyTorch training loop for my own data; I work on biological sequences and we are always length-limited with traditional attention. If I understand it correctly, the create_bert_mlm function returns a fully functional HuggingFace-wrapped M2 mixer BERT?

Additionally, I have been messing around on a machine without a GPU and got an ImportError telling me about requirements-cpu.txt, but I cannot find that file anywhere:
ImportError: Please make sure to pip install -r requirements-cpu.txt to get the requirements for the BERT benchmark.

Thanks for pointing my attention towards Appendix B.9! That is some really compelling data!

May I ask why the choice of 30% MLM? This is so interesting.

I work on biological sequences and we are always length-limited with traditional attention.

Biological sequences are super interesting to us! Feel free to reach out privately if you want to discuss more, my email's on my website :)

If I understand it correctly, the create_bert_mlm function returns a fully functional HuggingFace-wrapped M2 mixer BERT?

I would call it HuggingFace-esque... it has a similar interface to the HuggingFace BERT MLM models, but we haven't implemented all the HuggingFace interfaces (just the equivalent of BertForSequenceClassification for GLUE).
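Since you mentioned writing a custom PyTorch training loop: here's a minimal sketch of what that could look like around an MLM-style model. ToyMaskedLM below is a stand-in placeholder, and the assumption that the real module maps input_ids to per-token vocabulary logits (like BertForMaskedLM) should be checked against create_bert_mlm's actual forward signature in the repo.

```python
import torch
import torch.nn as nn

class ToyMaskedLM(nn.Module):
    """Stand-in placeholder for the module returned by create_bert_mlm."""

    def __init__(self, vocab_size: int = 30522, d_model: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.mixer = nn.Linear(d_model, d_model)  # stand-in for the M2 mixer blocks
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) -> (batch, seq_len, vocab_size) logits
        return self.head(torch.relu(self.mixer(self.embed(input_ids))))

model = ToyMaskedLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 = position not scored

for step in range(10):  # replace with a real DataLoader over your sequences
    input_ids = torch.randint(0, 30522, (8, 128))  # fake batch of token ids
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < 0.30      # 30% MLM masking rate
    labels[~mask] = -100                           # only score masked positions
    input_ids = input_ids.masked_fill(mask, 103)   # 103 = [MASK] in the BERT vocab

    logits = model(input_ids)
    loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(step, loss.item())
```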

I have been messing around on a machine without a GPU

For CPU, I've had some success using these Docker images and point-installing individual packages (basically, try running something, and if it complains, install that individual package).

30% MLM

We found that this just makes it learn a bit faster in terms of steps you need (Mosaic found something similar).
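If you end up using HuggingFace's data collator in your own loop, the masking rate is a single argument. A minimal sketch, assuming a standard BERT WordPiece tokenizer (swap in whatever vocabulary you use for your biological sequences):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Standard BERT tokenizer used purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The only change vs. the usual 15% BERT recipe is mlm_probability.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)

example = dict(tokenizer("the quick brown fox", truncation=True))
batch = collator([example])
print(batch["input_ids"])  # some tokens replaced by [MASK]
print(batch["labels"])     # -100 marks unmasked (unscored) positions
```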

Awesome, I will reach out separately to chat more :) Thanks again!