stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.

Speed up pre-training

yandachen opened this issue

Hello, I'm working on a project that involves pre-training GPT-2 Medium. Using your code (DeepSpeed + bf16 + FlashAttention), it currently takes around 15 days to pre-train for the full 400K steps on 4 A100 GPUs. Do you have any suggestions on possible approaches to speed up pre-training further, e.g., by 2x?
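For concreteness, here is a rough sketch of the kind of run I mean, written against the plain Hugging Face Trainer. This is not my exact setup: the output directory, DeepSpeed config path, batch sizes, and warmup below are placeholders, and Mistral's FlashAttention integration is repo-specific, so it's omitted here.

```python
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Build GPT-2 Medium from its config so it trains from scratch.
config = AutoConfig.from_pretrained("gpt2-medium")
model = AutoModelForCausalLM.from_config(config)
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")

args = TrainingArguments(
    output_dir="gpt2-medium-pretrain",    # placeholder output directory
    max_steps=400_000,                    # full step budget from this issue
    bf16=True,                            # bfloat16 mixed precision
    deepspeed="ds_config.json",           # placeholder DeepSpeed config path
    per_device_train_batch_size=8,        # placeholder per-GPU batch size
    gradient_accumulation_steps=16,       # placeholder, to reach the global batch
    learning_rate=1.5e-4,                 # GPT-2 Medium rate discussed below
    lr_scheduler_type="cosine",
    warmup_steps=2000,                    # placeholder warmup
    logging_steps=100,
    save_steps=10_000,
)

# trainer = Trainer(model=model, args=args, train_dataset=..., tokenizer=tokenizer)
# trainer.train()  # launched with deepspeed/torchrun across the 4 A100s
```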

One possible solution I'm thinking of is to increase the learning rate. It looks like GPT-2 Medium uses a learning rate of 1.5e-4. Did you guys experiment with a larger learning rate? Was the model able to converge faster during pre-training without the final perplexity degrading too much?
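Concretely, I imagine this would just be a change to the learning-rate knobs in the sketch above; the values below are placeholder guesses I haven't validated, which is exactly what I'm asking about:

```python
from transformers import TrainingArguments

# Same run as the sketch above, with only the learning-rate settings changed.
args = TrainingArguments(
    output_dir="gpt2-medium-pretrain-lr3e-4",  # placeholder output directory
    max_steps=400_000,
    bf16=True,
    deepspeed="ds_config.json",                # placeholder DeepSpeed config path
    learning_rate=3e-4,                        # 2x the 1.5e-4 GPT-2 Medium rate
    warmup_steps=4000,                         # guess: longer warmup for the larger rate
    lr_scheduler_type="cosine",
)
```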

Any suggestions would be greatly appreciated!