Speed up pre-training
yandachen opened this issue · comments
Hello, I'm working on a project that involves pre-training GPT-2 Medium. With your current code (DeepSpeed + bf16 + FlashAttention), pre-training for the full 400K steps on 4 A100 GPUs takes around 15 days. Do you have any suggestions on possible approaches to further speed up pre-training, e.g., by 2x?
One possible solution I'm thinking of is to increase the learning rate. It looks like GPT-2 Medium uses a learning rate of 1.5e-4. Did you experiment with a larger learning rate? Was the model able to converge faster during pre-training without losing too much perplexity?
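To make the idea concrete, here is a minimal sketch of what I have in mind: doubling the base learning rate (1.5e-4 to 3e-4) while pairing it with a warmup and decay schedule, written in plain PyTorch. The model, warmup length, and schedule shape here are all assumptions for illustration, not your repo's actual config.

```python
import math
import torch

# Stand-in module for illustration; the real setup would use GPT-2 Medium.
model = torch.nn.Linear(8, 8)

base_lr = 3e-4        # assumed 2x the reported 1.5e-4
warmup_steps = 4_000  # hypothetical warmup length
total_steps = 400_000 # full pre-training run from the issue

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.01)

def lr_lambda(step: int) -> float:
    """Linear warmup to base_lr, then cosine decay to ~10% of base_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

The intuition is that a larger peak LR usually needs a longer warmup to stay stable at this model scale, so the two would probably have to be tuned together.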
Any suggestions would be much appreciated!