SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022

Details of Training

achen46 opened this issue

Hi @alihassanijr, thanks for the great repository. For reproducing your results, how many nodes were used to train these models? I see that config files are provided for each model, but I wonder if any changes are needed when training on multiple nodes.

Hi and thank you for your interest.
We tried both single-node and multi-node settings and did not notice any significant difference between the two.
If you want to train on multiple nodes, you'd have to divide the per-GPU batch size so that the global batch size stays consistent.
All of the models we've released were trained with a global batch size of 1024, which works out to 128 samples per GPU on a single 8-GPU node, hence the 128 in the config files. If you increase the number of GPUs, you'd have to decrease the per-GPU batch size accordingly (e.g. 2 nodes with 16 GPUs total -> batch size 64 per GPU).
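For concreteness, here is a minimal sketch of that batch-size arithmetic. The global batch size of 1024 and the per-GPU value of 128 come from the reply above; the helper name `per_gpu_batch_size` and the mention of `torch.distributed.get_world_size()` are illustrative assumptions, not code from this repository.

```python
GLOBAL_BATCH_SIZE = 1024  # total batch size used for all released models (per the reply above)

def per_gpu_batch_size(world_size: int, global_batch: int = GLOBAL_BATCH_SIZE) -> int:
    """Per-GPU batch size that keeps the global batch size fixed.

    world_size is the total number of GPUs across all nodes; in a
    PyTorch DDP job this is torch.distributed.get_world_size().
    """
    assert global_batch % world_size == 0, "global batch size must divide evenly across GPUs"
    return global_batch // world_size

print(per_gpu_batch_size(8))   # 1 node  x 8 GPUs -> 128 (the value in the released configs)
print(per_gpu_batch_size(16))  # 2 nodes x 8 GPUs -> 64
```

In other words, the batch size in each config file is the per-GPU value, so it is the number to adjust when the total GPU count changes.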

I hope this clarifies things.

Thanks for the clarification. I will try to reproduce your numbers. Great work!