stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.

Node arguments not parsed properly for torch.distributed.launch

YianZhang opened this issue

Describe the bug
I was running the command below, but according to the log, gradient_accumulation_steps was set to 8 and training was slower than expected. In the config, effective_bsz is set to 512.

To Reproduce
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train.py --config conf/dense.yaml --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 64 --run_id train_dense_multiprocess --training_arguments.per_device_eval_batch_size 64

Expected behavior
gradient_accumulation_steps should equal 512 / (64 * 8) = 1.
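
For context, the expected value comes from dividing the effective batch size by the per-step batch (per-device batch size times number of processes). Below is a minimal sketch of that arithmetic; the helper name is hypothetical rather than Mistral's actual code, and it also shows how a world size of 1 would produce the 8 seen in the log:

```python
def derive_grad_accum_steps(effective_bsz: int, per_device_bsz: int, world_size: int) -> int:
    """Hypothetical helper: gradient_accumulation_steps needed so that
    per_device_bsz * world_size * steps == effective_bsz."""
    per_step_bsz = per_device_bsz * world_size
    assert effective_bsz % per_step_bsz == 0, "effective_bsz must be divisible by the per-step batch"
    return effective_bsz // per_step_bsz

# Values from the config and command above:
print(derive_grad_accum_steps(512, 64, 8))  # -> 1 (expected)
# If the parsed world size collapses to a single process, we instead get:
print(derive_grad_accum_steps(512, 64, 1))  # -> 8 (what the log shows)
```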

Additional context
One possible cause is that the launcher arguments are not parsed correctly. I printed the values of quinfig.nproc_per_node and quinfig.nnodes in train.py and both were wrong. As a workaround I hard-coded quinfig.nproc_per_node = 8 and quinfig.nnodes = 1 in train.py, which seemed to work, but I am not sure this is the only cause of the bug or that the workaround fully resolves it.
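
A possibly more robust workaround than hard-coding the values would be to fall back on the WORLD_SIZE environment variable that torch.distributed.launch sets for each worker process. The sketch below only illustrates that idea; the SimpleNamespace stand-in and the fallback condition are assumptions, not the project's actual code:

```python
import os
from types import SimpleNamespace

# Stand-in for the parsed quinfig inside train.py (with the wrong values I observed).
quinfig = SimpleNamespace(nnodes=-1, nproc_per_node=-1)

# torch.distributed.launch exports WORLD_SIZE (= nnodes * nproc_per_node) to every
# worker process it spawns, so the total process count can be recovered from the
# environment even when the --nnodes/--nproc_per_node flags never reach train.py.
env_world_size = int(os.environ.get("WORLD_SIZE", "1"))

# Assumed fallback: override the parsed values only when they disagree with the launcher.
if quinfig.nnodes * quinfig.nproc_per_node != env_world_size:
    quinfig.nnodes = 1                       # single-node run, matching the command above
    quinfig.nproc_per_node = env_world_size
```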

I think we need to formally deprecate torch.distributed.launch?
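
For reference, the equivalent single-node invocation with torchrun (the launcher PyTorch now recommends over torch.distributed.launch) would look roughly like the line below; note that torchrun passes the local rank via the LOCAL_RANK environment variable rather than a --local_rank argument, so train.py may need a small adjustment:

torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 train.py --config conf/dense.yaml --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 64 --run_id train_dense_multiprocess --training_arguments.per_device_eval_batch_size 64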