Node arguments not parsed properly for torch.distributed.launch
YianZhang opened this issue
Describe the bug
I ran the command below, but according to the log, gradient_accumulation_steps was set to 8 and training was slower than expected. In the config, effective_bsz is set to 512.
To Reproduce
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 \
    train.py --config conf/dense.yaml \
    --training_arguments.fp16 true \
    --training_arguments.per_device_train_batch_size 64 \
    --run_id train_dense_multiprocess \
    --training_arguments.per_device_eval_batch_size 64
Expected behavior
gradient_accumulation_steps should equal 512 / (64 * 8) = 1.
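For reference, this is how the expected value works out. The arithmetic below is a hypothetical reconstruction of the derivation (the actual logic lives in train.py and may differ):

```python
# Values from conf/dense.yaml and the CLI overrides above.
effective_bsz = 512                 # target global batch size (from the config)
per_device_train_batch_size = 64    # from the CLI override
nproc_per_node = 8                  # processes per node (from the launcher)
nnodes = 1                          # number of nodes (from the launcher)

# Total number of worker processes across all nodes.
world_size = nproc_per_node * nnodes

# Accumulation steps needed to reach the effective batch size.
gradient_accumulation_steps = effective_bsz // (per_device_train_batch_size * world_size)
print(gradient_accumulation_steps)  # → 1
```

A value of 8 instead of 1 is exactly what you would get if world_size were mistakenly taken to be 1, i.e. if nproc_per_node/nnodes were not picked up from the launcher.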
Additional context
One possible cause is that the launcher arguments were not parsed correctly. I printed the values of quinfig.nproc_per_node and quinfig.nnodes in train.py, and both were incorrect. As a workaround, I explicitly set quinfig.nproc_per_node = 8 and quinfig.nnodes = 1 in train.py, which seemed to work. I am not sure this is the only cause of the bug or whether the workaround fully resolves it.
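A more robust workaround than hard-coding the values might be to read the environment variables that torch.distributed.launch exports to each worker (WORLD_SIZE, RANK, LOCAL_RANK) instead of trusting the parsed --nproc_per_node/--nnodes flags. A minimal sketch (the helper names are my own, not part of the codebase):

```python
import os

def launcher_world_size(env=None):
    """World size as exported by torch.distributed.launch via WORLD_SIZE,
    falling back to 1 for single-process runs."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1"))

def launcher_local_rank(env=None):
    """Local rank as exported by torch.distributed.launch via LOCAL_RANK."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0"))
```

train.py could then derive gradient_accumulation_steps from launcher_world_size() and ignore quinfig.nproc_per_node/quinfig.nnodes entirely.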
Should we formally deprecate torch.distributed.launch?
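For comparison, torchrun is PyTorch's documented successor to torch.distributed.launch (available since PyTorch 1.10). The equivalent invocation would look roughly like this, though note that torchrun passes the local rank via the LOCAL_RANK environment variable rather than a --local_rank argument, so the script may need small changes:

```shell
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
    train.py --config conf/dense.yaml \
    --training_arguments.fp16 true \
    --training_arguments.per_device_train_batch_size 64 \
    --run_id train_dense_multiprocess \
    --training_arguments.per_device_eval_batch_size 64
```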