stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.

Node arguments not parsed properly for torch.distributed.launch

YianZhang opened this issue

Describe the bug
I was running the command below, but according to the log, gradient_accumulation_steps was set to 8 and training was slower than expected. In the config, effective_bsz is set to 512.

To Reproduce
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train.py --config conf/dense.yaml --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 64 --run_id train_dense_multiprocess --training_arguments.per_device_eval_batch_size 64

Expected behavior
gradient_accumulation_steps should equal 512 / (64 * 8) = 1.
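
For context, the expected value comes from dividing the effective batch size by the per-step batch (per-device batch size times number of processes). Below is a minimal sketch of that arithmetic; the helper name is hypothetical rather than Mistral's actual code, and it also shows how a world size of 1 would produce the 8 seen in the log:

```python
def derive_grad_accum_steps(effective_bsz: int, per_device_bsz: int, world_size: int) -> int:
    """Hypothetical helper: gradient_accumulation_steps needed so that
    per_device_bsz * world_size * steps == effective_bsz."""
    per_step_bsz = per_device_bsz * world_size
    assert effective_bsz % per_step_bsz == 0, "effective_bsz must be divisible by the per-step batch"
    return effective_bsz // per_step_bsz

# Values from the config and command above:
print(derive_grad_accum_steps(512, 64, 8))  # -> 1 (expected)
# If the parsed world size collapses to a single process, we instead get:
print(derive_grad_accum_steps(512, 64, 1))  # -> 8 (what the log shows)
```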

Additional context
One possible cause is that the launcher arguments are not parsed correctly. I printed the values of quinfig.nproc_per_node and quinfig.nnodes in train.py and both were wrong. As a workaround I hard-coded quinfig.nproc_per_node = 8 and quinfig.nnodes = 1 in train.py, which seemed to work, but I am not sure this is the only cause of the bug or that the workaround fully resolves it.
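
A possibly more robust workaround than hard-coding the values would be to fall back on the WORLD_SIZE environment variable that torch.distributed.launch sets for each worker process. The sketch below only illustrates that idea; the SimpleNamespace stand-in and the fallback condition are assumptions, not the project's actual code:

```python
import os
from types import SimpleNamespace

# Stand-in for the parsed quinfig inside train.py (with the wrong values I observed).
quinfig = SimpleNamespace(nnodes=-1, nproc_per_node=-1)

# torch.distributed.launch exports WORLD_SIZE (= nnodes * nproc_per_node) to every
# worker process it spawns, so the total process count can be recovered from the
# environment even when the --nnodes/--nproc_per_node flags never reach train.py.
env_world_size = int(os.environ.get("WORLD_SIZE", "1"))

# Assumed fallback: override the parsed values only when they disagree with the launcher.
if quinfig.nnodes * quinfig.nproc_per_node != env_world_size:
    quinfig.nnodes = 1                       # single-node run, matching the command above
    quinfig.nproc_per_node = env_world_size
```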

I think we need to formally deprecate torch.distributed.launch?
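
For reference, the equivalent single-node invocation with torchrun (the launcher PyTorch now recommends over torch.distributed.launch) would look roughly like the line below; note that torchrun passes the local rank via the LOCAL_RANK environment variable rather than a --local_rank argument, so train.py may need a small adjustment:

torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 train.py --config conf/dense.yaml --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 64 --run_id train_dense_multiprocess --training_arguments.per_device_eval_batch_size 64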