open-mmlab / mmflow

OpenMMLab optical flow toolbox and benchmark

Home Page: https://mmflow.readthedocs.io/en/latest/

Multiple nodes slurm training

Salvatore-tech opened this issue · comments

Good morning,
I have read the documentation about finetuning and I'd like to launch train.py to finetune PWC-Net on my dataset, loading a checkpoint file.
I have created a dataset config under /configs/_base_/datasets and used it in the config script under /configs/pwcnet, roughly as in the sketch below.
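A rough sketch of what I mean, assuming the standard OpenMMLab config layout (all file names and paths below are placeholders for my own files):

```python
# configs/pwcnet/pwcnet_ft_my_dataset.py -- hypothetical file name
_base_ = [
    '../_base_/models/pwcnet.py',
    '../_base_/datasets/my_dataset.py',   # the dataset config I created (placeholder)
    '../_base_/default_runtime.py',
]

# optimizer and schedule settings copied from the KITTI finetuning config go here

# start finetuning from a released checkpoint instead of random initialization
load_from = 'checkpoints/pwcnet.pth'      # placeholder path
```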

I'd like to reduce the training time by using all the resources available on a small cluster: 4 nodes, each with 4 NVIDIA Tesla V100 32GB SXM2 GPUs (NVLink).
Do you see any room for improvement with the following command?
srun -p xgpu --job-name=pwc_kitti --gres=gpu:4 --ntasks=16 --ntasks-per-node=4 --cpus-per-task=2 --kill-on-bad-exit=1 python -u tools/train.py $MMFLOW/configs/pwcnet/pwcnet_ft_4x1_300k_kitti_320x896.py --work-dir=$MMFLOW/work_dir/pwckitti --launcher=slurm

It estimates about one day to complete the finetuning. Do you think I'm using all 4 nodes correctly?
If so, can I reduce the number of training iterations so it takes less time?
Thanks in advance!

Your command is correct and will make use of all your resources (16 processes, one per GPU, across the 4 nodes).

If you want to reduce the training time, you can reduce the number of iterations. The config pwcnet_ft_4x1_300k_kitti_320x896.py follows the settings of the original paper: the "4x1" in the name means 4 GPUs with 1 sample per GPU, i.e. a total batch size of 4.
With your command, the total batch size increases to 16, so there is no need to train for as many iterations as in the original paper; see the sketch below.
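For instance, with a 4x larger batch size you could try roughly a quarter of the original 300k iterations. A minimal sketch of such an override config (75k is only an illustrative value, not a tuned number):

```python
# Hypothetical override config inheriting the KITTI finetuning settings.
_base_ = ['./pwcnet_ft_4x1_300k_kitti_320x896.py']

# Train for ~300k / 4 iterations to compensate for the 4x larger batch size.
runner = dict(type='IterBasedRunner', max_iters=75000)

# Keep checkpointing / evaluation intervals proportional to the shorter schedule.
checkpoint_config = dict(by_epoch=False, interval=25000)
evaluation = dict(interval=25000, metric='EPE')

# Note: if the base lr schedule uses fixed step milestones, those should be
# scaled down by the same factor as well.
```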

Besides, in my experience with SGD, when the batch size is increased by a factor of 4, the learning rate should also be increased by a factor of 4 (the linear scaling rule), which helps speed up convergence. However, the optical flow task uses Adam as the optimizer, so I'm not sure whether this strategy still works well; you can give it a try, for example as sketched below.
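If you do try it, the override can be as small as this (the base learning rate below is only illustrative; take the real value from pwcnet_ft_4x1_300k_kitti_320x896.py and multiply it by 4):

```python
# Hypothetical lr-scaling override; mmcv's config inheritance merges this
# dict with the base optimizer, so only the changed key needs to be listed.
_base_ = ['./pwcnet_ft_4x1_300k_kitti_320x896.py']

base_lr = 1e-5                    # illustrative only -- read it from the base config
optimizer = dict(lr=4 * base_lr)  # linear scaling: 4x batch size -> 4x lr
```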

Thanks @Zachary-66 for your answer, I'm closing the issue.