Multi-GPU training inside docker

Question

williamhyin opened this issue 3 years ago · comments

Hi,

Thanks for releasing the code.
I have a question about the multi-GPU training command.
Is it possible to train with multiple GPUs (8) inside Docker?

For example: python -m torch.distributed.launch --nproc_per_node 8 train.py xxx

Launching multi-GPU training from outside Docker with the following command is inconvenient for training on a server:

make docker-run-mpi COMMAND=""

I look forward to your reply.
Thanks again for your great work!

Dennis Park · Answer 1 · Fri Sep 24 2021 01:43:53 GMT+0800 (China Standard Time)

Thanks for the interest, @williamhyin. By default, we only support multi-GPU training via make docker-run-mpi ... . It should be possible to modify train.py to make it work with the PyTorch launcher. We will have a look at this if there are enough use cases.
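For readers wondering what such a modification might involve, here is a minimal sketch of one piece of it: reading the process rank from whichever launcher started the script. It is not taken from the repository's train.py. The function name detect_dist_env is hypothetical, and it assumes Open MPI (which exports the OMPI_COMM_WORLD_* variables) on the MPI side. Older versions of torch.distributed.launch may pass the local rank as a --local_rank argument instead of an environment variable, which this sketch does not handle.

```python
import os


def detect_dist_env(environ=None):
    """Return (rank, local_rank, world_size) from launcher-provided variables.

    Hypothetical helper for illustration; not part of the repository.
    """
    env = os.environ if environ is None else environ

    # The PyTorch launcher (torchrun, or torch.distributed.launch with
    # environment-variable mode) exports RANK, LOCAL_RANK and WORLD_SIZE.
    if "RANK" in env and "WORLD_SIZE" in env:
        return (
            int(env["RANK"]),
            int(env.get("LOCAL_RANK", 0)),
            int(env["WORLD_SIZE"]),
        )

    # Open MPI's mpirun exports these for every process it starts.
    if "OMPI_COMM_WORLD_RANK" in env:
        return (
            int(env["OMPI_COMM_WORLD_RANK"]),
            int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
            int(env["OMPI_COMM_WORLD_SIZE"]),
        )

    # No launcher detected: assume a single process.
    return 0, 0, 1
```

How the returned values are wired into the training code (device selection, process-group initialization) depends on train.py itself and is left out here.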

HeroyiuWFY · Answer 2 · Thu Oct 21 2021 16:13:28 GMT+0800 (China Standard Time)

Hi, I would also like to know how to train with multiple GPUs using python -m torch.distributed.launch --nproc_per_node 8 train.py xxx
Looking forward to your reply, and thanks again for your great work!

YiNanChen · Answer 3 · Fri Oct 29 2021 11:10:45 GMT+0800 (China Standard Time)

Hi, guys! It's easy to train with multiple GPUs without Docker. After installing all the requirements, just run the command CUDA_VISIBLE_DEVICES="x,x,x,x" mpirun -np ${num_gpus} ./script/train.py +experiments=dd3d_kitti_dla34.yaml to start multi-GPU training.
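In the command above, the number passed to -np should match the number of devices listed in CUDA_VISIBLE_DEVICES. A minimal sketch of keeping the two in sync from Python follows. The helper name build_mpirun_command is hypothetical, and the script path and override are simply the ones quoted in this thread. The sketch only builds the command; it does not run it.

```python
import os
import shlex


def build_mpirun_command(gpu_ids, script, overrides):
    """Build an mpirun command line and environment for the given GPUs.

    Hypothetical helper for illustration: one MPI process per listed GPU.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")

    # Copy the current environment and restrict the visible devices.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)

    # -np is derived from the device list, so the two cannot drift apart.
    cmd = ["mpirun", "-np", str(len(gpu_ids)), script, *overrides]
    return cmd, env


cmd, env = build_mpirun_command(
    [5, 7], "./script/train.py", ["+experiments=dd3d_kitti_dla34.yaml"]
)
# Print the command for inspection; pass cmd and env to subprocess.run to launch it.
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(cmd))
```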

xiaoquan wang · Answer 4 · Wed Dec 29 2021 22:23:30 GMT+0800 (China Standard Time)

Hi, @revisitq
I got the following error:

mpirun was unable to launch the specified application as it could not access
or execute an executable:
Executable: ./script/train.py
Node: shaxbw06
while attempting to start process rank 0.

The command line is

CUDA_VISIBLE_DEVICES=5,7 mpirun -np 2 ./script/train.py +experiments=dd3d_kitti_dla34.yaml

YiNanChen · Answer 5 · Thu Dec 30 2021 10:46:01 GMT+0800 (China Standard Time)

Make sure you install the dependencies as listed in the Dockerfile.
Check your command; it should be CUDA_VISIBLE_DEVICES="5,7" mpirun -np 2 ./script/train.py +experiments=dd3d_kitti_dla34.yaml
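The mpirun message says the executable could not be accessed or executed. That usually means either the path does not exist relative to the current directory, or the file lacks the execute permission. A minimal sketch for telling these apart before launching follows. The helper name diagnose_executable is hypothetical, and it only checks the local filesystem. Launching the script through the python interpreter instead of directly removes the need for the execute bit, though the path must still be correct.

```python
import os


def diagnose_executable(path):
    """Classify why a launcher might fail to start `path`.

    Hypothetical helper for illustration; returns a short status string.
    """
    if not os.path.exists(path):
        return "missing"          # wrong path or wrong working directory
    if not os.path.isfile(path):
        return "not a file"       # e.g. a directory with that name
    if not os.access(path, os.X_OK):
        return "not executable"   # fix with chmod +x, or launch via python
    return "ok"
```

For example, calling diagnose_executable("./script/train.py") from the directory where mpirun is run shows whether the path itself is the problem.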

J L · Answer 6 · Wed May 18 2022 00:03:14 GMT+0800 (China Standard Time)

@williamhyin you can build a conda env yourself and run with mpirun -n 8 python scripts/train.py +experiments=dd3d_kitti_dla34