Single machine multi-GPU training

Question

AlexNmSED opened this issue 2 years ago

When I use 4 GPUs on a single machine, I get this error:
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.16.173.129]:23211

Can someone help me?

Thank you.

andrew · Answer 1 · Mon Mar 20 2023 17:53:57 GMT+0800 (China Standard Time)

Try this:
python -m torch.distributed.launch --nproc_per_node=4 main_pretrain.py
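
For that command to help, the script also has to pick up the rendezvous settings that the launcher exports to each of the 4 processes. Here is a minimal sketch of that step, assuming main_pretrain.py initializes with env:// and that the NCCL backend is wanted for GPU training. The names LaunchEnv, read_launch_env and init_distributed are hypothetical and do not come from the repository. Depending on the PyTorch version, the launcher may pass the local rank as a --local_rank argument instead of the LOCAL_RANK variable; this sketch only handles the variable.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class LaunchEnv:
    """Settings the launcher is expected to export (hypothetical container)."""
    rank: int
    world_size: int
    local_rank: int
    master_addr: str
    master_port: int


def read_launch_env(env: Optional[Mapping[str, str]] = None) -> LaunchEnv:
    """Read the launcher variables, falling back to single-process defaults."""
    env = os.environ if env is None else env
    return LaunchEnv(
        rank=int(env.get("RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )


def init_distributed(cfg: LaunchEnv) -> None:
    """Bind this process to one GPU and join the process group.

    Must run in a process started by the launcher: env:// reads RANK,
    WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment.
    """
    # Imported lazily so the helper above works without PyTorch installed.
    import torch
    import torch.distributed as dist

    # One process per GPU: select the device before creating the group.
    torch.cuda.set_device(cfg.local_rank)
    # backend="nccl" is an assumption; the script may use another backend.
    dist.init_process_group(backend="nccl", init_method="env://")
```

Printing read_launch_env() at the top of the script in every process is a cheap way to check that all 4 ranks see the same MASTER_ADDR and MASTER_PORT and a distinct RANK.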

AlexNmSED · Answer 2 · Mon Mar 20 2023 21:09:53 GMT+0800 (China Standard Time)

Thank you, but that is what I am already doing.