Single machine multi-GPU training
AlexNmSED opened this issue · comments
AlexNmSED commented
When I use 4 GPUs on a single machine, I get this error:
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.16.173.129]:23211
Can someone help me?
Thank you.
andrew commented
Try this:
python -m torch.distributed.launch --nproc_per_node=4 main_pretrain.py
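For reference, here is a minimal sketch of what a script started this way usually does on its side. It assumes the launcher exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK as environment variables. The helper names read_launch_env and init_distributed are hypothetical, the "nccl" backend is an assumption, and none of this is taken from main_pretrain.py.

```python
import os


def read_launch_env(environ=os.environ):
    """Collect the rendezvous settings the launcher is assumed to export.

    The fallback values describe a single process on the local machine.
    """
    return {
        "master_addr": environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(environ.get("MASTER_PORT", "29500")),
        "rank": int(environ.get("RANK", "0")),
        "world_size": int(environ.get("WORLD_SIZE", "1")),
        "local_rank": int(environ.get("LOCAL_RANK", "0")),
    }


def init_distributed(backend="nccl"):
    """Hypothetical setup helper: bind this process to one GPU and join the group."""
    # Imported here so read_launch_env stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    cfg = read_launch_env()
    # One process per GPU: select the device that matches the local rank.
    torch.cuda.set_device(cfg["local_rank"])
    # "env://" makes PyTorch read MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
    dist.init_process_group(backend=backend, init_method="env://")
    return cfg
```

Printing read_launch_env() at the top of each process shows which address and port the ranks are trying to rendezvous on, which can help when comparing against the address in the error message.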
AlexNmSED commented
Thank you, but that is what I am already doing.