Cannot launch DDP training using distributed/ddp-tutorial-series/multigpu.py
480284856 opened this issue
顾小杰 commented
I'm on the main branch, checked out at the latest commit (c67bbab).
My launch command is:
python multigpu.py 30 5
The error is below:
[W socket.cpp:663] [c10d] The client socket has failed to connect to [localhost]:12355 (errno: 99 - Cannot assign requested address).
Traceback (most recent call last):
File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 104, in <module>
mp.spawn(main, args=(world_size, args.save_every, args.total_epochs, args.batch_size), nprocs=world_size)
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 90, in main
trainer = Trainer(model, train_data, optimizer, rank, save_every)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 38, in __init__
self.model = DDP(model, device_ids=[gpu_id])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1695392026823/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
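For context, a minimal sketch of the DDP setup pattern the tutorial's multigpu.py follows is below. This is not the tutorial's exact code: the gloo/CPU fallback is an assumption I added so it runs without GPUs, and the explicit torch.cuda.set_device(rank) call before process-group init is a commonly suggested guard against exactly this kind of NCCL "invalid argument" CUDA error (processes ending up on the wrong device), not a confirmed fix for this issue.

```python
# Hedged sketch of a DDP setup similar in shape to the tutorial's multigpu.py.
# Falls back to the "gloo" backend on CPU so it is runnable without GPUs.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def ddp_setup(rank: int, world_size: int) -> None:
    # "localhost" may fail to resolve in slim containers (cf. the errno 99
    # socket warning above); 127.0.0.1 sidesteps name resolution entirely.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "12355"
    if torch.cuda.is_available():
        # Pin this process to its GPU *before* init_process_group, so NCCL
        # never sees two ranks on the same device.
        torch.cuda.set_device(rank)
        backend = "nccl"
    else:
        backend = "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)


def main(rank: int, world_size: int) -> None:
    ddp_setup(rank, world_size)
    dist.barrier()  # all ranks rendezvous here
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(main, args=(world_size,), nprocs=world_size)
```

If this sketch runs cleanly in the same container while multigpu.py does not, the problem is likely in the device assignment rather than the rendezvous.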
My environment:
python 3.11.5 h955ad1f_0
pytorch 2.1.0 py3.11_cuda11.8_cudnn8.7.0_0 pytorch
Container launch command:
docker run -itd \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
--shm-size '32g' \
--volume ${PWD}:/workspace \
--workdir /workspace \
--name pytorch_examples \
nvidia/cuda:11.8.0-devel-ubuntu22.04
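Following the hint in the error message itself, a first diagnostic step (a sketch, assuming the same launch command as above) would be to re-run with NCCL's debug logging enabled and to confirm the GPUs are actually visible inside the container:

```shell
# Re-run with NCCL debug logging to see which CUDA call fails
NCCL_DEBUG=INFO python multigpu.py 30 5

# Confirm all GPUs are visible inside the container
nvidia-smi -L
```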
Thanks in advance :)