Cannot launch DDP training using distributed/ddp-tutorial-series/multigpu.py
480284856 opened this issue
顾小杰 commented
I'm on the main branch, checked out at the latest commit (c67bbab).
My launch command is:
python multigpu.py 30 5
The error is below:
[W socket.cpp:663] [c10d] The client socket has failed to connect to [localhost]:12355 (errno: 99 - Cannot assign requested address).
Traceback (most recent call last):
File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 104, in <module>
mp.spawn(main, args=(world_size, args.save_every, args.total_epochs, args.batch_size), nprocs=world_size)
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 90, in main
trainer = Trainer(model, train_data, optimizer, rank, save_every)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/distributed/ddp-tutorial-series/multigpu.py", line 38, in __init__
self.model = DDP(model, device_ids=[gpu_id])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1695392026823/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
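For context, a minimal sketch of the DDP setup pattern the tutorial's multigpu.py follows is below. This is not the tutorial's exact code: the gloo/CPU fallback is an assumption I added so it runs without GPUs, and the explicit torch.cuda.set_device(rank) call before process-group init is a commonly suggested guard against exactly this kind of NCCL "invalid argument" CUDA error (processes ending up on the wrong device), not a confirmed fix for this issue.

```python
# Hedged sketch of a DDP setup similar in shape to the tutorial's multigpu.py.
# Falls back to the "gloo" backend on CPU so it is runnable without GPUs.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def ddp_setup(rank: int, world_size: int) -> None:
    # "localhost" may fail to resolve in slim containers (cf. the errno 99
    # socket warning above); 127.0.0.1 sidesteps name resolution entirely.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "12355"
    if torch.cuda.is_available():
        # Pin this process to its GPU *before* init_process_group, so NCCL
        # never sees two ranks on the same device.
        torch.cuda.set_device(rank)
        backend = "nccl"
    else:
        backend = "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)


def main(rank: int, world_size: int) -> None:
    ddp_setup(rank, world_size)
    dist.barrier()  # all ranks rendezvous here
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(main, args=(world_size,), nprocs=world_size)
```

If this sketch runs cleanly in the same container while multigpu.py does not, the problem is likely in the device assignment rather than the rendezvous.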
My environment:
python 3.11.5 h955ad1f_0
pytorch 2.1.0 py3.11_cuda11.8_cudnn8.7.0_0 pytorch
Container launch command:
docker run -itd \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
--shm-size '32g' \
--volume ${PWD}:/workspace \
--workdir /workspace \
--name pytorch_examples \
nvidia/cuda:11.8.0-devel-ubuntu22.04
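Following the hint in the error message itself, a first diagnostic step (a sketch, assuming the same launch command as above) would be to re-run with NCCL's debug logging enabled and to confirm the GPUs are actually visible inside the container:

```shell
# Re-run with NCCL debug logging to see which CUDA call fails
NCCL_DEBUG=INFO python multigpu.py 30 5

# Confirm all GPUs are visible inside the container
nvidia-smi -L
```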
Thanks in advance :)