pytorch / examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.

Home Page: https://pytorch.org/examples

DDP Example RuntimeError: CUDA error: unknown error

JingchengYang4 opened this issue · comments

Following the tutorial at https://github.com/pytorch/examples/tree/main/distributed/ddp, I tried to run the distributed example distributed/ddp/example.py with the command

torchrun --nnode=1 --node_rank=0 --nproc_per_node=8 example.py --local_world_size=8
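
For context, the failing call is in demo_basic. Paraphrasing from the traceback and the per-rank print output below (the exact file may differ slightly), it does roughly the following:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic(local_world_size, local_rank):
    # Split the visible GPUs evenly across the local processes and take this
    # rank's slice (my paraphrase of how device_ids is derived in the example).
    n = torch.cuda.device_count() // local_world_size
    device_ids = list(range(local_rank * n, (local_rank + 1) * n))

    print(
        f"[{os.getpid()}] rank = {dist.get_rank()}, "
        f"world_size = {dist.get_world_size()}, n = {n}, device_ids = {device_ids}"
    )

    # Line 38 in the traceback below: move the toy model to this rank's first GPU.
    model = ToyModel().cuda(device_ids[0])
    ddp_model = DDP(model, device_ids)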

I received the following error:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[4456] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '4', 'WORLD_SIZE': '8'}
[4454] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '8'}
[4452] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '8'}
[4458] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '6', 'WORLD_SIZE': '8'}
[4455] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '3', 'WORLD_SIZE': '8'}
[4459] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '7', 'WORLD_SIZE': '8'}
[4457] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '5', 'WORLD_SIZE': '8'}
[4453] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '8'}
[4453]: world_size = 8, rank = 1, backend=nccl 
[4453] rank = 1, world_size = 8, n = 1, device_ids = [0] 
[4457]: world_size = 8, rank = 5, backend=nccl 
[4457] rank = 5, world_size = 8, n = 1, device_ids = [0] 
[4454]: world_size = 8, rank = 2, backend=nccl 
[4454] rank = 2, world_size = 8, n = 1, device_ids = [0] 
[4455]: world_size = 8, rank = 3, backend=nccl 
[4455] rank = 3, world_size = 8, n = 1, device_ids = [0] 
[4456]: world_size = 8, rank = 4, backend=nccl 
[4456] rank = 4, world_size = 8, n = 1, device_ids = [0] 
[4458]: world_size = 8, rank = 6, backend=nccl 
[4458] rank = 6, world_size = 8, n = 1, device_ids = [0] 
[4459]: world_size = 8, rank = 7, backend=nccl 
[4459] rank = 7, world_size = 8, n = 1, device_ids = [0] 
[4452]: world_size = 8, rank = 0, backend=nccl 
[4452] rank = 0, world_size = 8, n = 1, device_ids = [0] 
Traceback (most recent call last):
  File "example.py", line 97, in <module>
    spmd_main(args.local_world_size, args.local_rank)
  File "example.py", line 83, in spmd_main
    demo_basic(local_world_size, local_rank)
  File "example.py", line 38, in demo_basic
    model = ToyModel().cuda(device_ids[0])
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 689, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 689, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "example.py", line 97, in <module>
    spmd_main(args.local_world_size, args.local_rank)
  File "example.py", line 83, in spmd_main
    demo_basic(local_world_size, local_rank)
  File "example.py", line 38, in demo_basic
    model = ToyModel().cuda(device_ids[0])
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 689, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 689, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "example.py", line 97, in <module>
    spmd_main(args.local_world_size, args.local_rank)
  File "example.py", line 83, in spmd_main
    demo_basic(local_world_size, local_rank)
  File "example.py", line 38, in demo_basic
    model = ToyModel().cuda(device_ids[0])
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 689, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 689, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4452 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4453 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4454 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4456 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4458 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 4455) of binary: /home/jingcheng/anaconda3/envs/Depth5/bin/python
Traceback (most recent call last):
  File "/home/jingcheng/anaconda3/envs/Depth5/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jingcheng/anaconda3/envs/Depth5/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-25_22:20:57
  host      : Jingcheng-Ubuntu2
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 4457)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-08-25_22:20:57
  host      : Jingcheng-Ubuntu2
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 4459)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-25_22:20:57
  host      : Jingcheng-Ubuntu2
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 4455)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(Depth5) jingcheng@Jingcheng-Ubuntu2:~/Downloads/examples-main/distributed/ddp$ 

I'm using PyTorch 1.12.1 with CUDA 11.6 on Ubuntu 20.
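
In case it helps with debugging, here is a minimal standalone script I can launch with the same torchrun invocation to see what each spawned process reports about the GPUs before any process group or model is created (check_cuda.py is just my own throwaway helper, not part of the example):

# check_cuda.py -- minimal per-process CUDA sanity check (my own helper script)
import os
import torch

if __name__ == "__main__":
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun
    print(
        f"[pid {os.getpid()}] local_rank={local_rank} "
        f"torch={torch.__version__} cuda={torch.version.cuda} "
        f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')} "
        f"device_count={torch.cuda.device_count()}"
    )
    # Touch every visible device; an "unknown error" or "out of memory" here
    # would show the problem is independent of the DDP example itself.
    for i in range(torch.cuda.device_count()):
        torch.zeros(1, device=f"cuda:{i}")
        print(f"[pid {os.getpid()}] cuda:{i} ok: {torch.cuda.get_device_name(i)}")

launched as: torchrun --nnode=1 --node_rank=0 --nproc_per_node=8 check_cuda.py. I can also rerun the example with CUDA_LAUNCH_BLOCKING=1 as the error message suggests.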

Does anyone know how I can resolve this issue? Thanks.