torch.distributed.elastic.multiprocessing.api

Question

MrD005 opened this issue 2 months ago · comments

Dev Goel commented 2 months ago

Traceback (most recent call last):
  File "/root/anaconda3/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/inference.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-04-16_16:40:07
host : e2e-84-47.ssdcloudindia.net
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 28813)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 28813
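
For readers decoding the report above: on POSIX systems, Python's process APIs report a child that was killed by signal N with exit code -N, which is why exit code -11 appears next to SIGSEGV. A minimal sketch of that mapping using only the standard library (the helper name describe_exitcode is hypothetical, not part of torchrun):

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Hypothetical helper: describe a child process exit code in words."""
    if exitcode < 0:
        # A negative value means the child was terminated by signal -exitcode.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-exitcode} ({name})"
    return f"exited with status {exitcode}"


# On Linux, signal 11 is SIGSEGV.
print(describe_exitcode(-11))
```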

Jiatong (Julius) Han · Answer 1 · Wed Apr 17 2024 03:22:01 GMT+0800 (China Standard Time)

It could be due to a mismatch between the installed CUDA toolkit and the CUDA version PyTorch was built with. Run nvcc --version and python -c 'import torch; print(torch.version.cuda);' to see if they match.
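
The comparison suggested here can also be scripted. The sketch below is only an illustration: the function names are made up, and it assumes the nvcc output contains a "release X.Y" fragment, as nvcc's "Cuda compilation tools" line does. It takes the two strings as arguments so it can run without a GPU or PyTorch installed.

```python
import re
from typing import Optional


def nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_versions_match(nvcc_output: str, torch_cuda: Optional[str]) -> bool:
    """Compare nvcc's release with the value of torch.version.cuda.

    torch.version.cuda is None on CPU-only builds, which never match.
    """
    release = nvcc_release(nvcc_output)
    return release is not None and release == torch_cuda


# Sample line in the format nvcc prints.
sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(cuda_versions_match(sample, "11.8"))  # True
```

In a real environment the first argument would come from subprocess.check_output(["nvcc", "--version"], text=True) and the second from torch.version.cuda.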

erichtho · Answer 2 · Wed Apr 17 2024 11:18:22 GMT+0800 (China Standard Time)

Same error, and nvcc --version matches python -c 'import torch; print(torch.version.cuda);'

erichtho · Answer 3 · Wed Apr 17 2024 15:19:38 GMT+0800 (China Standard Time)

Same error, and nvcc --version matches python -c 'import torch; print(torch.version.cuda);'

Also found with dmesg:

Dev Goel · Answer 4 · Wed Apr 17 2024 16:20:14 GMT+0800 (China Standard Time)

Same here.
nvcc --version and python -c 'import torch; print(torch.version.cuda);' return the same CUDA version:

11.8

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

erichtho · Answer 5 · Wed Apr 17 2024 19:30:10 GMT+0800 (China Standard Time)

Hi, I downgraded torch to 2.1.2 and that resolved the problem (I also changed xformers to v0.0.23.post1).
Here is how I located the problem:
1. Debugging with pdb, I found that torch.nn.Conv3d raises the segmentation fault.
2. I searched and found a known issue, which says this is caused by a oneDNN upgrade and that PyTorch 2.1.2 works. See pytorch/pytorch#120406.

Hope this helps.
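
The debugging above was done with pdb. As an alternative (not what was used in this thread), Python's built-in faulthandler module can print the Python-level stack when the interpreter receives SIGSEGV, which helps locate the crashing call without stepping through by hand. A minimal sketch, assuming a POSIX system: the helper name run_with_faulthandler is hypothetical, and the child process sends SIGSEGV to itself as a stand-in for the real crashing workload.

```python
import subprocess
import sys


def run_with_faulthandler(code: str):
    """Run `code` in a child interpreter with faulthandler enabled.

    Returns (returncode, stderr). On POSIX, a child killed by a signal
    has a negative returncode.
    """
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


# Stand-in for the crashing workload: the child sends SIGSEGV to itself.
# In the case from this thread, the child would instead run the model code
# that ends up calling torch.nn.Conv3d.
CRASH_DEMO = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

if __name__ == "__main__":
    returncode, stderr = run_with_faulthandler(CRASH_DEMO)
    print(returncode)
    print(stderr)  # faulthandler's report, including the Python stack
```

For a torchrun job, the same effect is available by setting the environment variable PYTHONFAULTHANDLER=1 before launching, so that each worker dumps its stack on a fatal signal.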

Jiatong (Julius) Han · Answer 6 · Wed Apr 17 2024 21:24:24 GMT+0800 (China Standard Time)

Thanks for sharing, @erichtho. Would this solve your issue as well, @MrD005?

Dev Goel · Answer 7 · Thu Apr 18 2024 04:00:46 GMT+0800 (China Standard Time)

Thanks, it solved the problem.
