torch.distributed.elastic.multiprocessing.api
MrD005 opened this issue · comments
Traceback (most recent call last):
File "/root/anaconda3/envs/opensora/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/inference.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-04-16_16:40:07
host : e2e-84-47.ssdcloudindia.net
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 28813)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 28813
This could be caused by a mismatch between the CUDA and PyTorch versions. Run nvcc --version
and python -c 'import torch; print(torch.version.cuda);'
to check whether they match.
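The comparison suggested above can also be scripted. The following is only a minimal sketch: the helper names `parse_nvcc_release` and `toolkit_matches_torch` are made up for illustration, and it assumes the nvcc output contains a "release X.Y" fragment. Matching versions do not rule out other causes of a SIGSEGV.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the 'X.Y' from the 'release X.Y' part of nvcc's version text, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def toolkit_matches_torch(nvcc_output, torch_cuda):
    """True only if both versions are known and identical (e.g. '11.8' == '11.8')."""
    release = parse_nvcc_release(nvcc_output)
    return release is not None and torch_cuda is not None and release == torch_cuda


if __name__ == "__main__":
    # nvcc may not be on PATH; in that case there is nothing to compare.
    nvcc_text = ""
    if shutil.which("nvcc"):
        nvcc_text = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        ).stdout

    # torch.version.cuda is None on CPU-only builds of PyTorch.
    try:
        import torch
        torch_cuda = torch.version.cuda
    except ImportError:
        torch_cuda = None

    print("nvcc release :", parse_nvcc_release(nvcc_text))
    print("torch CUDA   :", torch_cuda)
    print("match        :", toolkit_matches_torch(nvcc_text, torch_cuda))
```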
Same.
nvcc -V
and python -c 'import torch; print(torch.version.cuda);'
both report the same CUDA version:
11.8
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Hi, I downgraded torch to 2.1.2 and that resolved the problem (I also changed the xformers version to v0.0.23.post1).
Here is how I located the problem:
1. Debugged with pdb and found that torch.nn.Conv3d raises the segmentation fault.
2. Searched and found a known issue that attributes it to a oneDNN upgrade; PyTorch 2.1.2 works. See pytorch/pytorch#120406
Hope this helps.
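As an alternative to stepping through with pdb, a suspect operation can be run in a child interpreter so that a crash shows up as a negative return code (the -11 in the torchrun report above) instead of killing the debugging session. This is a rough sketch, not the procedure used in this thread: `run_isolated` and `CONV3D_PROBE` are hypothetical names, and the Conv3d layer and input shapes are arbitrary, so they may not trigger the fault on a given machine.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=120.0):
    """Run `snippet` in a fresh Python process.

    Returns (returncode, signal_name). On POSIX, a negative returncode means
    the child was killed by that signal, e.g. -11 for SIGSEGV.
    """
    proc = subprocess.run(
        # -X faulthandler makes the child dump its Python traceback on a crash.
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    sig_name = None
    if proc.returncode < 0:
        try:
            sig_name = signal.Signals(-proc.returncode).name
        except ValueError:
            sig_name = None  # signal number not known to this platform
    return proc.returncode, sig_name


# Hypothetical probe: shapes and layer sizes are placeholders, not taken from
# scripts/inference.py.
CONV3D_PROBE = """
import torch
conv = torch.nn.Conv3d(3, 8, kernel_size=3, padding=1)
x = torch.randn(1, 3, 4, 16, 16)
print(conv(x).shape)
"""

if __name__ == "__main__":
    code, sig = run_isolated(CONV3D_PROBE)
    print("return code:", code, "signal:", sig)
```

If the probe reports SIGSEGV under one torch version and exits with 0 under another, that points at the installed build rather than at the inference script.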
Thanks, that solved the problem.