hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All

Home Page: https://hpcaitech.github.io/Open-Sora/

torch.distributed.elastic.multiprocessing.api

MrD005 opened this issue · comments

Traceback (most recent call last):
File "/root/anaconda3/envs/opensora/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/inference.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-04-16_16:40:07
host : e2e-84-47.ssdcloudindia.net
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 28813)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 28813
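For readers unsure how to read the `exitcode : -11` line: a negative exit code follows the usual convention that the child was killed by the signal with that number. A minimal sketch using only the standard library (the helper name `describe_exitcode` is made up for illustration and is not part of torchrun):

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Translate a child-process exit code into a readable label.

    Hypothetical helper for illustration; not a torch API.
    """
    if exitcode < 0:
        # A negative code means the process was terminated by signal -exitcode.
        try:
            return signal.Signals(-exitcode).name
        except ValueError:
            return f"unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


print(describe_exitcode(-11))
```

So the launcher is not reporting a Python exception here; the worker process itself crashed in native code, which is why `error_file` is empty.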

This could be caused by a mismatch between the CUDA and PyTorch versions. Run nvcc --version and python -c 'import torch; print(torch.version.cuda);' to check whether they match.
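The comparison suggested here can also be scripted. The sketch below is only one way to do it: the function names are invented for this example, and it assumes that matching on the major.minor release number is sufficient.

```python
import re
import subprocess


def nvcc_release(nvcc_output: str):
    """Extract the 'release X.Y' number from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def versions_match(nvcc_output: str, torch_cuda) -> bool:
    """Compare the toolkit release with torch.version.cuda (major.minor only)."""
    release = nvcc_release(nvcc_output)
    if release is None or torch_cuda is None:
        # torch.version.cuda is None on CPU-only builds of PyTorch.
        return False
    return release == ".".join(str(torch_cuda).split(".")[:2])


def check_local() -> bool:
    """Run the check on this machine; needs nvcc on PATH and torch installed."""
    import torch  # imported lazily so the helpers above work without torch

    out = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    return versions_match(out, torch.version.cuda)
```

Calling `check_local()` inside the `opensora` environment returns `True` when the two versions agree. As the replies in this thread show, a match does not rule out the segmentation fault.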

Same error, and the output of nvcc --version matches python -c 'import torch; print(torch.version.cuda);'.
[Screenshot 2024-04-17 11 16 13]

I also found this with dmesg:
[Screenshot 2024-04-17 15 19 03]

Same here.
nvcc --version and python -c 'import torch; print(torch.version.cuda);' return the same CUDA version:

11.8

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Hi, I downgraded torch to 2.1.2 and that resolved the problem (I also changed the xformers version to v0.0.23.post1).
Here is how I located the problem:
1. Debugging with pdb showed that torch.nn.Conv3d raises the segmentation fault.
2. Searching turned up a known issue that attributes the crash to a oneDNN upgrade; PyTorch 2.1.2 works. See pytorch/pytorch#120406.

Hope this helps.
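To check whether a given environment is affected without stepping through pdb, one option is to run a tiny Conv3d forward pass in a separate interpreter, so that a segmentation fault kills only the child process. This is a rough sketch under assumptions: the tensor shapes are arbitrary, the probe runs on CPU, and the thread does not say on which device or with which dtype the original crash occurred, so a clean run here does not prove the environment is safe.

```python
import subprocess
import sys

# Hypothetical probe: a small Conv3d forward pass. Shapes and device are
# assumptions chosen for illustration, not taken from scripts/inference.py.
CONV3D_PROBE = """
import torch
layer = torch.nn.Conv3d(3, 4, kernel_size=3)
out = layer(torch.randn(1, 3, 8, 16, 16))
print(torch.__version__, tuple(out.shape))
"""


def run_isolated(code: str, timeout: float = 120.0):
    """Run `code` in a fresh interpreter so a crash cannot take down the caller.

    Returns (returncode, stdout, stderr). On POSIX, a negative return code
    means the child was killed by that signal number.
    """
    proc = subprocess.run(
        # -X faulthandler makes the child dump a Python traceback on a fatal signal.
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    returncode, stdout, stderr = run_isolated(CONV3D_PROBE)
    print("return code:", returncode)
    print(stdout or stderr)
```

A return code of -11 from the probe would point to the same kind of SIGSEGV seen in the traceback; adapting the probe to the dtype and device used by the failing script would make it more representative.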

Thanks for sharing, @erichtho. Would this solve your issue as well, @MrD005?

Thanks, it solved the problem.