ModuleNotFoundError: No module named 'torch._six'
IMJONEZZ opened this issue
I installed on Ubuntu using the instructions in the README.
Everything installed correctly, but when I attempt to run using this command:
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path OpenSora-v1-HQ-16x256x256.pth --prompt-path ./assets/texts/t2v_samples.txt
I get this traceback:
[04/05/24 13:17:15] INFO colossalai - colossalai - INFO: /home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:67 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:41<00:00, 20.93s/it]
Traceback (most recent call last):
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/layers/blocks.py", line 33, in get_layernorm
from apex.normalization import FusedLayerNorm
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/apex/__init__.py", line 8, in <module>
from . import amp
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/apex/amp/__init__.py", line 1, in <module>
from .amp import init, half_function, float_function, promote_function,\
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/apex/amp/amp.py", line 5, in <module>
from .frontend import *
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/apex/amp/frontend.py", line 2, in <module>
from ._initialize import _initialize
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/apex/amp/_initialize.py", line 2, in <module>
from torch._six import string_classes
ModuleNotFoundError: No module named 'torch._six'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/Open-Sora/scripts/inference.py", line 112, in <module>
main()
File "/home/user/Open-Sora/scripts/inference.py", line 58, in main
model = build_module(
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/registry.py", line 22, in build_module
return builder.build(cfg)
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 385, in STDiT_XL_2
model = STDiT(depth=28, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs)
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 181, in __init__
[
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 182, in <listcomp>
STDiTBlock(
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 56, in __init__
self.norm1 = get_layernorm(hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel)
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/layers/blocks.py", line 37, in get_layernorm
raise RuntimeError("FusedLayerNorm not available. Please install apex.")
RuntimeError: FusedLayerNorm not available. Please install apex.
[2024-04-05 13:18:01,454] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1417) of binary: /home/user/miniconda3/envs/opensora/bin/python
Traceback (most recent call last):
File "/home/user/miniconda3/envs/opensora/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/inference.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-05_13:18:01
host : DESKTOP-G7PO0IO.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1417)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
It says that FusedLayerNorm isn't available and to install apex, but apex is correctly installed.
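The first traceback suggests why the message is misleading: importing apex.normalization fails inside apex itself, and get_layernorm then reports that failure as apex being unavailable. A minimal sketch for surfacing the underlying import error directly is below. Note that probe_import is a hypothetical helper written for illustration, not part of Open-Sora or apex.

```python
import importlib


def probe_import(module_name):
    """Try to import a module and return the underlying error, if any.

    Hypothetical diagnostic helper (not part of Open-Sora or apex).
    Returns None when the import succeeds, otherwise a short string
    holding the exception type and message.
    """
    try:
        importlib.import_module(module_name)
    except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
        return f"{type(exc).__name__}: {exc}"
    return None


if __name__ == "__main__":
    # Module names taken from the traceback above.
    for name in ("apex", "apex.normalization", "torch._six"):
        print(name, "->", probe_import(name) or "OK")
```

Judging by the call in stdit.py, the fused kernel is only requested when enable_layernorm_kernel is set, so turning that option off in the config may sidestep apex entirely. That is an assumption based on the traceback, not something confirmed in this thread.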
I found this issue: microsoft/DeepSpeed#2845
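It looks like torch._six is no longer available: it was removed in PyTorch 2.0, and this environment has torch 2.2.2, while the installed apex still imports it.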
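For reference, a compatibility shim of the kind sometimes applied to code that still imports torch._six might look like the sketch below. It assumes that string_classes in older PyTorch was the tuple (str, bytes), which is a recollection rather than something shown in this thread. The helper is_string is hypothetical. Here the change would have to go where apex/amp/_initialize.py performs the failing import, and building a more recent apex is probably the cleaner fix.

```python
# Sketch of a fallback for code that imports torch._six.string_classes.
try:
    from torch._six import string_classes
except ImportError:
    # Assumption: this mirrors the old torch._six definition.
    string_classes = (str, bytes)


def is_string(value):
    """Return True if value is an instance of string_classes."""
    return isinstance(value, string_classes)
```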
After deleting the environment and the repo and reinstalling everything from scratch, I now get this error:
ModuleNotFoundError: No module named 'colossalai'
This is puzzling, because colossalai imports fine in the interpreter:
Python 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import colossalai
>>> print(colossalai.__version__)
0.3.6
ModuleNotFoundError: No module named 'colossalai'
This could be caused by ambiguity about which Python you used to install colossalai. Can you please show the output of "which python"?
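A quick way to check for this kind of ambiguity is to ask the interpreter itself where it lives and where it finds the package. A minimal sketch follows, in which describe_module is a hypothetical helper name. It should be run with the same Python that torchrun launches.

```python
import importlib.util
import shutil
import sys


def describe_module(name):
    """Report which interpreter is running and where it finds a module."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "python_on_path": shutil.which("python"),
        "torchrun_on_path": shutil.which("torchrun"),
        "found": spec is not None,
        "origin": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_module("colossalai").items():
        print(f"{key}: {value}")
```

If interpreter and python_on_path differ, or torchrun_on_path points outside the opensora environment, the package was likely installed into a different Python than the one running the script.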
This issue is stale because it has been open for 7 days with no activity.