hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All

Home Page: https://hpcaitech.github.io/Open-Sora/


root cause: no __init__.py file under shardformer folder.

Get-David opened this issue · comments

          root cause: no __init__.py file under shardformer folder.

Originally posted by @BountyMage in #232 (comment)
I have the same error. I added an init file at /home/zdw/Open-Sora/opensora/acceleration/shardformer/__init__.py, but it still fails.

(opensora) zdw@ai-gpu-server149:~/Open-Sora$ torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x512x512.py --data-path /home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
Config (path: configs/opensora/train/16x512x512.py): {'num_frames': 16, 'frame_interval': 3, 'image_size': (512, 512), 'root': None, 'data_path': '/home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv', 'use_image_transform': False, 'num_workers': 4, 'dtype': 'bf16', 'grad_checkpoint': False, 'plugin': 'zero2', 'sp_size': 1, 'model': {'type': 'STDiT-XL/2', 'space_scale': 1.0, 'time_scale': 1.0, 'from_pretrained': '/home/zdw/Open-Sora/pre_training/Open-Sora/OpenSora-v1-HQ-16x512x512.pth', 'enable_flashattn': False, 'enable_layernorm_kernel': False}, 'vae': {'type': 'VideoAutoencoderKL', 'from_pretrained': '/home/zdw/Open-Sora/pre_training/sd-vae-ft-ema', 'micro_batch_size': 128}, 'text_encoder': {'type': 't5', 'from_pretrained': '/home/zdw/Open-Sora/pre_training/t5-v1_1-xxl', 'model_max_length': 120, 'shardformer': True}, 'scheduler': {'type': 'iddpm', 'timestep_respacing': ''}, 'seed': 42, 'outputs': 'outputs', 'wandb': False, 'epochs': 1000, 'log_every': 10, 'ckpt_every': 500, 'load': None, 'batch_size': 8, 'lr': 2e-05, 'grad_clip': 1.0, 'multi_resolution': False}
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[04/02/24 15:13:50] INFO     colossalai - colossalai - INFO:                                                                                
                             /data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:67 launch      
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1                          
[2024-04-02 15:13:50] Experiment directory created at outputs/010-F16S3-STDiT-XL-2
[2024-04-02 15:13:50] Dataset contains 1 videos (/home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv)
[2024-04-02 15:13:50] Total batch size: 8
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:45<00:00, 22.79s/it]
Traceback (most recent call last):
  File "/home/zdw/Open-Sora/scripts/train.py", line 287, in <module>
    main()
  File "/home/zdw/Open-Sora/scripts/train.py", line 132, in main
    text_encoder = build_module(cfg.text_encoder, MODELS, device=device)  # T5 must be fp32
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/registry.py", line 22, in build_module
    return builder.build(cfg)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 287, in __init__
    self.shardformer_t5()
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 292, in shardformer_t5
    from opensora.acceleration.shardformer.policy.t5_encoder import T5EncoderPolicy
ModuleNotFoundError: No module named 'opensora.acceleration.shardformer'
[2024-04-02 15:14:41,917] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2216794) of binary: /data/share8/zdw/miniconda3/envs/opensora/bin/python
Traceback (most recent call last):
  File "/data/share8/zdw/miniconda3/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-02_15:14:41
  host      : ai-gpu-server149
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2216794)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(opensora) zdw@ai-gpu-server149:~/Open-Sora$ 

Did you reinstall it?

Create one and reinstall it

Reinstall what?

Reinstall what, the conda environment, or what?

only opensora

I tried the following commands again, but the error persists

git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .

The reported error is as follows:

(opensora) zdw@ai-gpu-server149:~/opensora2/Open-Sora$ torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x512x512.py --data-path /home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv

/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
Config (path: configs/opensora/train/16x512x512.py): {'num_frames': 16, 'frame_interval': 3, 'image_size': (512, 512), 'root': None, 'data_path': '/home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv', 'use_image_transform': False, 'num_workers': 4, 'dtype': 'bf16', 'grad_checkpoint': False, 'plugin': 'zero2', 'sp_size': 1, 'model': {'type': 'STDiT-XL/2', 'space_scale': 1.0, 'time_scale': 1.0, 'from_pretrained': '/home/zdw/Open-Sora/pre_training/Open-Sora/OpenSora-v1-HQ-16x512x512.pth', 'enable_flashattn': True, 'enable_layernorm_kernel': True}, 'vae': {'type': 'VideoAutoencoderKL', 'from_pretrained': '/home/zdw/Open-Sora/pre_training/sd-vae-ft-ema', 'micro_batch_size': 128}, 'text_encoder': {'type': 't5', 'from_pretrained': '/home/zdw/Open-Sora/pre_training/t5-v1_1-xxl', 'model_max_length': 120, 'shardformer': True}, 'scheduler': {'type': 'iddpm', 'timestep_respacing': ''}, 'seed': 42, 'outputs': 'outputs', 'wandb': False, 'epochs': 1000, 'log_every': 10, 'ckpt_every': 500, 'load': None, 'batch_size': 4, 'lr': 2e-05, 'grad_clip': 1.0, 'local_rank': 0, 'multi_resolution': False}
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[04/02/24 17:47:00] INFO     colossalai - colossalai - INFO:                                                                                             
                             /data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:67 launch                   
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1                                       
[2024-04-02 17:47:00] Experiment directory created at outputs/008-F16S3-STDiT-XL-2
[2024-04-02 17:47:00] Dataset contains 1 videos (/home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv)
[2024-04-02 17:47:00] Total batch size: 4
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:17<00:00,  8.71s/it]
Traceback (most recent call last):
  File "/home/zdw/opensora2/Open-Sora/scripts/train.py", line 287, in <module>
    main()
  File "/home/zdw/opensora2/Open-Sora/scripts/train.py", line 132, in main
    text_encoder = build_module(cfg.text_encoder, MODELS, device=device)  # T5 must be fp32
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/registry.py", line 22, in build_module
    return builder.build(cfg)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 287, in __init__
    self.shardformer_t5()
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 292, in shardformer_t5
    from opensora.acceleration.shardformer.policy.t5_encoder import T5EncoderPolicy
ModuleNotFoundError: No module named 'opensora.acceleration.shardformer'
[2024-04-02 17:47:22,023] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2289970) of binary: /data/share8/zdw/miniconda3/envs/opensora/bin/python
Traceback (most recent call last):
  File "/data/share8/zdw/miniconda3/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-02_17:47:22
  host      : ai-gpu-server149
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2289970)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(opensora) zdw@ai-gpu-server149:~/opensora2/Open-Sora$ 

I copied the empty init.py from the acceleration directory into the shardformer directory.

I followed your steps, but it still reports No module named 'opensora.acceleration.shardformer'

It looks like find_packages skips shardformer during package discovery because the directory has no __init__.py file. Add an empty __init__.py under opensora/acceleration/shardformer and then rerun pip install -v .
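The discovery behavior described above can be sketched with a minimal, self-contained reproduction (assuming standard setuptools; the directory names merely mirror the repo layout and the temp tree is hypothetical):

```python
import os
import tempfile

from setuptools import find_packages

# Build a throwaway tree mirroring opensora/acceleration/shardformer/policy,
# where only the "shardformer" directory itself lacks an __init__.py.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "opensora", "acceleration",
                             "shardformer", "policy"))
    for parts in (("opensora",),
                  ("opensora", "acceleration"),
                  ("opensora", "acceleration", "shardformer", "policy")):
        open(os.path.join(root, *parts, "__init__.py"), "w").close()

    # find_packages does not recurse into a directory without __init__.py,
    # so the whole shardformer subtree (including policy) is silently dropped.
    found_before = find_packages(where=root)

    # The fix: an empty __init__.py makes the subtree discoverable again.
    open(os.path.join(root, "opensora", "acceleration",
                      "shardformer", "__init__.py"), "w").close()
    found_after = find_packages(where=root)

print("opensora.acceleration.shardformer" in found_before)  # False
print("opensora.acceleration.shardformer" in found_after)   # True
```

This is why `pip install -v .` produced an installed `opensora` without the `shardformer` subpackage: the wheel never contained it, so the import fails at runtime even though the source tree looks complete.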

This issue is stale because it has been open for 7 days with no activity.

This issue was closed because it has been inactive for 7 days since being marked as stale.