Open-Sora Plan

[Project Page] [Chinese Homepage] [Discord] [Wechat Group]

Goal

This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "CloseAI") and to build knowledge about Video-VQVAE (VideoGPT) + DiT at scale. However, we have limited resources, so we deeply hope the whole open-source community can contribute to this project. Pull requests are welcome!

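This project hopes to reproduce Sora through the power of the open-source community. It was jointly initiated by the Peking University-Rabbitpre AIGC Joint Lab. With limited resources, we have so far only built the basic framework and cannot run a complete training. We hope the open-source community will help us add modules step by step and pool resources for training. The current version is still far from the goal and needs continued improvement and rapid iteration. Pull requests are welcome!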

Project stages:

Primary

Set up the codebase and train an unconditional model on a landscape dataset.
Train models that boost resolution and duration.

Extensions

Conduct text2video experiments on landscape dataset.
Train the 1080p model on video2text dataset.
Control the model with more conditions.

News

[2024.03.07] We support training with 128 frames (when sample rate = 3, which is about 13 seconds) of 256x256, or 64 frames (which is about 6 seconds) of 512x512.

[2024.03.05] See our latest todo list; pull requests are welcome.

[2024.03.04] We reorganized and modularized our code to make it easier to contribute to the project. Please see the Repo structure.

[2024.03.03] We opened some discussions and clarified several issues.

[2024.03.01] Training code is now available! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.

Todo

Set up the codebase and train an unconditional model on a landscape dataset

Train models that boost resolution and duration

Add Positional Interpolation (PI) to support out-of-domain sizes. 🙏 [Need your contribution]
Add 2D RoPE to improve generalization ability, as in FiT. 🙏 [Need your contribution]
Extract offline features.
Add frame interpolation model. 🤝 Thanks to @yunyangge
Add super resolution model. 🤝 Thanks to @Linzy19
Add accelerate to automatically manage training.
Joint training with images. 🙏 [Need your contribution]
Incorporate NaViT. 🙏 [Need your contribution]
Add FreeNoise support for training-free longer video generation. 🙏 [Need your contribution]

Conduct text2video experiments on landscape dataset.

Finish data loading and pre-processing utils. ⌛ [WIP]
Add CLIP and T5 support. ⌛ [WIP]
Add text2image training script. ⌛ [WIP]
Add prompt captioner. 🙏 [Need your contribution]

Train the 1080p model on video2text dataset

Control the model with more conditions

Load pretrained weights from PixArt-α. ⌛ [WIP]
Incorporate ControlNet. 🙏 [Need your contribution]

Repo structure (WIP)

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion         
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   └── super_resolution
│   ├── sample
│   ├── train                      -> Training code
│   └── utils

Requirements and Installation

The recommended requirements are as follows.

Python >= 3.8
CUDA Version >= 11.7
Install required packages:

git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
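
After installation, a quick check of the recommended versions can save debugging time. The snippet below is a minimal sketch, not part of the repository: the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, and it assumes PyTorch is available in the environment. If it is not, the CUDA check is reported as `None` rather than failing.

```python
import sys

# Recommended minimums taken from the requirements listed above.
MIN_PYTHON = (3, 8)
MIN_CUDA = (11, 7)


def parse_version(text):
    """Turn a version string such as '11.7' into a tuple of ints, e.g. (11, 7)."""
    return tuple(int(part) for part in text.split(".")[:2])


def meets_minimum(found, minimum):
    """Compare two (major, minor) tuples."""
    return tuple(found) >= tuple(minimum)


def check_environment():
    """Return a dict reporting whether Python and CUDA meet the recommended minimums."""
    report = {"python": meets_minimum(sys.version_info[:2], MIN_PYTHON)}
    try:
        import torch  # assumption: PyTorch is installed in this environment
    except ImportError:
        report["cuda"] = None  # cannot check without PyTorch
        return report
    cuda = torch.version.cuda  # None on CPU-only builds of PyTorch
    report["cuda"] = cuda is not None and meets_minimum(parse_version(cuda), MIN_CUDA)
    return report


print(check_environment())
```

Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the driver or toolkit version installed on the machine.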

Usage

Datasets

Refer to Data.md

Video-VQVAE (VideoGPT)

Training

To train VQVAE, run the script:

scripts/train_vqvae.sh

You can modify the training parameters within the script. For general training parameters, please refer to transformers.TrainingArguments. The remaining parameters are explained below:

VQ-VAE Specific Settings

--embedding_dim: number of dimensions for codebook embeddings
--n_codes 2048: number of codes in the codebook
--n_hiddens 240: number of hidden features in the residual blocks
--n_res_layers 4: number of residual blocks
--downsample "4,4,4": downsampling strides of the encoder along T, H, and W

Dataset Settings

--data_path <path>: path to an hdf5 file or a folder containing train and test folders with subdirectories of videos
--resolution 128: spatial resolution to train on
--sequence_length 16: temporal resolution, or video clip length
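
To see how the settings above interact, the following sketch computes the size of the latent grid produced for one clip. It is illustrative only: `latent_shape` is a hypothetical helper, and it assumes the encoder reduces each axis by exactly its stride, with every stride dividing the input size evenly. The actual encoder may pad or round differently.

```python
def latent_shape(sequence_length, resolution, downsample="4,4,4"):
    """Return the (T, H, W) latent grid for a clip, assuming exact division by each stride."""
    strides = tuple(int(s) for s in downsample.split(","))
    if len(strides) != 3:
        raise ValueError("downsample must have three comma-separated values: T,H,W")
    dims = (sequence_length, resolution, resolution)
    for name, dim, stride in zip("THW", dims, strides):
        if dim % stride != 0:
            raise ValueError(f"{name} size {dim} is not divisible by stride {stride}")
    return tuple(dim // stride for dim, stride in zip(dims, strides))


# Values shown in the settings above: 16 frames at 128x128 with strides 4,4,4.
t, h, w = latent_shape(sequence_length=16, resolution=128, downsample="4,4,4")
print((t, h, w))    # (4, 32, 32)
print(t * h * w)    # 4096 latent positions, each mapped to one of the --n_codes entries
```

Under these assumptions, larger strides shrink the latent grid (and the sequence the diffusion model has to handle) at the cost of reconstruction detail.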

Reconstructing

python examples/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1

python examples/rec_video.py --video-path "assets/origin_video_1.mp4" --rec-path "rec_video_1.mp4" --resolution 196 --num-frames 600 --sample-rate 1
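
The relationship between `--num-frames`, `--sample-rate`, and clip duration can be sketched as follows. This is an assumption about how strided sampling typically works, not the code in `examples/rec_video.py`. The helper names are hypothetical, and the 30 fps source frame rate is assumed purely for illustration.

```python
def sampled_frame_indices(total_frames, num_frames, sample_rate=1):
    """Pick every `sample_rate`-th frame index, keeping at most `num_frames` of them."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be >= 1")
    return list(range(0, total_frames, sample_rate))[:num_frames]


def clip_duration_seconds(num_frames, sample_rate, fps):
    """Seconds of source video spanned by `num_frames` frames taken every `sample_rate` frames."""
    return num_frames * sample_rate / fps


print(sampled_frame_indices(total_frames=10, num_frames=3, sample_rate=3))  # [0, 3, 6]

# Assuming a 30 fps source, the News entry's figures work out as:
print(clip_duration_seconds(128, 3, fps=30))  # 12.8, i.e. "about 13 seconds"
print(clip_duration_seconds(64, 3, fps=30))   # 6.4, i.e. "about 6 seconds"
```

A higher sample rate covers a longer stretch of video with the same number of frames, at the cost of temporal smoothness.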

We present four reconstructed videos in this demonstration, arranged from left to right as follows:

3s 596x336	10s 256x256	18s 196x196	24s 168x96

Others

Please refer to the document VQVAE.

VideoDiT (DiT)

Training

sh scripts/train.sh

Sampling

sh scripts/sample.sh

How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better than it is now!

For more details, please refer to the Contribution Guidelines.

Acknowledgement

Latte: The main codebase we built upon; it is a wonderful video generation model.
DiT: Scalable Diffusion Models with Transformers.
VideoGPT: Video Generation using VQ-VAE and Transformers.
FiT: Flexible Vision Transformer for Diffusion Model.
Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

License

The service is a research preview intended for non-commercial use only. See LICENSE for details.

About

This project aims to reproduce Sora (OpenAI's T2V model), but we have only limited resources. We deeply hope the whole open-source community can contribute to this project.

MIT License
