Tune-A-Video

This repository is the official implementation of Tune-A-Video.

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Given a video-text pair, our method, Tune-A-Video, fine-tunes a pre-trained text-to-image diffusion model for text-to-video generation.

News

[02/03/2023] Check out our latest results tuned on Modern Disney and Redshift.
[01/28/2023] New Feature: tune a video on personalized DreamBooth models.
[01/28/2023] Code released!

Setup

Requirements

pip install -r requirements.txt

Installing xformers is highly recommended for better memory efficiency and speed on GPUs. To enable xformers, set enable_xformers_memory_efficient_attention=True (the default).
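Because xformers is an optional install, a script may want to enable it only when the package is present. The following is a minimal sketch of that guard; the helper names `xformers_available` and `maybe_enable_xformers` are hypothetical and not part of this repository, and `pipe` is assumed to be any object exposing `enable_xformers_memory_efficient_attention()`, as the pipeline in the Inference section does.

```python
import importlib.util


def xformers_available() -> bool:
    """Return True if the xformers package can be found in this environment."""
    return importlib.util.find_spec("xformers") is not None


def maybe_enable_xformers(pipe) -> bool:
    """Enable memory-efficient attention on `pipe` only if xformers is installed.

    Hypothetical helper: returns True when the call was made, False otherwise.
    """
    if xformers_available():
        pipe.enable_xformers_memory_efficient_attention()
        return True
    return False
```

With this guard, the same script can run with or without xformers installed, at the cost of silently falling back to the slower attention path.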

Weights

[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-4, v2-1). You can also use fine-tuned Stable Diffusion models trained on different styles (e.g., Modern Disney, Redshift, etc.).

[DreamBooth] DreamBooth is a method to personalize text-to-image models like Stable Diffusion from just a few (3~5) images of a subject. Tuning a video on a DreamBooth model allows personalized text-to-video generation of a specific subject. Some public DreamBooth models are available on Hugging Face (e.g., mr-potato-head). You can also train your own DreamBooth model following this training example.
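One possible way to fetch weights from Hugging Face into the ./checkpoints layout used by the inference example is sketched below. This is an assumption-laden illustration, not the repository's documented procedure: the helper names are hypothetical, the repo id `CompVis/stable-diffusion-v1-4` is assumed to be the Hub location of Stable Diffusion v1-4, and `snapshot_download` with `local_dir` is assumed to be available from a reasonably recent `huggingface_hub` install.

```python
def local_checkpoint_dir(repo_id: str, root: str = "./checkpoints") -> str:
    """Map a Hub repo id (e.g. 'CompVis/stable-diffusion-v1-4') to a local folder."""
    model_name = repo_id.split("/")[-1]
    return f"{root.rstrip('/')}/{model_name}"


def download_weights(repo_id: str, root: str = "./checkpoints") -> str:
    """Download a full model repository and return the local folder it was saved to.

    Hypothetical helper; requires network access and the huggingface_hub package.
    """
    # Imported lazily so the path helper above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = local_checkpoint_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=target)
    return target
```

Calling `download_weights("CompVis/stable-diffusion-v1-4")` would place the files under ./checkpoints/stable-diffusion-v1-4; the same pattern should apply to a fine-tuned or DreamBooth model once you know its repo id.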

Usage

Training

To fine-tune the text-to-image diffusion models for text-to-video generation, run this command:

accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml"

Note: Tuning a video usually takes 300~500 steps, about 5~10 minutes using one A100 GPU and 10~20 minutes using one V100 GPU.

Inference

Once the training is done, run inference:

import torch

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid

# Base text-to-image weights and the output directory of the tuning run.
pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"
unet_model_path = "./outputs/man-surfing/2023-XX-XXTXX-XX-XX"

# Load the tuned UNet in half precision and plug it into the pipeline.
unet = UNet3DConditionModel.from_pretrained(unet_model_path, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()

# Generate an 8-frame 512x512 video for a new prompt.
prompt = "a panda is surfing"
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

# Save the frames as a GIF named after the prompt.
save_videos_grid(video, f"./{prompt}.gif")
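Two small conveniences around the snippet above, offered as a sketch rather than as part of the repository. `latest_run_dir` picks the newest run folder so the timestamped `unet_model_path` does not have to be typed by hand; it assumes run folders are named with sortable timestamps, as the placeholder path suggests. `prompt_to_filename` turns a prompt into a filename without spaces or punctuation. Both helper names are hypothetical.

```python
import re
from pathlib import Path


def latest_run_dir(output_root: str) -> Path:
    """Return the run directory under `output_root` whose name sorts last.

    Assumes directory names are timestamps that sort chronologically.
    """
    runs = sorted(
        (p for p in Path(output_root).iterdir() if p.is_dir()),
        key=lambda p: p.name,
    )
    if not runs:
        raise FileNotFoundError(f"no run directories found under {output_root}")
    return runs[-1]


def prompt_to_filename(prompt: str, suffix: str = ".gif") -> str:
    """Lowercase the prompt and replace runs of non-alphanumerics with '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    return f"{slug}{suffix}"
```

For example, `unet_model_path = str(latest_run_dir("./outputs/man-surfing"))` could replace the hard-coded path, and `save_videos_grid(video, prompt_to_filename(prompt))` could replace the f-string filename.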

Results

Stable Diffusion


[Training] a man is surfing.	a panda is surfing.	Iron Man is surfing in the desert.	a raccoon is surfing, cartoon style.

Mr Potato Head


[DreamBooth] sks mr potato head.	sks mr potato head, wearing a pink hat, is surfing.	sks mr potato head, wearing sunglasses, is surfing.	sks mr potato head is surfing in the forest.

Modern Disney


[Training] a bear is playing guitar.	a handsome prince is playing guitar, modern disney style.	a magical princess is playing guitar on the beach, modern disney style.	a rabbit is playing guitar, modern disney style.

Redshift


[Training] a man is skiing.	spider man is skiing.	bat man is skiing.	hulk is skiing.

Citation

If you make use of our work, please cite our paper.

@article{wu2022tuneavideo,
    title={Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
    author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2212.11565},
    year={2022}
}

Shoutouts

This code builds on diffusers. Thanks for open-sourcing!
Thanks to hysts for the awesome gradio demo.
