
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

Project Page | arXiv

[Framework overview figure]

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion.

Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, Jing Liao

TL;DR: Direct-a-Video is a text-to-video generation framework that allows users to individually or jointly control the camera movement and/or object motion.

Setup

You may create a new environment:

conda create --name dav python=3.8
conda activate dav

The required Python packages are listed in requirements.txt; you can install them by running:

pip install -r requirements.txt

Download pretrained models

Text-to-video base model

We use Zeroscope_v2_576w as our base model. You can cache it locally by running the following Python code:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)

Camera module

Based on the text-to-video base model, we additionally trained a camera module that enables camera motion control. The camera module is available on OneDrive or Google Drive. Please download it and save it to the ./ckpt directory.

Run Inference

We provide two ways to run inference: a Python notebook and our Qt5-based UI. See the instructions below.

Python notebook

Refer to inference.ipynb and follow the step-by-step instructions and comments inside.

UI

We also designed a UI (based on PyQt5) for interactive use. Here are the instructions:

  1. Run the UI launching script ./ui/main_ui.py. If you are running on a remote server, make sure your machine supports graphics display.

     python ui/main_ui.py

     You'll see the interface shown below.

  2. Input your prompt in Section A. Instructions on the prompt (a small parsing sketch after this list illustrates the syntax):

    • Use * to mark the object word(s) and, optionally, the background word by appending * right after the word. For example, "a tiger* and a bear* walking in snow*".
    • If an object consists of more than one word, wrap it in ( ). E.g., "a (white tiger) walking in (green grassland)".
    • The markers * and ( ) can be used together, e.g., "a tiger* and a (bear) walking in (green grassland)".
    • The marked background word (if any) should always be the last marked word, as in the examples above.
  3. [optional] Camera motion: set the camera movement parameters in Section B. Remember to check the enable box first!

  4. [optional] Object Motion: draw object motion boxes in Section C:

    • Check the enable box at the top to enable this function.
    • On the blank canvas, left-click and drag the mouse to draw a starting box, release the mouse, then left-click and drag again to draw an ending box.
    • Upon releasing the mouse, a straight path is automatically generated between the starting box and the ending box (see the interpolation sketch after this list). You can right-click to adjust this path.
    • Click the "add object" button to add another box pair.
  5. [optional] You can change the random seed in Section D. We do not recommend changing the video resolution.

  6. In Section E, click "initialize the model" to initialize the models (this needs to be done only once, before generation).

  7. After initialization is done, click the Generate video button and wait a while; the output will be displayed. You can go back to step 3, 4, or 5 to adjust the inputs and hyperparameters, then generate again.

    Some tips:

    • If the model generates tightly packed, box-shaped objects, try increasing the attention amplification timestep in Section C to a higher value such as 0.98. You can also decrease the amplification weight to a lower value such as 10.
    • The initial noise matters a lot for the output; try different seeds if you do not get the desired result.
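
To make the prompt syntax in step 2 concrete, here is a minimal, hypothetical Python sketch of how *-marked words and ( )-wrapped phrases could be extracted from a prompt. It only illustrates the syntax described above; it is not the parser actually used by the UI.

import re

def extract_marked_tokens(prompt: str):
    """Hypothetical illustration of the UI's prompt syntax (not the repo's parser).

    A marked token is either a parenthesized multi-word phrase, e.g. "(white tiger)",
    or a single word immediately followed by '*', e.g. "tiger*".
    Per the instructions above, the background token (if marked) always comes last.
    """
    pattern = r"\(([^)]+)\)|(\w+)\*"
    return [m.group(1) or m.group(2) for m in re.finditer(pattern, prompt)]

print(extract_marked_tokens("a tiger* and a bear* walking in snow*"))
# ['tiger', 'bear', 'snow']
print(extract_marked_tokens("a (white tiger) walking in (green grassland)"))
# ['white tiger', 'green grassland']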
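
For step 4, the automatically generated straight path can be thought of as a per-frame linear interpolation between the starting box and the ending box. The sketch below illustrates that idea only; the box format and frame count are assumptions, and the actual path logic (including right-click adjustment) lives in the UI code.

import torch

def interpolate_boxes(start_box, end_box, n_frames):
    # Boxes are assumed to be (x0, y0, x1, y1) in normalized [0, 1] coordinates;
    # this is an illustrative sketch, not the UI's internal representation.
    start = torch.tensor(start_box, dtype=torch.float32)
    end = torch.tensor(end_box, dtype=torch.float32)
    ts = torch.linspace(0.0, 1.0, n_frames).unsqueeze(1)  # one interpolation weight per frame
    return (1.0 - ts) * start + ts * end                  # shape: (n_frames, 4)

# Example: a box moving from the left side toward the center over 24 frames.
path = interpolate_boxes((0.05, 0.4, 0.25, 0.7), (0.4, 0.4, 0.6, 0.7), n_frames=24)
print(path.shape)  # torch.Size([24, 4])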

Training

Dataset

We use a static-shot subset of MovieShots for training the camera motion module. We first download the dataset, then use BLIP-2 to generate a caption for each video. Finally, we assemble the training data into a CSV file; see ./data/example_train_data.csv for an example. A minimal sketch of the captioning step is given below.
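
As a reference for the captioning step, here is a minimal, hypothetical sketch that captions one clip with BLIP-2 (via Hugging Face transformers) and writes a CSV row. The BLIP-2 checkpoint, the middle-frame choice, and the column names are all assumptions for illustration; follow ./data/example_train_data.csv for the exact schema used in training.

import csv
import cv2
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# BLIP-2 captioner (this particular checkpoint is an example choice).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption_video(video_path: str) -> str:
    # Caption the middle frame as a proxy for the whole (static-shot) clip.
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

# Hypothetical column names -- check ./data/example_train_data.csv for the real schema.
with open("train_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_path", "caption"])
    for path in ["/path/to/clip_0001.mp4"]:  # replace with your downloaded clips
        writer.writerow([path, caption_video(path)])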

Start training

The main training script for camera motion is train_cam.py; you may want to go through it before running. We provide a bash script, train_cam_launcher.sh, in which you can set the arguments for launching the training script with Accelerate. Some useful arguments:

  • --output_dir: directory to save training outputs, including validation samples and checkpoints.
  • --train_data_csv: CSV file containing the training data; see './data/example_train_data.csv' for an example.
  • --val_data_csv: CSV file containing the validation data; see './data/example_val_data.csv' for an example.
  • --n_sample_frames: number of video frames per sample
  • --h: video height
  • --w: video width
  • --validation_interval: how often (in iterations) to run validation
  • --checkpointing_interval: how often (in iterations) to save a checkpoint
  • --mixed_precision: one of 'no' (i.e., fp32), 'fp16', or 'bf16' (bf16 only on certain GPUs)
  • --gradient_checkpointing: enable this to save memory

After setting up, run the bash script to launch the training:

bash train_cam_launcher.sh

Citation

@inproceedings{dav24,
  author = {Shiyuan Yang and Liang Hou and Haibin Huang and Chongyang Ma and Pengfei Wan and Di Zhang and Xiaodong Chen and Jing Liao},
  title = {Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion},
  booktitle = {Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers '24 (SIGGRAPH Conference Papers '24)},
  year = {2024},
  location = {Denver, CO, USA},
  date = {July 27--August 01, 2024},
  publisher = {ACM},
  address = {New York, NY, USA},
  pages = {12},
  doi = {10.1145/3641519.3657481},
}

Acknowledgment

This repo is mainly built on the text-to-video diffusers pipeline. Some code snippets were borrowed from the GLIGEN diffusers repo and the DenseDiff repo.
