CCVS - Official PyTorch Implementation

Code for NeurIPS'21 paper CCVS: Context-aware Controllable Video Synthesis.

CCVS: Context-aware Controllable Video Synthesis
Guillaume Le Moing, Jean Ponce, Cordelia Schmid

Paper: https://arxiv.org/abs/2107.08037
Project page: https://16lemoing.github.io/ccvs

Abstract: This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several new key elements for improved spatial resolution and realism: It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control. The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module. Adversarial training of the autoencoder in the appearance and temporal domains is used to further improve the realism of its output. A quantizer inserted between the encoder and the transformer in charge of forecasting future frames in latent space (and its inverse inserted between the transformer and the decoder) adds even more flexibility by affording simple mechanisms for handling multimodal ancillary information for controlling the synthesis process (e.g., a few sample frames, an audio track, a trajectory in image space) and taking into account the intrinsically uncertain nature of the future by allowing multiple predictions. Experiments with an implementation of the proposed approach give very good qualitative and quantitative results on multiple tasks and standard benchmarks.

Installation

The code is tested with PyTorch 1.7.0 and Python 3.8.6.

To install dependencies with conda, run:

conda env create -f env.yml
conda activate ccvs

To install apex, run:

cd tools
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ../..
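
Once the environment is set up, it can help to compare the installed versions against the tested ones stated above (PyTorch 1.7.0, Python 3.8.6). The following is a minimal sketch and not part of the repository: `report_versions` is a hypothetical helper name, and it reads package metadata instead of importing torch, so it also runs when torch is missing.

```python
import platform
from importlib import metadata

# Versions reported as tested in this README.
TESTED = {"python": "3.8.6", "torch": "1.7.0"}


def report_versions():
    """Return {name: (installed, tested)}; installed is None if not found."""
    installed = {"python": platform.python_version()}
    try:
        # Reads the installed distribution's metadata without importing torch.
        installed["torch"] = metadata.version("torch")
    except metadata.PackageNotFoundError:
        installed["torch"] = None
    return {name: (installed[name], tested) for name, tested in TESTED.items()}


if __name__ == "__main__":
    for name, (found, tested) in report_versions().items():
        print(f"{name}: installed={found}, tested={tested}")
```

A mismatch does not necessarily mean the code will fail; these are only the versions the code was tested with.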

Prepare datasets

BAIR Robot Pushing - (Repo) - (License)

Create the corresponding directory:

mkdir -p datasets/bairhd

Download the high resolution data from this link and put it in the new directory, then run:

tar -xvf datasets/bairhd/softmotion_0511.tar.gz -C datasets/bairhd

Preprocess BAIR dataset for resolution 256x256:

python data/scripts/preprocess_bairhd.py --data_root datasets/bairhd --dim 256

We also provide our annotation tool to later estimate the (x,y) position of the arm:

python data/scripts/annotate_bairhd.py --data_root datasets/bairhd/original_frames_256 --out_dir datasets/bairhd/annotated_frames

Kinetics-600 - (Repo) - (License)

This dataset is a collection of YouTube links from which we extract the corresponding train and test videos by running:

mkdir -p datasets/kinetics
wget https://storage.googleapis.com/deepmind-media/Datasets/kinetics600.tar.gz -P datasets/kinetics
tar -xvf datasets/kinetics/kinetics600.tar.gz -C datasets/kinetics
python data/scripts/download_kinetics.py datasets/kinetics/kinetics600/train.csv datasets/kinetics/kinetics600/train_videos --trim
python data/scripts/download_kinetics.py datasets/kinetics/kinetics600/test.csv datasets/kinetics/kinetics600/test_videos --trim

Preprocess the dataset:

python data/scripts/preprocess_kinetics.py --src_folder datasets/kinetics/kinetics600/train_videos --out_root datasets/kinetics/preprocessed_videos --out_name train_64p_square_32t --max_vid_len 32 --resize 64 --square_crop
python data/scripts/preprocess_kinetics.py --src_folder datasets/kinetics/kinetics600/test_videos --out_root datasets/kinetics/preprocessed_videos --out_name test_64p_square_32t --max_vid_len 32 --resize 64 --square_crop

Split the data into folds and precompute metadata for faster training/testing:

python data/scripts/compute_folds_kinetics.py train 100 64p_square_32t
python data/scripts/compute_folds_kinetics.py test 40 64p_square_32t --max_per_fold 1248
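
The internals of compute_folds_kinetics.py are not described here. As a rough illustration of what splitting a list of videos into a fixed number of folds with a per-fold cap could look like, the sketch below distributes items round-robin and then truncates each fold. The function `split_into_folds` and its behavior are assumptions made for illustration, not the script's actual logic.

```python
def split_into_folds(items, num_folds, max_per_fold=None):
    """Distribute items round-robin over num_folds folds (hypothetical helper).

    If max_per_fold is given, each fold is truncated to at most that many items.
    """
    if num_folds <= 0:
        raise ValueError("num_folds must be positive")
    # Fold i receives items i, i + num_folds, i + 2 * num_folds, ...
    folds = [items[i::num_folds] for i in range(num_folds)]
    if max_per_fold is not None:
        folds = [fold[:max_per_fold] for fold in folds]
    return folds


if __name__ == "__main__":
    videos = [f"video_{i:03d}.mp4" for i in range(10)]  # placeholder names
    for index, fold in enumerate(split_into_folds(videos, 3, max_per_fold=3)):
        print(index, fold)
```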

AudioSet-Drums - (Repo) - (License) - (License of curated version)

Create the corresponding directory:

mkdir -p datasets/drums

Download the data from this link and run:

unzip datasets/drums/AudioSet_Drums.zip -d datasets/drums

UCF101 - (Repo)

Create the corresponding directory:

mkdir -p datasets/ucf101

Download the data from this link and run:

mkdir datasets/ucf101/videos
unrar e datasets/ucf101/UCF101.rar datasets/ucf101/videos

Training

BAIR Robot Pushing

First, train the frame autoencoder:

bash scripts/bairhd/train_frame_autoencoder.sh

Then, train the transformer for different tasks (one should change --q_load_path in the corresponding files to point to the checkpoints of the trained autoencoder):

  • Video prediction
bash scripts/bairhd/train_transformer.sh
  • Point-to-point synthesis
bash scripts/bairhd/train_transformer_p2p.sh
  • State-conditioned synthesis (this requires training a state estimator first and changing the corresponding --s_load_path before training the transformer)
bash scripts/bairhd/train_state_estimator.sh
bash scripts/bairhd/train_transformer_state.sh
  • Unconditional synthesis
bash scripts/bairhd/train_transformer_unc.sh

Kinetics-600

The same applies, e.g., for video prediction:

bash scripts/kinetics/train_frame_autoencoder.sh
bash scripts/kinetics/train_transformer.sh

UCF101

The same applies, e.g., for video prediction:

bash scripts/ucf101/train_frame_autoencoder.sh
bash scripts/ucf101/train_transformer.sh

AudioSet-Drums

For audio-conditioned synthesis, we train two encoders (one to compress frames, the other to compress sound features) and then train the transformer:

bash scripts/drums/train_frame_autoencoder.sh
bash scripts/drums/train_stft_autoencoder.sh
bash scripts/drums/train_transformer_audio.sh

Inference

We provide checkpoints for various configurations:

Dataset Future prediction Point-to-point synthesis State-conditioned synthesis Sound-conditioned synthesis Unconditional synthesis Download
BAIR Robot Pushing checkpoint
Kinetics-600 checkpoint
UCF101 checkpoint
AudioSet-Drums checkpoint

Extract checkpoints with the following command (replacing CKPT.zip with the corresponding name):

unzip CKPT.zip -d checkpoints/
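
When several checkpoint archives have been downloaded, the same extraction can be scripted. The sketch below assumes all archives sit in one directory and extracts every .zip file found there into checkpoints/; `extract_checkpoints` is a hypothetical helper, not part of the repository, and it will also pick up any unrelated .zip file in that directory.

```python
import zipfile
from pathlib import Path


def extract_checkpoints(zip_dir=".", out_dir="checkpoints"):
    """Extract every .zip archive in zip_dir into out_dir; return archive names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    print(extract_checkpoints())
```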

Synthesize videos from downloaded checkpoints.

BAIR Robot Pushing

bash scripts/bairhd/save_videos_state_off.sh
bash scripts/bairhd/save_videos_p2p.sh
bash scripts/bairhd/save_videos_state_on.sh
bash scripts/bairhd/save_videos_unc.sh

Kinetics-600

bash scripts/kinetics600/save_videos.sh
bash scripts/kinetics600/save_videos_p2p.sh

UCF101

bash scripts/ucf101/save_videos.sh

AudioSet-Drums

bash scripts/drums/save_videos_audio_off.sh
bash scripts/drums/save_videos_audio_on.sh

Here are some important flags:

  • --vid_len: the total number of frames in synthetic videos (including conditioning frames)
  • --x_cond_len: the number of tokens corresponding to conditioning frames. In the preceding experiments one frame is represented by 64 tokens, so one can set this flag to 0 for unconditional synthesis, 64 for one input frame, 128 for two, and so on.
  • --keep_state: add this flag in sound- or state-conditioned synthesis to effectively use the control (otherwise sound / state are also predicted)
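
The relation between the number of conditioning frames and --x_cond_len described above is a simple multiplication. A minimal sketch, assuming the 64 tokens per frame stated for the preceding experiments (other configurations may use a different value); the helper name `x_cond_len` is only illustrative:

```python
TOKENS_PER_FRAME = 64  # value stated above for the preceding experiments


def x_cond_len(num_cond_frames, tokens_per_frame=TOKENS_PER_FRAME):
    """Return the --x_cond_len value for a given number of conditioning frames."""
    if num_cond_frames < 0:
        raise ValueError("num_cond_frames must be non-negative")
    return num_cond_frames * tokens_per_frame


if __name__ == "__main__":
    for frames in (0, 1, 2):
        print(f"{frames} conditioning frame(s): --x_cond_len {x_cond_len(frames)}")
```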

Evaluation

After inference, compute evaluation metrics with the following commands:

python tools/tf_fvd/fvd.py --exp_tag TAG
python tools/pytorch_metrics/metrics.py --exp_tag TAG

where TAG is the name of the directory (inside the results/ folder) under which videos were saved during inference. The first command computes the Fréchet video distance (FVD), and the second one computes the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM). One can use the --idx flag to compute PSNR / SSIM for specific timesteps.
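
For reference, PSNR is conventionally defined as 10 * log10(MAX^2 / MSE), where MSE is the mean squared error between a reference and a prediction and MAX is the largest possible pixel value. The sketch below applies this definition to flat sequences of pixel values, assuming an 8-bit range (MAX = 255). It is only an illustration: tools/pytorch_metrics/metrics.py may use a different value range, tensor layout, or averaging scheme.

```python
import math


def psnr(reference, prediction, max_val=255.0):
    """Peak signal-to-noise ratio (in dB) between two equal-length pixel sequences."""
    if len(reference) != len(prediction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_val ** 2 / mse)


if __name__ == "__main__":
    frame = [10, 20, 30, 40]
    noisy = [12, 18, 33, 40]
    print(f"PSNR: {psnr(frame, noisy):.2f} dB")
```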

Citation

If you find this code useful in your research, please consider citing:

@inproceedings{lemoing2021ccvs,
  title     = {{CCVS}: Context-aware Controllable Video Synthesis},
  author    = {Guillaume Le Moing and Jean Ponce and Cordelia Schmid},
  booktitle = {NeurIPS},
  year      = {2021}
}

Acknowledgments

This code borrows from StyleGAN2, minGPT, pytorch-liteflownet and VQVAE.

License

CCVS is released under the MIT license.
