Generating Visually Aligned Sound from Videos

This is the official pytorch implementation of the TIP paper "Generating Visually Aligned Sound from Videos" and the corresponding Visually Aligned Sound (VAS) dataset.

Demo videos containing sound generation results can be found here.

Updates

  • We release the pre-computed features for the test set of the Dog category, together with the pre-trained RegNet, so you can generate dog sounds yourself. (23/11/2020)

Usage Guide

Getting Started

Installation

Clone this repository into a directory. We refer to that directory as $REGNET_ROOT.

git clone https://github.com/PeihaoChen/regnet
cd regnet

Create a new Conda environment.

conda create -n regnet python=3.7.1
conda activate regnet

Install PyTorch and other dependencies.

conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
conda install ffmpeg -n regnet -c conda-forge
pip install -r requirements.txt

Download Datasets

In our paper, we collect 8 sound types (Dog, Fireworks, Drum, and Baby from VEGAS; Gun, Sneeze, Cough, and Hammer from AudioSet) to build our Visually Aligned Sound (VAS) dataset. Please first download the VAS dataset and unzip it to the $REGNET_ROOT/data/ folder.

For each sound type in AudioSet, we download all videos from YouTube and clean the data on Amazon Mechanical Turk (AMT) in the same way as VEGAS.

unzip ./data/VAS.zip -d ./data

Data Preprocessing

Run data_preprocess.sh to preprocess data and extract RGB and optical flow features.

Notice: The script we provide to calculate optical flow is easy to run but resource-consuming and slow. We strongly recommend referring to the TSN repository and its prebuilt docker image (our paper also uses this solution) to speed up optical flow extraction and to strictly reproduce the results.

source data_preprocess.sh
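After preprocessing, each category has its own feature folders under data/features/. The following Python sketch shows a hypothetical helper (not part of the repository) that builds the per-category paths expected by the training and testing commands in this README:

```python
import os

# Sound categories in the VAS dataset (Dog, Fireworks, Drum, Baby from VEGAS;
# Gun, Sneeze, Cough, Hammer from AudioSet).
CATEGORIES = ["dog", "fireworks", "drum", "baby",
              "gun", "sneeze", "cough", "hammer"]

def feature_dirs(root, category):
    """Hypothetical helper: build the feature paths for one category,
    mirroring the directory names used by train.py/test.py below."""
    base = os.path.join(root, "data", "features", category)
    return {
        "rgb": os.path.join(base, "feature_rgb_bninception_dim1024_21.5fps"),
        "flow": os.path.join(base, "feature_flow_bninception_dim1024_21.5fps"),
        "mel": os.path.join(base, "melspec_10s_22050hz"),
    }

print(feature_dirs(".", "dog")["rgb"])
```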

Training RegNet

Train RegNet from scratch. The results will be saved to ckpt/dog.

CUDA_VISIBLE_DEVICES=7 python train.py \
save_dir ckpt/dog \
auxiliary_dim 32 \
rgb_feature_dir data/features/dog/feature_rgb_bninception_dim1024_21.5fps \
flow_feature_dir data/features/dog/feature_flow_bninception_dim1024_21.5fps \
mel_dir data/features/dog/melspec_10s_22050hz \
checkpoint_path ''

If training stops unexpectedly, you can resume from the latest checkpoint.

CUDA_VISIBLE_DEVICES=7 python train.py \
-c ckpt/dog/opts.yml \
checkpoint_path ckpt/dog/checkpoint_018081
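Resuming works because train.py records its options in opts.yml under the save directory. The sketch below is a hypothetical example of that file, with keys mirroring the command-line flags above; the exact keys are an assumption, so inspect your own ckpt/dog/opts.yml:

```yaml
# Hypothetical ckpt/dog/opts.yml sketch -- keys mirror the training flags above
save_dir: ckpt/dog
auxiliary_dim: 32
rgb_feature_dir: data/features/dog/feature_rgb_bninception_dim1024_21.5fps
flow_feature_dir: data/features/dog/feature_flow_bninception_dim1024_21.5fps
mel_dir: data/features/dog/melspec_10s_22050hz
checkpoint_path: ''   # overridden on the command line when resuming
```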

Generating Sound

During inference, RegNet generates a visually aligned spectrogram, and a WaveNet vocoder then converts the spectrogram into a waveform. You should first download our trained WaveNet models for the different sound categories (Dog, Fireworks, Drum, Baby, Gun, Sneeze, Cough, Hammer).

The generated spectrogram and waveform will be saved to ckpt/dog/inference_result.

CUDA_VISIBLE_DEVICES=7 python test.py \
-c ckpt/dog/opts.yml \
aux_zero True \
checkpoint_path ckpt/dog/checkpoint_041000 \
save_dir ckpt/dog/inference_result \
wavenet_path /path/to/wavenet_dog.pth
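Conceptually, the two-stage pipeline described above can be sketched as follows. The shapes are illustrative assumptions (a 10 s clip at 21.5 fps, 1024-dim RGB plus 1024-dim flow features, an 80-bin mel spectrogram, a hop length of 256 samples); both models are stubbed, and the real networks live in the repository code.

```python
import numpy as np

def regnet_stub(visual_feats, n_mels=80):
    # Stand-in for RegNet: map per-frame visual features to a mel spectrogram.
    frames = visual_feats.shape[0]
    return np.zeros((n_mels, frames))

def wavenet_stub(mel, hop_length=256):
    # Stand-in for the WaveNet vocoder: each spectrogram frame is expanded
    # into hop_length audio samples.
    return np.zeros(mel.shape[1] * hop_length)

rgb = np.random.rand(215, 1024)    # ~10 s of RGB features at 21.5 fps
flow = np.random.rand(215, 1024)   # matching optical-flow features
feats = np.concatenate([rgb, flow], axis=1)

mel = regnet_stub(feats)           # spectrogram, shape (80, 215)
wav = wavenet_stub(mel)            # waveform, 215 * 256 samples
```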

If you want to train your own WaveNet model, you can use the WaveNet vocoder repository.

git clone https://github.com/r9y9/wavenet_vocoder && cd wavenet_vocoder
git checkout 2092a64

Pre-trained RegNet

You can also use our pre-trained RegNet and pre-computed features for generating visually aligned sounds.

First, download and unzip the pre-computed features (Dog) to the ./data/features/dog folder.

cd ./data/features/dog
tar -xvf features_dog_testset.tar # extract

Second, download and unzip our pre-trained RegNet (Dog) to the ./ckpt/dog folder.

cd ./ckpt/dog
tar -xvf RegNet_dog_checkpoint_041000.tar # extract

Third, run the inference code.

CUDA_VISIBLE_DEVICES=0 python test.py \
-c config/dog_opts.yml \
aux_zero True \
checkpoint_path ckpt/dog/checkpoint_041000 \
save_dir ckpt/dog/inference_result \
wavenet_path /path/to/wavenet_dog.pth

Enjoy your experiments!

Other Info

Citation

Please cite the following paper if you find RegNet useful for your research.

@article{chen2020regnet,
  author  = {Peihao Chen and Yang Zhang and Mingkui Tan and Hongdong Xiao and Deng Huang and Chuang Gan},
  title   = {Generating Visually Aligned Sound from Videos},
  journal = {IEEE Transactions on Image Processing},
  year    = {2020},
}

Contact

For any questions, please file an issue or contact:

Peihao Chen: phchencs@gmail.com
Hongdong Xiao: xiaohongdonghd@gmail.com
