VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples

Official PyTorch implementation of our CVPR 2021 paper, VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples.

Overview

Given a video sequence as an input sample, we improve the temporal feature representations of MoCo from two perspectives. First, we introduce generative adversarial learning to improve the temporal robustness of the encoder. A generator (G) temporally drops out several frames from the sample, and the discriminator (D) is then trained to encode similar feature representations regardless of which frames were removed. Because the generator adaptively drops different frames across the training iterations of adversarial learning, the input sample is effectively augmented, which trains a temporally robust encoder. Second, we propose a temporally adversarial decay to model key attenuation in the memory queue when computing the contrastive loss. Together, these two changes substantially improve the temporal robustness of MoCo's feature representations for video classification and recognition. The approach is novel in terms of both its temporal representation learning methodology and its video scenario.
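
The two ideas above can be illustrated with a small, framework-free sketch. It is not the code from this repository: the function names, the per-frame `importance` scores, the choice to drop the highest-scoring frames, the `drop_ratio`, `temperature`, and `decay` values, and the convention that `queue[0]` is the newest key are all assumptions made for illustration. The first function mimics a generator that removes some frames, and the second computes an InfoNCE-style loss in which older keys in the queue contribute less.

```python
import math


def drop_frames(frames, importance, drop_ratio=0.25):
    """Remove the highest-scoring frames while keeping temporal order.

    `importance` stands in for hypothetical per-frame scores that a
    generator might produce; `drop_ratio` is an illustrative value.
    """
    if len(frames) != len(importance):
        raise ValueError("frames and importance must have the same length")
    num_drop = int(len(frames) * drop_ratio)
    # Frame indices ranked from highest to lowest score.
    ranked = sorted(range(len(frames)), key=lambda i: importance[i], reverse=True)
    dropped = set(ranked[:num_drop])
    return [frame for i, frame in enumerate(frames) if i not in dropped]


def decayed_contrastive_loss(q, k_pos, queue, temperature=0.07, decay=0.99):
    """InfoNCE-style loss with attenuated queue keys.

    Assumption: queue[0] is the newest key, and a key of age `a`
    is weighted by decay ** a, so older keys count less.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    pos = math.exp(dot(q, k_pos) / temperature)
    neg = sum(
        (decay ** age) * math.exp(dot(q, k) / temperature)
        for age, k in enumerate(queue)
    )
    return -math.log(pos / (pos + neg))
```

With `decay=1.0` the second function reduces to the usual contrastive loss over the whole queue, so the decay factor is the only change this sketch makes to the loss.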

Requirements

pytorch >= 1.3.0
tensorboard
cv2 (installed via the opencv-python package)
kornia

Usage

Data preparation

Download the Kinetics400 dataset from the official website.
Download the UCF101 dataset from the official website.

Train

Note that we first train D alone for 100 epochs as initialization, and then train G and D via adversarial learning for the remaining 100 epochs.
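
As a rough illustration of this two-stage schedule, the helper below maps an epoch index to the stage it falls in. It is a hypothetical sketch rather than code from `train.py`: the function name, the stage labels, and the zero-based epoch numbering are assumptions.

```python
def training_phase(epoch, init_epochs=100, total_epochs=200):
    """Return which stage a zero-based epoch index belongs to.

    Stage labels are illustrative: the first `init_epochs` epochs train
    D only, and the remaining epochs train G and D adversarially.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must be in [0, total_epochs)")
    return "init_D" if epoch < init_epochs else "adversarial_G_D"
```

Under these assumptions, epochs 0 to 99 fall in the initialization stage and epochs 100 to 199 in the adversarial stage.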

python train.py \
  --log_dir $your/log/path \
  --ckp_dir $your/checkpoint/path \
  -a r2plus1d_18 \
  --lr 0.005 \
  -fpc 32 \
  -b 32 \
  -j 128 \
  --epochs 200 \
  --schedule 120 160 \
  --dist_url 'tcp://localhost:10001' --multiprocessing_distributed --world_size 1 --rank 0 \
  --resume ./checkpoint_0100.pth.tar \
  $kinetics400/dataset/path

Pretrained Model

100 epochs for initialization: https://drive.google.com/file/d/1tE20ZNPg9l882900eXU0HcOc36UtwS5Y/view?usp=sharing

r2d18_200epoch (UCF101 Acc@1 82.518): https://drive.google.com/file/d/1DzA5Yn43x9ZuirX2jV8CuhhqCnSYky0x/view?usp=sharing

Action Recognition Evaluation

python eval.py

Acknowledgement

Our code is based on the implementation of "MoCo: Momentum Contrast for Unsupervised Visual Representation Learning" (see the citation below), including its implementation of the momentum encoder and queue. If you use our code, please also cite their work.

Citation

If our code is helpful to your work, please cite:

@inproceedings{pan2021videomoco,
  title={Videomoco: Contrastive video representation learning with temporally adversarial examples},
  author={Pan, Tian and Song, Yibing and Yang, Tianyu and Jiang, Wenhao and Liu, Wei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={11205--11214},
  year={2021}
}

@inproceedings{he2020momentum,
  title={Momentum contrast for unsupervised visual representation learning},
  author={He, Kaiming and Fan, Haoqi and Wu, Yuxin and Xie, Saining and Girshick, Ross},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={9729--9738},
  year={2020}
}
