
SimVTP: This repo is the official implementation of "Simple Video Text Pre-training with Masked Autoencoders"


πŸš€ SimVTP: Simple Video Text Pre-training with Masked Autoencoders


πŸƒ Abstract

SimVTP: a Simple Video-Text Pretraining framework via masked autoencoders. We randomly mask out spatial-temporal tubes of the input video and word tokens of the input text, and then feed them into a unified autoencoder to reconstruct the missing pixels and words.
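
As a rough illustration of this masking step, here is a minimal sketch (tensor shapes and names are assumptions, not the repo's actual code):

```python
import torch

def mask_inputs(video_tubes, text_ids, video_ratio=0.9, text_ratio=0.75):
    """Randomly pick the visible subset of video tubes and text tokens.

    video_tubes: (N_v, D) tube embeddings; text_ids: (N_t,) word-token ids.
    Everything not kept is masked out and must be reconstructed.
    """
    n_v, n_t = video_tubes.size(0), text_ids.size(0)
    keep_v = torch.randperm(n_v)[: int(n_v * (1 - video_ratio))]  # keep 10% of tubes
    keep_t = torch.randperm(n_t)[: int(n_t * (1 - text_ratio))]   # keep 25% of tokens
    return video_tubes[keep_v], text_ids[keep_t], keep_v, keep_t
```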

Our SimVTP has several properties:

  • Thanks to the unified autoencoder, SimVTP reconstructs the masked signal of one modality with the help from another modality, which implicitly learns the cross-modal alignment between video tubes and text tokens.
  • SimVTP not only benefits from a high video masking ratio (e.g., 90%) thanks to the temporal redundancy of video, but also requires a high text masking ratio (e.g., 75%), much higher than BERT's (e.g., 15%), to achieve optimal performance.
  • Equipping SimVTP with video-text contrastive learning (VTC) and video-text matching (VTM), two commonly used cross-modal training strategies, can significantly further improve transfer performance.
  • SimVTP is data-efficient, e.g., pre-training only on 10% data of WebVid-2M, SimVTP achieves surprisingly good results (43.8 R@1) on MSRVTT, which is far above recent state-of-the-art methods pre-trained on both CC3M and WebVid-2M.


πŸ”₯ Main Results on Downstream Tasks

Text-to-video Retrieval on MSR-VTT

| Method | Vis Enc. Init | Pre-trained Data | #pairs | R@1 | R@5 | R@10 | MdR |
|---|---|---|---|---|---|---|---|
| HERO | ImageNet, Kinetics | HowTo100M | 136M | 16.8 | 43.4 | 57.7 | - |
| AVLnet | ImageNet, Kinetics | HowTo100M | 136M | 27.1 | 55.6 | 66.6 | 4 |
| Frozen | ImageNet | WebVid2M+CC3M | 5.5M | 31.0 | 59.5 | 70.5 | 3 |
| OATrans | ImageNet | WebVid2M+CC3M | 5.5M | 35.8 | 63.4 | 76.5 | 3 |
| RegionLearner | ImageNet | WebVid2M+CC3M | 5.5M | 36.3 | 63.9 | 72.5 | 3 |
| LocVTP | ImageNet | WebVid2M+CC3M | 5.5M | 36.5 | 64.3 | 76.8 | 3 |
| BFormer | ImageNet | WebVid2M+CC3M | 5.5M | 37.6 | 64.8 | 75.1 | 3 |
| SimVTP (ours) | Kinetics | WebVid2M | 2.5M | 53.6 | 82.8 | 90.8 | 1 |
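
For reference, the R@K and median-rank (MdR) numbers in such tables are typically computed from a text-to-video similarity matrix along these lines (an illustrative sketch, not the repo's evaluation code):

```python
import numpy as np

def retrieval_metrics(sim):
    """sim[i, j] is the similarity of text query i to video j; ground truth is the diagonal."""
    order = np.argsort(-sim, axis=1)  # videos sorted by descending similarity per query
    # 0-indexed rank at which each query's correct video is retrieved.
    ranks = np.where(order == np.arange(sim.shape[0])[:, None])[1]
    return {
        "R@1": 100.0 * np.mean(ranks < 1),
        "R@5": 100.0 * np.mean(ranks < 5),
        "R@10": 100.0 * np.mean(ranks < 10),
        "MdR": np.median(ranks) + 1,  # median rank, 1-indexed
    }
```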

πŸ”¨ Dependencies and Installation

  • Python >= 3.6
  • PyTorch >= 1.6.0
  • NVIDIA GPU + CUDA
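
A quick way to verify the environment (assumes PyTorch is already installed):

    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"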

β›Ί Installation

  1. Clone repo
    git clone git@github.com:mayuelala/SimVTP.git
    cd SimVTP
  2. Install dependent packages
    pip install -r requirements.txt

πŸ”… Data Preparation

Please refer to DATA.md for pre-training and downstream evaluation datasets.

🌿 Pre-training

We pretrain our SimVTP on the WebVid-2M video dataset with 64 V100 GPUs (8 nodes × 8 GPUs). Our implementation of SimVTP supports multi-node distributed training. We provide the scripts in the scripts folder.

bash scripts/pretrain_webvid.sh

You can also launch the script on each node separately: set --master_addr to the IP of node 0, and set --node_rank from 0 to 7 (one rank per node).
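
For example, the launch on node 0 might look like the following (a sketch: the entry point and remaining arguments are assumptions; see scripts/pretrain_webvid.sh for the actual command):

    python -m torch.distributed.launch --nproc_per_node=8 --nnodes=8 --node_rank=0 \
        --master_addr=<ip_of_node_0> --master_port=29500 run_pretrain.py ...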

πŸ„ Fine-tuning on MSRVTT

We fine-tune our SimVTP on MSRVTT with 8 V100 GPUs. We provide the scripts in the scripts folder.

bash scripts/finetune_msrvtt.sh

You can also add the --only_test flag to evaluate our fine-tuned model.
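
For example (assuming the script forwards extra flags to the underlying training command; otherwise add the flag inside the script):

    bash scripts/finetune_msrvtt.sh --only_test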

🐧 Model Weight

We provide the pre-trained weights and the weights fine-tuned on MSRVTT on Google Drive.

| Method | Backbone | Epoch | Pre-train | Fine-tune | R@1 |
|---|---|---|---|---|---|
| SimVTP | ViT-B | 200 | script/log/checkpoint | script/log/checkpoint | 53.6 |
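
Once downloaded, a checkpoint can be inspected with plain PyTorch (the file name below is a placeholder, not the released name):

```python
import torch

ckpt = torch.load("simvtp_checkpoint.pth", map_location="cpu")  # placeholder file name
print(sorted(ckpt.keys()))  # top-level keys, e.g. model/optimizer state dicts
```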

πŸ‘€ Visualization

We provide the script for visualization in vis.sh. Though not exactly the same as the original texts, the reconstructed texts are plausible and consistent with the video content. Sometimes they are even more accurate than the original texts, such as the white cat and the little boy in the second and third columns.


πŸ”’ License

The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file.

πŸ‘ Acknowledgement

This project is built upon MAE-pytorch and VideoMAE. Thanks to the contributors of these great codebases.

✏️ Citation

@article{ma2022simvtp,
  title={SimVTP: Simple Video Text Pre-training with Masked Autoencoders},
  author={Ma, Yue and Yang, Tianyu and Shan, Yin and Li, Xiu},
  journal={arXiv preprint arXiv:2212.03490},
  year={2022}
}
