Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
CVPR 2021, Oral, Best Student Paper Honorable Mention.
Jie Lei*, Linjie Li*, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu
Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images and text as inputs, and outputs task predictions. ClipBERT is built on 2D CNNs and transformers, and uses a sparse sampling strategy to enable efficient end-to-end video-and-language learning. This repository supports end-to-end pretraining and finetuning for the following tasks:
- Image-text pretraining on COCO and VG captions.
- Text-to-video retrieval finetuning on MSRVTT, DiDeMo, and ActivityNet Captions.
- Video-QA finetuning on TGIF-QA and MSRVTT-QA.
- Image-QA finetuning on VQA 2.0.
It is also straightforward to add other image-text or video-text tasks for pretraining and finetuning.
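As a rough illustration of the sparse sampling idea, a minimal sketch in plain Python (function names and defaults here are ours for illustration, not this repository's API): a few short clips are sampled per video at each training step, encoded independently, and per-clip predictions are aggregated by late fusion at inference.

```python
import random

def sample_clips(num_frames, num_clips=2, clip_len=2, train=True):
    """Sparsely sample short clips of consecutive frame indices from a video.

    At training time clips are drawn at random; at inference they are
    spaced roughly uniformly so the video is covered deterministically.
    """
    max_start = num_frames - clip_len
    if train:
        starts = [random.randint(0, max_start) for _ in range(num_clips)]
    else:
        step = max(1, max_start // max(1, num_clips - 1))
        starts = [min(i * step, max_start) for i in range(num_clips)]
    return [list(range(s, s + clip_len)) for s in starts]

def aggregate(per_clip_scores):
    """Late fusion: average the per-clip prediction scores."""
    n = len(per_clip_scores)
    dims = len(per_clip_scores[0])
    return [sum(s[d] for s in per_clip_scores) / n for d in range(dims)]
```

The key point is that each training step only ever encodes a handful of frames, rather than a densely extracted feature sequence for the full video.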
We provide a Docker image for easier reproduction. Please install the following:
- nvidia driver (418+)
- Docker (19.03+)
- nvidia-container-toolkit
Our scripts require the user to be in the docker group so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training, hence GPUs with Tensor Cores are recommended.
- Create a folder that stores pretrained models, all the data, and results.

  ```bash
  PATH_TO_STORAGE=/path/to/your/data/
  mkdir -p $PATH_TO_STORAGE/txt_db      # annotations
  mkdir -p $PATH_TO_STORAGE/vis_db      # image and video
  mkdir -p $PATH_TO_STORAGE/finetune    # finetuning results
  mkdir -p $PATH_TO_STORAGE/pretrained  # pretrained models
  ```
- Download pretrained models.

  Our e2e pretrained ClipBERT model (849MB) can be downloaded with the following command:

  ```bash
  bash scripts/download_pretrained.sh $PATH_TO_STORAGE
  ```

  This pretrained model can be used for finetuning on video-text and image-text tasks. For your convenience, this script also downloads `bert-base-uncased` and `grid-feat-vqa` model weights, which are used as initialization for pretraining.

- Install required packages.
  ```bash
  cd docker
  pip install -r requirements.txt
  pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
  # apex
  git clone https://github.com/NVIDIA/apex.git && \
  cd apex && \
  git reset --hard 3fe10b5597ba14a748ebb271a6ab97c09c5701ac && \
  pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . && \
  cd .. && rm -rf apex
  # detectron2
  pip install 'git+https://github.com/facebookresearch/fvcore'
  python -m pip install 'git+https://github.com/facebookresearch/detectron2.git@ffff8ac'
  # use the faster pillow-simd instead of the original pillow
  pip uninstall -y pillow && \
  CC="cc -mavx2" pip install -U --force-reinstall pillow-simd
  ```
- Download data.

  ```bash
  # outside the container
  # download COCO and VG data
  bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
  # download VQA annotations
  bash scripts/download_vqa.sh $PATH_TO_STORAGE
  ```
- Finetuning.

  ```bash
  # inside the container
  PYTHONPATH=. python -m torch.distributed.launch --nproc_per_node=4 src/tasks/run_vqa.py \
      --config src/configs/vqa_base_resnet50.json \
      --output_dir $OUTPUT_DIR
  ```
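A side note on what `--nproc_per_node=4` implies: the launcher spawns one worker process per GPU and tells each its rank through environment variables. A simplified sketch of the variables each worker would see (our illustration, not the launcher's actual implementation):

```python
def worker_env(nproc_per_node, node_rank=0, nnodes=1):
    """Enumerate the environment each local worker process would see
    under a torch.distributed.launch-style launcher (simplified)."""
    world_size = nproc_per_node * nnodes
    return [
        {
            "LOCAL_RANK": local_rank,                          # GPU index on this node
            "RANK": node_rank * nproc_per_node + local_rank,   # global worker id
            "WORLD_SIZE": world_size,                          # total number of workers
        }
        for local_rank in range(nproc_per_node)
    ]
```

So `--nproc_per_node=4` on a single node yields four workers with ranks 0 through 3, each typically bound to one GPU.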
- Inference.

  ```bash
  # inside the container
  PYTHONPATH=. python -m torch.distributed.launch --nproc_per_node=4 src/tasks/run_vqa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB \
      --inference_batch_size 64
  ```
- Download data.

  ```bash
  # outside the container
  bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
  ```
- Pretraining.

  ```bash
  # inside the container
  horovodrun -np 8 python src/pretrain/run_pretrain.py \
      --config src/configs/pretrain_image_text_base_resnet50_mlm_itm.json \
      --output_dir $OUTPUT_DIR
  ```
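The config name indicates the two pretraining objectives: masked language modeling (MLM) and image-text matching (ITM). As a toy illustration of BERT-style random masking for MLM (a sketch of ours, not this repository's implementation):

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def mask_tokens(tokens, rng):
    """Randomly replace ~15% of tokens with [MASK]; return the masked
    sequence plus (position, original token) prediction targets."""
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < MASK_PROB:
            masked.append(MASK)
            targets.append((i, tok))
        else:
            masked.append(tok)
    return masked, targets
```

The model is then trained to recover the original tokens at the masked positions, conditioned on both the remaining text and the sampled visual input.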
ClipBERT takes raw videos and text as inputs, so there is no need for offline feature extraction.
However, to improve data loading speed, we use LMDB to store the raw image and video files.
You can use the following script to convert a list of videos with file extensions `mp4` and `avi` into LMDB:
```bash
# outside the container
python src/preprocessing/file2lmdb.py \
    --data_root /path/to/videos \
    --lmdb_save_dir /path/to/save/lmdb \
    --ext avi mp4 \
    --file_type video
```
For images, pass the appropriate file extensions to `--ext` and set `--file_type image`.
Text annotation files are reorganized into `jsonl` files; see the example preprocessed files downloaded by the scripts in `scripts/`.
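Each line of a `jsonl` file is one standalone JSON object. A minimal sketch of writing and reading this format (the field names below are hypothetical; consult the downloaded files for the actual schema):

```python
import json
import os
import tempfile

# Hypothetical annotation records; the real schema is defined by the
# preprocessed files downloaded via the scripts in scripts/.
annotations = [
    {"id": 0, "text": "a man is cooking", "vis_id": "video_001"},
    {"id": 1, "text": "a dog runs on grass", "vis_id": "video_002"},
]

path = os.path.join(tempfile.mkdtemp(), "example.jsonl")

# Write one JSON object per line (the jsonl convention).
with open(path, "w") as f:
    for ann in annotations:
        f.write(json.dumps(ann) + "\n")

# Read it back line by line.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
```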
If you find this code useful for your research, please consider citing:
@inproceedings{lei2021less,
title={Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling},
author={Lei, Jie and Li, Linjie and Zhou, Luowei and Gan, Zhe and Berg, Tamara L. and Bansal, Mohit and Liu, Jingjing},
booktitle={CVPR},
year={2021}
}
We thank Yen-Chun Chen, Ruotian Luo, and other members and interns at Microsoft Multimodal AI for their helpful discussions. We also thank the anonymous reviewers for their constructive feedback.
This code used resources from transformers, UNITER, HERO, grid-feats-vqa, SlowFast, Detectron2. The code is implemented using PyTorch, with multi-GPU support from Horovod and mixed precision support from apex. We thank the authors for open-sourcing their awesome projects.
MIT