LuoweiZhou / grounded-video-description-dev

Video Grounding and Captioning

Grounded Video Description

This repo hosts the source code for our paper Grounded Video Description. It supports the ActivityNet-Entities dataset. Code that supports the Flickr30k-Entities dataset is hosted on the flickr_branch branch.

[Figure: teaser results]

Note: [42] indicates Masked Transformer

Quick Start


Follow steps 1 to 3 in the Requirements section to install the required packages.

Download everything

Simply run the following command to download all the data and pre-trained models (total 216GB):

bash tools/

Starter code

Run the following eval code to test whether your environment is set up correctly:

python --batch_size 100 --cuda --num_workers 6 --max_epoch 50 --inference_only \
    --start_from save/anet-sup-0.05-0-0.1-run1 --id anet-sup-0.05-0-0.1-run1 \
    --seq_length 20 --language_eval --eval_obj_grounding --obj_interact

(Optional) Single-GPU training code, as a double check:

python --batch_size 20 --cuda --checkpoint_path save/gvd_starter --id gvd_starter --language_eval

You can now skip to the Training and Validation section!

Requirements (Recommended)

  1. Clone the repo recursively:
git clone --recursive

Make sure the submodules densevid_eval and coco-caption are both included.

  2. Install CUDA 9.0 and CUDNN v7.1. Later versions should be fine, but the conda env file might need to be updated accordingly (e.g., for PyTorch).

  3. Install Miniconda (either Miniconda2 or 3, version 4.6+). We recommend using a conda environment to install the required packages, including Python 3.7 or 2.7, PyTorch 1.1.0, etc.:

MINICONDA_ROOT=[to your Miniconda root directory]
conda env create -f cfgs/conda_env_gvd_py3.yml --prefix $MINICONDA_ROOT/envs/gvd_pytorch1.1
conda activate gvd_pytorch1.1

Note that there have been some breaking changes since PyTorch 1.2 (e.g., bitwise not on torch.bool/torch.uint8 and masked_fill_). This code base could potentially work with PyTorch 1.2+ once the corresponding changes are made.

(Ongoing) Replace cfgs/conda_env_gvd_py3.yml with cfgs/conda_env_gvd.yml for Python 2.7.

  4. (Optional) If you choose to not use, be sure to install Java and download Stanford CoreNLP for SPICE (see here). Also, download and place the reference file under coco-caption/annotations. Download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools directory.
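
The note above on PyTorch 1.2 mentions bitwise not on torch.uint8 masks. The following is a rough pure-Python illustration (it does not use PyTorch) of why that change can matter. It assumes that before 1.2 `~mask` on a uint8 tensor behaved like `1 - mask`, while from 1.2 onward it is a true bitwise NOT. Lists of ints stand in for tensors, and all function names are hypothetical.

```python
# Pure-Python stand-in for uint8 mask tensors; this is not PyTorch code.

def legacy_invert(mask):
    """Emulate the pre-1.2 behaviour assumed here: ~mask acts like 1 - mask."""
    return [1 - m for m in mask]


def bitwise_not_uint8(mask):
    """Emulate a true bitwise NOT on unsigned 8-bit values."""
    return [(~m) & 0xFF for m in mask]


def logical_not(mask):
    """Invert a boolean mask; this is what masking code usually intends."""
    return [not bool(m) for m in mask]


mask = [1, 0, 1, 1]
print(legacy_invert(mask))      # [0, 1, 0, 0]
print(bitwise_not_uint8(mask))  # [254, 255, 254, 254]
print(logical_not(mask))        # [False, True, False, False]
```

Under these assumptions, every element of the bitwise result is non-zero, so a 0/1 mask inverted that way would select all positions. Converting masks to a boolean type before inverting them is one possible adjustment when porting to newer PyTorch versions.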

Data Preparation

Download the preprocessed annotation files from here, then uncompress and place them under data/anet. Alternatively, you can reproduce them using the data from the ActivityNet-Entities repo and the preprocessing script under prepro. Then, download the ground-truth caption annotations (under our val/test splits) from here and place them under data/anet as well.

The region features and detections are available for download (feature and detection). The region feature file should be decompressed and placed under your feature directory. We refer to the region feature directory as feature_root in the code. The H5 region detection (proposal) file is referred to as proposal_h5 in the code.

The frame-wise appearance (with suffix _resnet.npy) and motion (with suffix _bn.npy) feature files are available here. We refer to this directory as seg_feature_root.
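
The suffix convention above lends itself to a quick sanity check after downloading. The sketch below assumes that each segment has one `<name>_resnet.npy` and one `<name>_bn.npy` file stored directly under seg_feature_root; the flat layout, the function name `find_unpaired_segments`, and the segment names in the self-check are assumptions made for illustration.

```python
import os
import tempfile

APPEARANCE_SUFFIX = "_resnet.npy"  # frame-wise appearance features
MOTION_SUFFIX = "_bn.npy"          # frame-wise motion features


def find_unpaired_segments(seg_feature_root):
    """Return sorted segment names that lack one of the two feature files."""
    appearance, motion = set(), set()
    for name in os.listdir(seg_feature_root):
        if name.endswith(APPEARANCE_SUFFIX):
            appearance.add(name[: -len(APPEARANCE_SUFFIX)])
        elif name.endswith(MOTION_SUFFIX):
            motion.add(name[: -len(MOTION_SUFFIX)])
    # Symmetric difference: names present in one set but not the other.
    return sorted(appearance ^ motion)


# Small self-check on a throwaway directory with made-up segment names.
with tempfile.TemporaryDirectory() as root:
    for fname in ("seg_a_resnet.npy", "seg_a_bn.npy", "seg_b_resnet.npy"):
        open(os.path.join(root, fname), "w").close()
    unpaired = find_unpaired_segments(root)
    print(unpaired)  # ['seg_b']
```

An empty result on the real directory would suggest that every segment has both feature files.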

Other auxiliary files, such as the weights from the Detectron fc7 layer, are available here. Uncompress them and place them under the data directory.

Training and Validation

Modify the config file cfgs/anet_res101_vg_feat_10x100prop.yml with the correct dataset and feature paths (or through symlinks). Link tools/anet_entities to your ANet-Entities dataset root location. Create new directories log and results under the root directory to save log and result files.
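
The directory setup described above can also be scripted. This is a minimal sketch, not part of the repo: the function name `prepare_workspace` and both path arguments are placeholders, and the self-check runs in temporary directories rather than in a real checkout.

```python
import os
import tempfile


def prepare_workspace(repo_root, anet_entities_root):
    """Create log/ and results/, and link tools/anet_entities to the dataset."""
    for sub in ("log", "results"):
        os.makedirs(os.path.join(repo_root, sub), exist_ok=True)

    link_path = os.path.join(repo_root, "tools", "anet_entities")
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    # Only create the symlink if nothing is there yet, so reruns are safe.
    if not os.path.lexists(link_path):
        os.symlink(os.path.abspath(anet_entities_root), link_path)
    return link_path


# Self-check in temporary directories (stand-ins for the real paths).
with tempfile.TemporaryDirectory() as repo, tempfile.TemporaryDirectory() as data:
    link = prepare_workspace(repo, data)
    created = sorted(os.listdir(repo))
    link_ok = os.path.islink(link)
    print(created, link_ok)
```

In a real checkout, `repo_root` would be the repository root and `anet_entities_root` the ANet-Entities dataset root mentioned above.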

Example commands for running an 8-GPU data-parallel job:

For supervised models (with self-attention):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml \
    --batch_size $batch_size --cuda --checkpoint_path save/$ID --id $ID --mGPUs \
    --language_eval --w_att2 $w_att2 --w_grd $w_grd --w_cls $w_cls --obj_interact | tee log/$ID

For unsupervised models (without self-attention):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml \
    --batch_size $batch_size --cuda --checkpoint_path save/$ID --id $ID --mGPUs \
    --language_eval | tee log/$ID

Arguments: batch_size=240, w_att2=0.05, w_grd=0, w_cls=0.1; ID is the model name.

(Optional) Remove --mGPUs to run in single-GPU mode.
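
To make the argument values concrete, the sketch below derives the per-GPU batch size they imply and assembles the flags of the supervised command as a list. It assumes the batch is split evenly across the visible GPUs, which is the usual behaviour of data-parallel training but is not confirmed here for this code base. The helper names are hypothetical, and the script name is left out because the commands above do not show it.

```python
def per_gpu_batch(batch_size, cuda_visible_devices):
    """Batch size each GPU sees if the batch is split evenly across devices."""
    n_gpus = len(cuda_visible_devices.split(","))
    if batch_size % n_gpus != 0:
        raise ValueError("batch_size is not divisible by the number of GPUs")
    return batch_size // n_gpus


def build_train_flags(model_id, batch_size, w_att2, w_grd, w_cls, multi_gpu=True):
    """Assemble the flags of the supervised training command as a list."""
    flags = [
        "--path_opt", "cfgs/anet_res101_vg_feat_10x100prop.yml",
        "--batch_size", str(batch_size),
        "--cuda",
        "--checkpoint_path", "save/" + model_id,
        "--id", model_id,
    ]
    if multi_gpu:
        flags.append("--mGPUs")
    flags += [
        "--language_eval",
        "--w_att2", str(w_att2),
        "--w_grd", str(w_grd),
        "--w_cls", str(w_cls),
        "--obj_interact",
    ]
    return flags


print(per_gpu_batch(240, "0,1,2,3,4,5,6,7"))  # 30
flags = build_train_flags("anet-sup-0.05-0-0.1-run1", 240, 0.05, 0, 0.1)
print(" ".join(flags))
```

Passing `multi_gpu=False` drops `--mGPUs`, matching the single-GPU note above.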

Pre-trained Models

The pre-trained models can be downloaded from here (1.5GB). Make sure you uncompress the file under the save directory (create one under the root directory if it does not exist).

Inference and Testing

For supervised models (ID=anet-sup-0.05-0-0.1-run1):

(standard inference: language evaluation and localization evaluation on generated sentences)

python --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml --batch_size 100 --cuda \
    --num_workers 6 --max_epoch 50 --inference_only --start_from save/$ID --id $ID \
    --val_split $val_split --densecap_references $references --densecap_verbose --seq_length 20 \
    --language_eval --eval_obj_grounding --obj_interact \
    | tee log/eval-$val_split-$ID-beam$beam_size-standard-inference

(GT inference: localization evaluation on GT sentences)

python --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml --batch_size 100 --cuda \
    --num_workers 6 --max_epoch 50 --inference_only --start_from save/$ID --id $ID \
    --val_split $val_split --seq_length 40 --eval_obj_grounding_gt --obj_interact \
    | tee log/eval-$val_split-$ID-beam$beam_size-gt-inference

For unsupervised models (ID=anet-unsup-0-0-0-run1), simply remove the --obj_interact option.

Arguments: references="./data/anet/anet_entities_val_1.json ./data/anet/anet_entities_val_2.json", val_split='validation'. To evaluate on the test split, set val_split='testing', set references accordingly, and submit the object localization output files under results to the eval server.
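
Since references is a single space-separated string of file paths, it can help to confirm that each file exists before starting a long evaluation run. The sketch below does that and also reproduces the log file naming pattern from the commands above. Both function names are hypothetical, paths are assumed to contain no spaces, and the beam size passed in the example is a placeholder because its value is not given above.

```python
import os
import tempfile


def missing_references(references):
    """Split the space-separated reference string and list paths not on disk."""
    return [p for p in references.split() if not os.path.isfile(p)]


def eval_log_name(val_split, model_id, beam_size, mode):
    """Mirror the log file naming pattern used in the inference commands."""
    return "log/eval-{}-{}-beam{}-{}-inference".format(
        val_split, model_id, beam_size, mode
    )


# Self-check with one existing and one missing file in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    present = os.path.join(root, "anet_entities_val_1.json")
    absent = os.path.join(root, "anet_entities_val_2.json")
    open(present, "w").close()
    missing = missing_references(present + " " + absent)
    only_absent_missing = missing == [absent]
    print(only_absent_missing)  # True

# beam_size=1 is a placeholder value for illustration only.
log_name = eval_log_name("validation", "anet-sup-0.05-0-0.1-run1", 1, "standard")
print(log_name)
```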

You need at least 9GB of free GPU memory for the evaluation.


Please acknowledge the following paper if you use the code:

  title={Grounded Video Description},
  author={Zhou, Luowei and Kalantidis, Yannis and Chen, Xinlei and Corso, Jason J and Rohrbach, Marcus},


We thank Jiasen Lu for his Neural Baby Talk repo. We thank Chih-Yao Ma for his helpful discussions.


This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Portions of the source code are based on the Neural Baby Talk project.

