Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing

Code for the CVPR 2019 paper "Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing".

Prerequisites

  • Python 2.7
  • PyTorch 0.3.0
  • CUDA 8.0
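
The environment setup itself is not spelled out in this README; a minimal sketch matching the versions above might look like the following (the environment name and the conda channel/package pins are assumptions, not part of the original instructions):

# hypothetical environment setup for the versions listed above
conda create -n cm-erase python=2.7
source activate cm-erase
# legacy PyTorch 0.3.0 built against CUDA 8.0 (assumes the pytorch channel still serves this build)
conda install pytorch=0.3.0 cuda80 -c pytorch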

Installation

  1. Clone the CM-Erase repository
git clone --recursive https://github.com/xh-liu/CM-Erase
  2. Prepare the submodules and associated data
  • Mask R-CNN: Follow the instructions in my mask-faster-rcnn repo to prepare everything needed for pyutils/mask-faster-rcnn.

  • REFER API and data: Use the download links in the REFER repository, go to the folder, and run make (see the sketch after this list). Follow data/README.md to prepare the images and the refcoco/refcoco+/refcocog annotations.

  • refer-parser2: Follow the instructions of refer-parser2 to extract the parsed expressions using Vicente's R1-R7 attributes. Note that this submodule is only needed if you want to train the models yourself.
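
For the REFER step above, the workflow is roughly as follows; the clone URL is an assumption based on the public REFER repository and is not given in this README:

# hypothetical sketch of preparing the REFER API (repository URL assumed)
git clone https://github.com/lichengunc/refer
cd refer
make   # builds the Cython utilities the REFER API depends on
# then follow data/README.md to place the images and the refcoco/refcoco+/refcocog annotations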

Training

  1. Prepare the training and evaluation data by running tools/prepro.py:
python tools/prepro.py --dataset refcoco --splitBy unc
  2. Download the GloVe pretrained word embeddings from Google Drive.

  3. Extract features using Mask R-CNN; the head_feats are used for subject module training and the ann_feats for relationship module training.

CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_head_feats.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_ann_feats.py --dataset refcoco --splitBy unc
  4. Detect objects/masks and extract their features (only needed if you want to evaluate automatic comprehension). We empirically set the confidence threshold of Mask R-CNN to 0.65.
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect.py --dataset refcoco --splitBy unc --conf_thresh 0.65
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect_to_mask.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_det_feats.py --dataset refcoco --splitBy unc
  5. Pretrain the network (CM-Att) with ground-truth annotation:
./experiments/scripts/train_mattnet.sh GPU_ID
  6. Train the network with cross-modal erasing (CM-Att-Erase):
./experiments/scripts/train_erase.sh GPU_ID
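
Taken together, a full preprocessing-plus-training run on refcoco consolidates the steps above; the GPU id (0) and the choice of refcoco/unc are illustrative, not prescribed by this README:

python tools/prepro.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=0 python tools/extract_mrcn_head_feats.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=0 python tools/extract_mrcn_ann_feats.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=0 python tools/run_detect.py --dataset refcoco --splitBy unc --conf_thresh 0.65
CUDA_VISIBLE_DEVICES=0 python tools/run_detect_to_mask.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=0 python tools/extract_mrcn_det_feats.py --dataset refcoco --splitBy unc
./experiments/scripts/train_mattnet.sh 0   # CM-Att pretraining on ground-truth annotations
./experiments/scripts/train_erase.sh 0     # CM-Att-Erase training with cross-modal erasing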

Evaluation

Evaluate the network with ground-truth annotation:

./experiments/scripts/eval_easy.sh GPU_ID

Evaluate the network with Mask R-CNN detection results:

./experiments/scripts/eval_dets.sh GPU_ID 

Pre-trained Models

We provide pre-trained models for RefCOCO, RefCOCO+, and RefCOCOg. Download them from Google Drive and put them under the ./output folder.
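
Placing the downloaded models could look like this; the archive name is hypothetical, so use whatever file the Google Drive link actually provides:

mkdir -p output
unzip cm_erase_pretrained_models.zip -d output/   # hypothetical archive name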

Citation

If you find our code useful for your research, please consider citing:

@inproceedings{liu2019improving,
  title={Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing},
  author={Liu, Xihui and Wang, Zihao and Shao, Jing and Wang, Xiaogang and Li, Hongsheng},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={1950--1959},
  year={2019}
}
@inproceedings{yu2018mattnet,
  title={MAttNet: Modular attention network for referring expression comprehension},
  author={Yu, Licheng and Lin, Zhe and Shen, Xiaohui and Yang, Jimei and Lu, Xin and Bansal, Mohit and Berg, Tamara L},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={1307--1315},
  year={2018}
}

Acknowledgement

This project is built on the PyTorch implementation of MAttNet: Modular Attention Network for Referring Expression Comprehension (CVPR 2018).
