Improving One-stage Visual Grounding by Recursive Sub-query Construction

by Zhengyuan Yang, Tianlang Chen, Liwei Wang, and Jiebo Luo

European Conference on Computer Vision (ECCV), 2020

Introduction

We propose a recursive sub-query construction framework to address the limitations of previous one-stage visual grounding methods [1] on long and complex queries. For more details, please refer to our paper.

[1] Yang, Zhengyuan, et al. "A fast and accurate one-stage approach to visual grounding". ICCV 2019.

Prerequisites

Python 3.6 (3.5 tested)
PyTorch (0.4.1 and 1.4.0 tested; versions in between should work)
Others (Pytorch-Bert, etc.): check requirements.txt for reference.
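
As a quick sanity check of the environment, the sketch below compares a version string against the tested PyTorch range. The helper names `parse_version` and `in_tested_range` are hypothetical and are not part of this repository; in a real environment you would pass `torch.__version__` instead of the literals.

```python
import re

# Tested PyTorch range stated above; versions in between are expected to work.
TESTED_MIN = (0, 4, 1)
TESTED_MAX = (1, 4, 0)


def parse_version(version):
    """Turn a string such as '1.4.0+cu100' into a tuple like (1, 4, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    # A missing patch component (e.g. '1.4') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def in_tested_range(version):
    """Return True if the version lies within the tested PyTorch range."""
    return TESTED_MIN <= parse_version(version) <= TESTED_MAX


if __name__ == "__main__":
    # Hypothetical inputs; replace with torch.__version__ when PyTorch is installed.
    for candidate in ("0.4.1", "1.2.0", "1.4.0+cu100", "1.7.1"):
        print(candidate, in_tested_range(candidate))
```

A version outside the range is not necessarily broken; it is simply untested according to the list above.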

Installation

Clone the repository

```
git clone https://github.com/zyang-ur/ReSC.git
```

Prepare the submodules and associated data

RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame datasets: place the data, or a soft link to the dataset folder, under ./ln_data/. We follow the dataset structure of DMS. The download_dataset.sh bash script from DMS can be used to set this up.
```
bash ln_data/download_data.sh --path ./ln_data
```

Data index: download the generated index files and place them in the ./data folder. Available at [Gdrive], [One Drive].
```
rm -r data
tar xf data.tar
```
Model weights: download the pretrained model of Yolov3 and place the file in ./saved_models.
```
sh saved_models/yolov3_weights.sh
```

More pretrained models are available from the performance table [Gdrive], [One Drive] and should also be placed in ./saved_models.
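
Before training, it can help to confirm that the three folders mentioned in the steps above are in place. The following is a minimal sketch; the function name `missing_dirs` is hypothetical, and the check covers only the top-level directories, not their contents.

```python
from pathlib import Path

# Directories named in the preparation steps, relative to the repository root.
EXPECTED_DIRS = ("ln_data", "data", "saved_models")


def missing_dirs(repo_root="."):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    # is_dir() follows symbolic links, so a soft link to a dataset folder counts.
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```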

Training

To train the model, run the code from the main folder. Use the --large flag to select the ReSC-large model; ReSC-base is the default.

```
python train.py --data_root ./ln_data/ --dataset referit \
  --gpu gpu_id --resume saved_models/ReSC_base_referit.pth.tar
```

To evaluate the model, run the code from the main folder. Use the --test flag to switch to test mode.

```
python train.py --data_root ./ln_data/ --dataset referit \
  --gpu gpu_id --resume saved_models/ReSC_base_referit.pth.tar --test
```
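
When running several configurations, the commands above can be assembled programmatically. The sketch below is an illustration only: `build_command` is a hypothetical helper, not part of the repository, and it uses just the flags shown in this section (--data_root, --dataset, --gpu, --resume, --large, --test).

```python
import shlex


def build_command(dataset, gpu, resume=None, large=False, test=False,
                  data_root="./ln_data/"):
    """Assemble the train.py argument list from the flags described above."""
    args = ["python", "train.py", "--data_root", data_root,
            "--dataset", dataset, "--gpu", str(gpu)]
    if resume:
        args += ["--resume", resume]
    if large:
        args.append("--large")  # select ReSC-large instead of the default ReSC-base
    if test:
        args.append("--test")   # evaluate instead of train
    return args


if __name__ == "__main__":
    cmd = build_command("referit", gpu=0,
                        resume="saved_models/ReSC_base_referit.pth.tar",
                        test=True)
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting concerns.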

Implementation Details

We train for 100 epochs with batch size 8 on all datasets except RefCOCOg, where we find that training for 20/40 epochs gives the best performance. By default, the BERT weights are fixed during training. The language encoder can be finetuned with the --tunebert flag; we observe a small improvement on some datasets (e.g. RefCOCOg). Please see our paper for other experiment settings.

Performance and Pre-trained Models

Pre-trained models are available at [Gdrive], [One Drive].

| Dataset | Split | Ours-base (Acc@0.5) | Ours-large (Acc@0.5) |
|---|---|---|---|
| RefCOCO | val | 76.74 | 78.09 |
| | testA | 78.61 | 80.89 |
| | testB | 71.85 | 72.97 |
| RefCOCO+ | val | 63.21 | 62.97 |
| | testA | 65.94 | 67.13 |
| | testB | 56.08 | 55.43 |
| RefCOCOg | val-g | 61.12 | 62.22 |
| | val-umd | 64.89 | 67.50 |
| | test-umd | 64.01 | 66.55 |
| ReferItGame | val | 66.78 | 67.15 |
| | test | 64.33 | 64.70 |
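
For readers unfamiliar with the metric, Acc@0.5 is commonly computed as the percentage of predicted boxes whose intersection over union (IoU) with the ground-truth box is at least 0.5. The sketch below illustrates that common definition and is not the repository's evaluation code. It assumes boxes in (x1, y1, x2, y2) corner format and a non-strict threshold; the repository may differ on both points.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth reaches threshold."""
    if not predictions:
        return 0.0
    hits = sum(1 for pred, gt in zip(predictions, ground_truths)
               if box_iou(pred, gt) >= threshold)
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    # Made-up boxes for illustration: one exact match and one disjoint box.
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(accuracy_at_iou(preds, gts))  # 50.0
```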

Citation

@inproceedings{yang2020improving,
  title={Improving One-stage Visual Grounding by Recursive Sub-query Construction},
  author={Yang, Zhengyuan and Chen, Tianlang and Wang, Liwei and Luo, Jiebo},
  booktitle={ECCV},
  year={2020}
}
@inproceedings{yang2019fast,
  title={A Fast and Accurate One-Stage Approach to Visual Grounding},
  author={Yang, Zhengyuan and Gong, Boqing and Wang, Liwei and Huang, Wenbing and Yu, Dong and Luo, Jiebo},
  booktitle={ICCV},
  year={2019}
}

Credits

Our code is built on Onestage-VG.

Parts of the code and models are from DMS, film, MAttNet, Yolov3, and Pytorch-yolov3.

MIT License
