kayoyin / mcan-vqa

Deep Modular Co-Attention Networks for Visual Question Answering

Deep Modular Co-Attention Networks (MCAN)

This repository is a PyTorch implementation of MCAN for VQA, the winner of the VQA Challenge 2019. With an ensemble of 27 models, we achieved an overall accuracy of 75.23% and 75.26% on the test-std and test-challenge splits, respectively. See our slides for details.

Using the commonly used bottom-up-attention visual features, a single MCAN model delivers 70.70% (small model) and 70.93% (large model) overall accuracy on the test-dev split of the VQA-v2 dataset, which significantly outperforms existing state-of-the-art methods. Please check our paper for details.

Overview of MCAN

Changes on this fork by Kayo:

  • Only download and use the validation set
  • Edit the core to allow running on CPU
  • Add a script for validation

Prerequisites

Setup

The image features are extracted using the bottom-up-attention strategy, with each image represented as a dynamic number (from 10 to 100) of 2048-D features. We store the features for each image in a .npz file. Download the extracted features from OneDrive or BaiduYun. Download the val2014.tar.gz package, which contains the features for the val images of VQA-v2, and place it as follows (a loading sketch is given below the directory tree):

|-- datasets
	|-- coco_extract
	|  |-- val2014.tar.gz
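
Below is a minimal sketch of reading one of these feature files with numpy; the 'x' key and the stored (2048, num_boxes) orientation follow the upstream MCAN loader and are assumptions you may need to verify against your extracted archive.

import glob
import numpy as np

# Pick an arbitrary extracted feature file from the val2014 folder.
feat_path = glob.glob('datasets/coco_extract/val2014/*.npz')[0]

# The array under 'x' is assumed to be stored as (2048, num_boxes);
# transpose to (num_boxes, 2048), with 10 to 100 boxes per image.
img_feat = np.load(feat_path)['x'].transpose((1, 0))
print(feat_path, img_feat.shape)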

In addition, we use the VQA samples from the Visual Genome dataset to augment the training samples. Similar to existing strategies, we preprocess the samples using two rules (a sketch of this filtering is shown after the list):

  1. Select the QA pairs whose corresponding images appear in the MSCOCO train and val splits.
  2. Select the QA pairs whose answers appear in the processed answer list (i.e., answers that occur more than 8 times across VQA-v2).
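
The sketch below only illustrates these two rules; the file names and JSON fields are hypothetical placeholders, since the actual preprocessing was done upstream and only its output is distributed here.

import json

# Hypothetical inputs: MSCOCO train+val image ids and the processed answer
# list (answers occurring more than 8 times across VQA-v2).
coco_image_ids = set(json.load(open('coco_trainval_image_ids.json')))
answer_list = set(json.load(open('answer_list.json')))

# Hypothetical raw Visual Genome QA dump.
vg_qas = json.load(open('vg_raw_qas.json'))

kept = [
    qa for qa in vg_qas
    if qa['image_id'] in coco_image_ids   # rule 1: image in MSCOCO train/val
    and qa['answer'] in answer_list       # rule 2: answer in the processed list
]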

For convenience, we provide our processed VG questions and annotations files. You can download them from OneDrive or BaiduYun, then unzip and place them as follows:

|-- datasets
	|-- vqa
	|  |-- VG_questions.json
	|  |-- VG_annotations.json

After that, you can run the following script to set up all the configurations needed for the experiments:

$ virtualenv --python=python3 venv
$ source venv/bin/activate
$ sh setup.sh

Running the script will:

  1. Download the QA files for VQA-v2.
  2. Install the required packages.

Finally, the datasets folder will have the following structure:

|-- datasets
	|-- coco_extract
	|  |-- val2014
	|  |  |-- COCO_val2014_...jpg.npz
	|  |  |-- ...
	|-- vqa
	|  |-- v2_OpenEnded_mscoco_val2014_questions.json
	|  |-- v2_mscoco_val2014_annotations.json
	|  |-- VG_questions.json
	|  |-- VG_annotations.json

Download the pretrained small model from OneDrive or BaiduYun, then unzip it and place it in the correct folder as follows (a CPU loading sketch follows the tree):

|-- ckpts
	|-- ckpt_small
	|  |-- epoch13.pkl
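
Since this fork targets CPU use, here is a minimal sketch of loading these weights without a GPU; the checkpoint layout (a dict with a 'state_dict' entry) is an assumption based on the upstream training code.

import torch

# map_location='cpu' lets the GPU-trained checkpoint load on a CPU-only machine.
ckpt = torch.load('ckpts/ckpt_small/epoch13.pkl', map_location='cpu')

# Assumed layout: the model weights sit under 'state_dict'; fall back to the
# raw object if the checkpoint is already a plain state dict.
state_dict = ckpt['state_dict'] if isinstance(ckpt, dict) and 'state_dict' in ckpt else ckpt
print(len(state_dict), 'tensors loaded')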

Validation and Testing

Run:

$ sh validate.sh

Citation

If this repository is helpful for your research, we'd really appreciate it if you could cite the following paper:

@inProceedings{yu2019mcan,
  author = {Yu, Zhou and Yu, Jun and Cui, Yuhao and Tao, Dacheng and Tian, Qi},
  title = {Deep Modular Co-Attention Networks for Visual Question Answering},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages = {6281--6290},
  year = {2019}
}

License: Apache License 2.0

