This is the implementation of our ACL 2021 paper "Check It Again: Progressive Visual Question Answering via Visual Entailment". This repository contains code modified from the SSL repository (for SAR+SSL) and the LMH repository (for SAR+LMH), many thanks!
- python 3.7.6
- pytorch 1.5.0
- zarr
- tqdm
- spacy
- h5py
cd data
bash download.sh
python preprocess_image.py --data trainval
python create_dictionary.py --dataroot vqacp2/
python preprocess_text.py --dataroot vqacp2/ --version v2
cd ..
- The VQA model applied as the Candidate Answer Selector (CAS) is a free choice in our framework. In this paper, we mainly use SSL as CAS.
- For the training settings of CAS, please refer to SSL.
- To build the dataset for the Answer Re-ranking module based on Visual Entailment, we modified SSL's `VQAFeatureDataset()` in `dataset_vqacp.py` and `evaluate()` in `train.py`. The modified code is available in `CAS_scripts`; just replace the corresponding class/function in SSL.
- After the Candidate Answer Selecting module, we obtain `train_top20_candidates.json` and `test_top20_candidates.json` as the training and test sets for the Answer Re-ranking module, respectively. Demos of the two output json files are provided in the `data4VE` folder: `train_dataset4VE_demo.json` and `test_dataset4VE_demo.json`. A minimal sketch of this candidate-dumping step is shown below.
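The modification to `evaluate()` essentially keeps the top-20 scored answers per question instead of only the arg-max prediction. Below is a minimal sketch of that idea; the function name `dump_topk_candidates`, the batch layout, and the output fields are illustrative assumptions, not the actual code in `CAS_scripts`:

```python
import json
import torch

@torch.no_grad()
def dump_topk_candidates(model, dataloader, label2ans, out_path, k=20):
    """Run the trained CAS (e.g. SSL) and keep its top-k answers per question."""
    model.eval()
    results = []
    for v, q, _, qids in dataloader:         # batch layout assumed to follow SSL
        logits = model(v.cuda(), q.cuda())   # [batch_size, num_answers]
        scores, idx = logits.topk(k, dim=1)  # keep the k highest-scoring answers
        for qid, row_idx, row_score in zip(qids, idx, scores):
            results.append({
                "question_id": int(qid),
                "top20_answers": [label2ans[i] for i in row_idx.tolist()],
                "top20_scores": row_score.tolist(),
            })
    with open(out_path, "w") as f:
        json.dump(results, f)
```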
Build the Top20-Candidate-Answers dataset (entries) for training/testing the Answer Re-ranking module
If you don't want to train a CAS model (e.g. SSL) to build the datasets in the way mentioned above, you can download the rebuilt top20-candidate-answers datasets (with different Question-Answer-Combination strategies) from here (C-train, C-test, R-train, R-test).
- Put the downloaded Pickle files into the `data4VE` folder; the code will then load and rebuild them into the `entries` that are fed to `__getitem__()` of the dataloader, directly skipping all data preprocessing steps of the Answer Re-ranking based on Visual Entailment.
- Each entry rebuilt from these Pickle files includes `image_features`, `image_spatials`, `top20_score`, `question_id`, `QA_text_ids`, `top20_label`, `answer_type`, `question_text` and `LMH_bias`, where `QA_text_ids` are the question-answer-combination (R/C) ids obtained/preprocessed with the LXMERT tokenizer. A rough loading sketch is shown below.
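As a rough illustration of how those entries can be inspected once the Pickle files are in place (assuming each entry is a dict keyed by the field names above; the real on-disk layout and file names may differ):

```python
import pickle

# Placeholder file name: use whichever of the downloaded files
# (C-train / C-test / R-train / R-test) you put into data4VE.
with open("data4VE/R-train.pkl", "rb") as f:
    entries = pickle.load(f)

# Each entry is later served by __getitem__() of the dataloader;
# here we assume a dict per entry carrying the fields listed above.
sample = entries[0]
for key in ("image_features", "image_spatials", "top20_score", "question_id",
            "QA_text_ids", "top20_label", "answer_type", "question_text", "LMH_bias"):
    print(key, type(sample[key]))
```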
- Train Top12-SAR
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 0 --train_condi_ans_num 12
- Train Top20-SAR
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 0 --train_condi_ans_num 20
- Train Top12-SAR+SSL
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 1 --self_loss_weight 3 --train_condi_ans_num 12
- Train Top20-SAR+SSL
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 1 --self_loss_weight 3 --train_condi_ans_num 20
- Train Top12-SAR+LMH
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 2 --train_condi_ans_num 12
- Train Top20-SAR+LMH
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 2 --train_condi_ans_num 20
The function `evaluate()` in `SAR_train.py` is used to select the best model during training; it does not include the QTD module yet. The trained QTD model is used in `SAR_test.py`, where we obtain the final test score.
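At test time, the QTD module predicts whether a question is a yes/no question, and the number of candidate answers passed to the re-ranker is chosen accordingly via `--QTD_N4yesno` / `--QTD_N4non_yesno`. A minimal sketch of that selection step; `qtd_model` and its `predict()` interface are assumptions, not the actual classes in `SAR_test.py`:

```python
def candidates_to_rerank(question_text, candidates, qtd_model,
                         n_yesno=1, n_non_yesno=12):
    """Truncate the CAS candidate list according to the predicted question type.

    qtd_model is assumed to expose predict(question_text) returning either
    "yes/no" or "non yes/no"; the real QTD interface in this repo may differ.
    """
    qtype = qtd_model.predict(question_text)
    n = n_yesno if qtype == "yes/no" else n_non_yesno
    return candidates[:n]
```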
- Evaluate trained SAR model
CUDA_VISIBLE_DEVICES=0 python SAR_test.py --checkpoint_path4test saved_models_cp2/SAR_top12_best_model.pth --output saved_models_cp2/result/ --lp 0 --QTD_N4yesno 1 --QTD_N4non_yesno 12
- Evaluate trained SAR+SSL model
CUDA_VISIBLE_DEVICES=0 python SAR_test.py --checkpoint_path4test saved_models_cp2/SAR_SSL_top12_best_model.pth --output saved_models_cp2/result/ --lp 1 --QTD_N4yesno 1 --QTD_N4non_yesno 12
- Evaluate trained SAR+LMH model
CUDA_VISIBLE_DEVICES=0 python SAR_test.py --checkpoint_path4test saved_models_cp2/SAR_LMH_top12_best_model.pth --output saved_models_cp2/result/ --lp 2 --QTD_N4yesno 2 --QTD_N4non_yesno 12
- Note that we mainly use the `R->C` Question-Answer Combination Strategy, which always achieves or rivals the best performance for SAR/SAR+SSL/SAR+LMH. Specifically, we first use strategy `R` (`SAR_replace_dataset_vqacp.py`) at training time, which aims to prevent the model from excessively focusing on the co-occurrence relation between question category and answer, and then use strategy `C` (`SAR_concatenate_dataset_vqacp.py`) at test time to introduce more information for inference. A toy illustration of the two strategies is given after this list.
- Compute detailed accuracy for each answer type:
python comput_score.py --input saved_models_cp2/result/XX.json --dataroot data/vqacp2/cache
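To make the two combination strategies concrete: `C` simply concatenates the question and a candidate answer, while `R` rewrites the question into a roughly declarative statement by substituting the answer for the interrogative part before tokenization with the LXMERT tokenizer. The toy functions below only illustrate the idea; the actual rewriting rules live in `SAR_replace_dataset_vqacp.py` and `SAR_concatenate_dataset_vqacp.py` and are considerably more elaborate:

```python
def combine_concatenate(question: str, answer: str) -> str:
    # Strategy C: question and candidate answer are simply joined.
    return f"{question} {answer}"

def combine_replace(question: str, answer: str) -> str:
    # Strategy R (toy version): drop a leading interrogative phrase and splice
    # the candidate answer in, weakening the question-category/answer shortcut.
    lowered = question.lower().rstrip("?").strip()
    for prefix in ("what color is", "what is", "how many", "where is", "what"):
        if lowered.startswith(prefix):
            return f"{answer} {lowered[len(prefix):].strip()}"
    return f"{question} {answer}"  # fall back to concatenation

print(combine_replace("What color is the bird?", "red"))      # -> "red the bird"
print(combine_concatenate("What color is the bird?", "red"))  # -> "What color is the bird? red"
```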
If you have any questions related to the code or the paper, feel free to email Qingyi (siqingyi@iie.ac.cn). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and more quickly!
If you find this code useful, please cite the following paper:
@inproceedings{si-etal-2021-check,
title = "Check It Again:Progressive Visual Question Answering via Visual Entailment",
author = "Si, Qingyi and
Lin, Zheng and
Zheng, Mingyu and
Fu, Peng and
Wang, Weiping",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.317",
doi = "10.18653/v1/2021.acl-long.317",
pages = "4101--4110",
abstract = "While sophisticated neural-based models have achieved remarkable success in Visual Question Answering (VQA), these models tend to answer questions only according to superficial correlations between question and answer. Several recent approaches have been developed to address this language priors problem. However, most of them predict the correct answer according to one best output without checking the authenticity of answers. Besides, they only explore the interaction between image and question, ignoring the semantics of candidate answers. In this paper, we propose a select-and-rerank (SAR) progressive framework based on Visual Entailment. Specifically, we first select the candidate answers relevant to the question or the image, then we rerank the candidate answers by a visual entailment task, which verifies whether the image semantically entails the synthetic statement of the question and each candidate answer. Experimental results show the effectiveness of our proposed framework, which establishes a new state-of-the-art accuracy on VQA-CP v2 with a 7.55{\%} improvement.",
}