This repo hosts the source code for the baseline models described in WebQA: Multihop and Multimodal QA.

All models were initialized from the released VLP checkpoints.

We release checkpoints fine-tuned on WebQA here.


Update (29 Sep, 2021):

This update clarifies the arguments --gold_feature_folder, --distractor_feature_folder, and --x_distractor_feature_folder. In our implementation, images are divided into three buckets: positive images for image-based queries (gold), negative images for image-based queries (distractors), and negative images for text-based queries (x_distractors), where the 'x' stands for 'cross-modality'. Image- and text-based queries can be distinguished via the "Qcate" field in the dataset file: text-based queries all have Qcate == 'text', and all other queries are image-based.
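The bucketing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the repo's actual loader: the function names, the `BUCKET_TO_FLAG` mapping, and the sample dictionaries are hypothetical, and only the "Qcate" field and the three flag names come from this README. The sketch also assumes that text-based queries have no positive images, so that case maps to no bucket.

```python
from typing import Optional

# Hypothetical mapping from bucket name to the command-line flag that
# points at the folder holding that bucket's image features.
BUCKET_TO_FLAG = {
    "gold": "--gold_feature_folder",
    "distractor": "--distractor_feature_folder",
    "x_distractor": "--x_distractor_feature_folder",
}


def is_text_query(sample: dict) -> bool:
    """Text-based queries are exactly those with Qcate == 'text'."""
    return sample["Qcate"] == "text"


def image_bucket(sample: dict, is_positive: bool) -> Optional[str]:
    """Return the bucket an image belongs to for a given query.

    Assumption: a text-based query has no positive images, so that
    combination returns None instead of a bucket name.
    """
    if is_text_query(sample):
        return None if is_positive else "x_distractor"
    return "gold" if is_positive else "distractor"


# Illustrative samples; "color" is a placeholder for any non-text Qcate value.
image_query = {"Qcate": "color"}
text_query = {"Qcate": "text"}

print(image_bucket(image_query, is_positive=True))
print(image_bucket(image_query, is_positive=False))
print(image_bucket(text_query, is_positive=False))
```

Keeping the cross-modality negatives in their own bucket lets them be enabled or disabled independently, which is consistent with the separate --use_x_distractors switch in the commands below.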


cd VLP
conda env create -f misc/vlp.yml --prefix /home/<username>/miniconda3/envs/vlp
conda activate vlp

Clone repo

cd apex
git reset --hard 1603407bf49c7fc3da74fceb6a6c7b47fece2ef8
python install --cuda_ext --cpp_ext
pip install datasets==1.7.0
pip install opencv-python== 

Visual Features Extraction

  • X101fpn

The detectron2-based feature extraction code is available under this repo. Part of the code is based on LuoweiZhou/detectron-vlp and facebookresearch/detectron2.

Download checkpoint

  • VinVL

Please refer to pzzhang/VinVL and microsoft/scene_graph_benchmark.

Download checkpoint


cd vlp

Retrieval training

python --new_segment_ids --train_batch_size 128 --split train --answer_provided_by 'img|txt' --task_to_learn 'filter' --num_workers 4 --max_pred 10 --mask_prob 1.0 --learning_rate 3e-5 --gradient_accumulation_steps 128 --save_loss_curve --output_dir light_output/filter_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/filter_debug --use_x_distractors --do_train --num_train_epochs 6
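
The command above pairs --train_batch_size 128 with --gradient_accumulation_steps 128. The sketch below shows how these two values would interact under one common convention, used by many BERT-derived fine-tuning scripts, in which --train_batch_size is the effective batch size and each forward pass processes train_batch_size / gradient_accumulation_steps samples. Whether this repo's training script follows that convention is an assumption not confirmed by this README, and the helper name is hypothetical.

```python
def per_step_batch_size(train_batch_size: int, gradient_accumulation_steps: int) -> int:
    """Samples per forward/backward pass, ASSUMING the effective batch size
    is split evenly across the gradient accumulation steps."""
    if gradient_accumulation_steps < 1:
        raise ValueError("gradient_accumulation_steps must be >= 1")
    if train_batch_size % gradient_accumulation_steps != 0:
        raise ValueError("train_batch_size must be divisible by gradient_accumulation_steps")
    return train_batch_size // gradient_accumulation_steps


# Values taken from the retrieval training command above.
micro_batch = per_step_batch_size(train_batch_size=128, gradient_accumulation_steps=128)
print(micro_batch)  # samples held in GPU memory per step under this assumption
```

Under that assumption, lowering --gradient_accumulation_steps raises the per-step memory footprint while leaving the effective batch size unchanged.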

Retrieval inference

python --new_segment_ids --train_batch_size 16 --split val --answer_provided_by 'img|txt' --task_to_learn 'filter' --num_workers 4 --max_pred 10 --mask_prob 1.0 --learning_rate 3e-5 --gradient_accumulation_steps 8 --save_loss_curve --output_dir light_output/filter_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/filter_debug --recover_step 3 --use_x_distractors

QA training

python --new_segment_ids --do_train --train_batch_size 128 --split train --answer_provided_by 'img|txt' --task_to_learn 'qa' --num_workers 4 --max_pred 50 --mask_prob 0.5 --learning_rate 1e-4 --gradient_accumulation_steps 64 --save_loss_curve --num_train_epochs 16 --output_dir light_output/qa_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/qa_debug

QA decode

python --new_segment_ids --batch_size 32 --answer_provided_by "img|txt" --beam_size 5 --split "test" --num_workers 4 --output_dir light_output/qa_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/qa_debug --no_eval --recover_step 11

With VinVL features, run or instead.


Please acknowledge the following paper if you use the code:

 title ={{WebQA: Multihop and Multimodal QA}},
 author={Yingshan Chang and Mridu Narang and
         Hisami Suzuki and Guihong Cao and
         Jianfeng Gao and Yonatan Bisk},
 journal = {ArXiv},
 year = {2021},
 url  = {}

Related Projects/Codebase


Our code is mainly based on Zhou et al.'s VLP repo. We thank the authors for their valuable work.


