This repo hosts the source code for the baseline models described in WebQA: Multihop and Multimodal QA.
All models were initialized from the released VLP checkpoints.
We release checkpoints fine-tuned on WebQA here.
Update (29 Sep, 2021):
We clarify here what the arguments `--gold_feature_folder`, `--distractor_feature_folder`, and `--x_distractor_feature_folder` refer to. During implementation we divide the images into three buckets: positive images for image-based queries (`gold`), negative images for image-based queries (`distractors`), and negative images for text-based queries (`x_distractors`), where the 'x' stands for 'cross-modality'. Image- and text-based queries can be distinguished via the `"Qcate"` field in the dataset file. Text-based queries all have `Qcate == 'text'`, while the rest are image-based.
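For concreteness, below is a minimal sketch of how the three buckets can be derived from the released dataset file. The file name and field names (`Qcate`, `img_posFacts`, `img_negFacts`, `image_id`) are assumptions based on the public WebQA data release; adjust them if your copy differs.

```python
# Minimal sketch: split image ids into gold / distractor / x_distractor buckets.
# File and field names are assumptions based on the public WebQA release.
import json

with open("WebQA_train_val.json") as f:  # illustrative path to the dataset file
    data = json.load(f)

gold, distractors, x_distractors = set(), set(), set()
for guid, sample in data.items():
    neg_ids = {fact["image_id"] for fact in sample.get("img_negFacts", [])}
    if sample["Qcate"] == "text":
        # text-based query: its negative images are cross-modality distractors
        x_distractors |= neg_ids
    else:
        # image-based query: positive images are gold, negative images are distractors
        gold |= {fact["image_id"] for fact in sample.get("img_posFacts", [])}
        distractors |= neg_ids

print(len(gold), "gold |", len(distractors), "distractors |", len(x_distractors), "x_distractors")
```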
```bash
cd VLP
conda env create -f misc/vlp.yml --prefix /home/<username>/miniconda3/envs/vlp
conda activate vlp

git clone https://github.com/NVIDIA/apex
cd apex
git reset --hard 1603407bf49c7fc3da74fceb6a6c7b47fece2ef8
python setup.py install --cuda_ext --cpp_ext

pip install datasets==1.7.0
pip install opencv-python==3.4.2.17
```
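After the environment is built, a quick sanity check along these lines can confirm that PyTorch sees the GPU and that apex was compiled with its CUDA/C++ extensions. This snippet is only a convenience, not part of the original setup.

```python
# Optional environment sanity check (not part of the original setup steps).
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    from apex import amp  # noqa: F401  mixed-precision utilities used by VLP-style training
    import amp_C          # noqa: F401  only importable when apex was built with --cuda_ext
    print("apex with CUDA/C++ extensions: OK")
except ImportError as err:
    print("apex is not fully installed:", err)
```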
- X101fpn: The detectron2-based feature extraction code is available under this repo. Part of the code is based on LuoweiZhou/detectron-vlp and facebookresearch/detectron2.
- VinVL: Please refer to pzzhang/VinVL and microsoft/scene_graph_benchmark.
```bash
cd vlp
```

Retrieval training

```bash
python run_webqa.py --new_segment_ids --train_batch_size 128 --split train --answer_provided_by 'img|txt' --task_to_learn 'filter' --num_workers 4 --max_pred 10 --mask_prob 1.0 --learning_rate 3e-5 --gradient_accumulation_steps 128 --save_loss_curve --output_dir light_output/filter_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/filter_debug --use_x_distractors --do_train --num_train_epochs 6
```
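A note on the flags: in the VLP codebase this repo builds on, `--train_batch_size` is the global batch size and is divided by `--gradient_accumulation_steps` to get the number of examples per forward pass. Assuming `run_webqa.py` keeps that convention (worth verifying in the script), the command above processes one example per forward pass and updates the weights once every 128 examples.

```python
# Effective batching under the VLP/BERT fine-tuning convention (an assumption
# about run_webqa.py; check the argument handling in the script if unsure).
train_batch_size = 128              # --train_batch_size (global batch per optimizer step)
gradient_accumulation_steps = 128   # --gradient_accumulation_steps

micro_batch = train_batch_size // gradient_accumulation_steps
print("examples per forward pass:", micro_batch)         # 1
print("examples per optimizer step:", train_batch_size)  # 128
```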
Retrieval inference

```bash
python run_webqa.py --new_segment_ids --train_batch_size 16 --split val --answer_provided_by 'img|txt' --task_to_learn 'filter' --num_workers 4 --max_pred 10 --mask_prob 1.0 --learning_rate 3e-5 --gradient_accumulation_steps 8 --save_loss_curve --output_dir light_output/filter_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/filter_debug --recover_step 3 --use_x_distractors
```
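`--recover_step` selects which saved checkpoint to load from `--ckpts_dir`. Assuming the VLP convention of naming checkpoints `model.<step>.bin` (verify against your checkpoint folder), a small helper like the following lists the steps that are available to recover from.

```python
# List available --recover_step values, assuming checkpoints are saved as
# model.<step>.bin inside --ckpts_dir (VLP convention; adjust if yours differ).
import re
from pathlib import Path

ckpts_dir = Path("/data/yingshac/MMMHQA/ckpts/filter_debug")  # same path as --ckpts_dir above

steps = []
for p in ckpts_dir.glob("model.*.bin"):
    match = re.fullmatch(r"model\.(\d+)\.bin", p.name)
    if match:
        steps.append(int(match.group(1)))
print("available recover steps:", sorted(steps))
```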
QA training

```bash
python run_webqa.py --new_segment_ids --do_train --train_batch_size 128 --split train --answer_provided_by 'img|txt' --task_to_learn 'qa' --num_workers 4 --max_pred 50 --mask_prob 0.5 --learning_rate 1e-4 --gradient_accumulation_steps 64 --save_loss_curve --num_train_epochs 16 --output_dir light_output/qa_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/qa_debug
```
QA decode

```bash
python decode_webqa.py --new_segment_ids --batch_size 32 --answer_provided_by "img|txt" --beam_size 5 --split "test" --num_workers 4 --output_dir light_output/qa_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/qa_debug --no_eval --recover_step 11
```
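If you want to decode several checkpoints in a row, a small driver like the one below simply re-issues the command above while varying `--recover_step`. The step list is illustrative; use whatever checkpoints exist in your `--ckpts_dir`.

```python
# Convenience wrapper: decode multiple checkpoints by varying --recover_step.
# It only repeats the documented decode command; edit the step list as needed.
import subprocess

for step in [9, 10, 11]:  # illustrative recover steps
    subprocess.run(
        [
            "python", "decode_webqa.py", "--new_segment_ids", "--batch_size", "32",
            "--answer_provided_by", "img|txt", "--beam_size", "5", "--split", "test",
            "--num_workers", "4", "--output_dir", "light_output/qa_debug",
            "--ckpts_dir", "/data/yingshac/MMMHQA/ckpts/qa_debug",
            "--no_eval", "--recover_step", str(step),
        ],
        check=True,
    )
```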
With VinVL features, run `run_webqa_vinvl.py` or `decode_webqa_vinvl.py` instead.
Please acknowledge the following paper if you use the code:
```
@article{WebQA21,
  title   = {{WebQA: Multihop and Multimodal QA}},
  author  = {Yingshan Chang and Mridu Narang and Hisami Suzuki and
             Guihong Cao and Jianfeng Gao and Yonatan Bisk},
  journal = {ArXiv},
  year    = {2021},
  url     = {https://arxiv.org/abs/2109.00590}
}
```
- VinVL Visual Representations: https://github.com/pzzhang/VinVL
- Scene Graph Benchmark: https://github.com/microsoft/scene_graph_benchmark
- Detectron2: https://github.com/facebookresearch/detectron2
Our code is mainly based on Zhou et al.'s VLP repo. We thank the authors for their valuable work.