This repo hosts the source code for the baseline models described in WebQA: Multihop and Multimodal QA.
All models were initialized from the released VLP checkpoints.
We release checkpoints fine-tuned on WebQA here.
Update (29 Sep, 2021):
We clarify here what the arguments `--gold_feature_folder`, `--distractor_feature_folder`, and `--x_distractor_feature_folder` refer to. During implementation we divide the images into three buckets: positive images for image-based queries (`gold`), negative images for image-based queries (`distractors`), and negative images for text-based queries (`x_distractors`), where the 'x' stands for 'cross-modality'. Image- and text-based queries can be distinguished via the `"Qcate"` field in the dataset file. Text-based queries all have `Qcate == 'text'`, while the rest are image-based.
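For concreteness, below is a minimal sketch of how the three buckets can be derived from the released dataset file. The file name and field names (`Qcate`, `img_posFacts`, `img_negFacts`, `image_id`) are assumptions based on the public WebQA data release; adjust them if your copy differs.

```python
# Minimal sketch: split image ids into gold / distractor / x_distractor buckets.
# File and field names are assumptions based on the public WebQA release.
import json

with open("WebQA_train_val.json") as f:  # illustrative path to the dataset file
    data = json.load(f)

gold, distractors, x_distractors = set(), set(), set()
for guid, sample in data.items():
    neg_ids = {fact["image_id"] for fact in sample.get("img_negFacts", [])}
    if sample["Qcate"] == "text":
        # text-based query: its negative images are cross-modality distractors
        x_distractors |= neg_ids
    else:
        # image-based query: positive images are gold, negative images are distractors
        gold |= {fact["image_id"] for fact in sample.get("img_posFacts", [])}
        distractors |= neg_ids

print(len(gold), "gold |", len(distractors), "distractors |", len(x_distractors), "x_distractors")
```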
```bash
cd VLP
conda env create -f misc/vlp.yml --prefix /home/<username>/miniconda3/envs/vlp
conda activate vlp

git clone https://github.com/NVIDIA/apex
cd apex
git reset --hard 1603407bf49c7fc3da74fceb6a6c7b47fece2ef8
python setup.py install --cuda_ext --cpp_ext

pip install datasets==1.7.0
pip install opencv-python==3.4.2.17
```
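After the environment is built, a quick sanity check along these lines can confirm that PyTorch sees the GPU and that apex was compiled with its CUDA/C++ extensions. This snippet is only a convenience, not part of the original setup.

```python
# Optional environment sanity check (not part of the original setup steps).
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    from apex import amp  # noqa: F401  mixed-precision utilities used by VLP-style training
    import amp_C          # noqa: F401  only importable when apex was built with --cuda_ext
    print("apex with CUDA/C++ extensions: OK")
except ImportError as err:
    print("apex is not fully installed:", err)
```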
- X101fpn: The detectron2-based feature extraction code is available under this repo. Part of the code is based on LuoweiZhou/detectron-vlp and facebookresearch/detectron2.
- VinVL: Please refer to pzzhang/VinVL and microsoft/scene_graph_benchmark.
```bash
cd vlp
```

Retrieval training

```bash
python run_webqa.py --new_segment_ids --train_batch_size 128 --split train --answer_provided_by 'img|txt' --task_to_learn 'filter' --num_workers 4 --max_pred 10 --mask_prob 1.0 --learning_rate 3e-5 --gradient_accumulation_steps 128 --save_loss_curve --output_dir light_output/filter_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/filter_debug --use_x_distractors --do_train --num_train_epochs 6
```
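A note on the flags: in the VLP codebase this repo builds on, `--train_batch_size` is the global batch size and is divided by `--gradient_accumulation_steps` to get the number of examples per forward pass. Assuming `run_webqa.py` keeps that convention (worth verifying in the script), the command above processes one example per forward pass and updates the weights once every 128 examples.

```python
# Effective batching under the VLP/BERT fine-tuning convention (an assumption
# about run_webqa.py; check the argument handling in the script if unsure).
train_batch_size = 128              # --train_batch_size (global batch per optimizer step)
gradient_accumulation_steps = 128   # --gradient_accumulation_steps

micro_batch = train_batch_size // gradient_accumulation_steps
print("examples per forward pass:", micro_batch)         # 1
print("examples per optimizer step:", train_batch_size)  # 128
```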
Retrieval inference

```bash
python run_webqa.py --new_segment_ids --train_batch_size 16 --split val --answer_provided_by 'img|txt' --task_to_learn 'filter' --num_workers 4 --max_pred 10 --mask_prob 1.0 --learning_rate 3e-5 --gradient_accumulation_steps 8 --save_loss_curve --output_dir light_output/filter_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/filter_debug --recover_step 3 --use_x_distractors
```
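`--recover_step` selects which saved checkpoint to load from `--ckpts_dir`. Assuming the VLP convention of naming checkpoints `model.<step>.bin` (verify against your checkpoint folder), a small helper like the following lists the steps that are available to recover from.

```python
# List available --recover_step values, assuming checkpoints are saved as
# model.<step>.bin inside --ckpts_dir (VLP convention; adjust if yours differ).
import re
from pathlib import Path

ckpts_dir = Path("/data/yingshac/MMMHQA/ckpts/filter_debug")  # same path as --ckpts_dir above

steps = []
for p in ckpts_dir.glob("model.*.bin"):
    match = re.fullmatch(r"model\.(\d+)\.bin", p.name)
    if match:
        steps.append(int(match.group(1)))
print("available recover steps:", sorted(steps))
```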
QA training

```bash
python run_webqa.py --new_segment_ids --do_train --train_batch_size 128 --split train --answer_provided_by 'img|txt' --task_to_learn 'qa' --num_workers 4 --max_pred 50 --mask_prob 0.5 --learning_rate 1e-4 --gradient_accumulation_steps 64 --save_loss_curve --num_train_epochs 16 --output_dir light_output/qa_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/qa_debug
```
QA decode

```bash
python decode_webqa.py --new_segment_ids --batch_size 32 --answer_provided_by "img|txt" --beam_size 5 --split "test" --num_workers 4 --output_dir light_output/qa_debug --ckpts_dir /data/yingshac/MMMHQA/ckpts/qa_debug --no_eval --recover_step 11
```
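If you want to decode several checkpoints in a row, a small driver like the one below simply re-issues the command above while varying `--recover_step`. The step list is illustrative; use whatever checkpoints exist in your `--ckpts_dir`.

```python
# Convenience wrapper: decode multiple checkpoints by varying --recover_step.
# It only repeats the documented decode command; edit the step list as needed.
import subprocess

for step in [9, 10, 11]:  # illustrative recover steps
    subprocess.run(
        [
            "python", "decode_webqa.py", "--new_segment_ids", "--batch_size", "32",
            "--answer_provided_by", "img|txt", "--beam_size", "5", "--split", "test",
            "--num_workers", "4", "--output_dir", "light_output/qa_debug",
            "--ckpts_dir", "/data/yingshac/MMMHQA/ckpts/qa_debug",
            "--no_eval", "--recover_step", str(step),
        ],
        check=True,
    )
```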
With VinVL features, run `run_webqa_vinvl.py` or `decode_webqa_vinvl.py` instead.
Please acknowledge the following paper if you use the code:
```
@article{WebQA21,
  title   = {{WebQA: Multihop and Multimodal QA}},
  author  = {Yingshan Chang and Mridu Narang and Hisami Suzuki and
             Guihong Cao and Jianfeng Gao and Yonatan Bisk},
  journal = {ArXiv},
  year    = {2021},
  url     = {https://arxiv.org/abs/2109.00590}
}
```
- VinVL Visual Representations: https://github.com/pzzhang/VinVL
- Scene Graph Benchmark: https://github.com/microsoft/scene_graph_benchmark
- Detectron2: https://github.com/facebookresearch/detectron2
Our code is mainly based on Zhou et al.'s VLP repo. We thank the authors for their valuable work.