
SBVQA 2.0 Official Implementation

This is the official implementation of our paper:

SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions

How to run?

Coming soon!

Data

Audio files

SBVQA 2.0 dataset = SBVQA 1.0 dataset + the complementary spoken questions

  • SBVQA 1.0 original data (identical copy of the data from zted/sbvqa repo): Download
  • The complementary spoken questions: Download

Also, you can download mp3_files_by_question.pkl, a mapper whose keys are the cleaned textual questions and whose values are the corresponding .mp3 file names, from this link.

To load the mapper, use the following code snippet:

import re
import pickle

def clean_question(text):
    # Lowercase the question, strip every character except letters and spaces,
    # then collapse repeated whitespace into single spaces.
    text = text.lower()
    return ' '.join(re.sub(r"[^a-zA-Z ]", "", text, flags=re.UNICODE).split())

with open('mp3_files_by_question.pkl', 'rb') as f:
    mp3_files_by_question_mapper = pickle.load(f)

textual_question = 'Is this a modern interior?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000010.mp3'

textual_question = 'Where can milk be obtained?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000011.mp3'

textual_question = 'What are the payment method of the parking meter?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000012.mp3'
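Note that the mapper is keyed by the cleaned question text, so indexing it with a raw question string raises a KeyError whenever the cleaned form is absent. A minimal sketch of a tolerant lookup (the lookup_mp3 helper is illustrative, not part of this repo; the toy mapper below stands in for the real pickle):

```python
import re

def clean_question(text):
    # Same normalization as above: lowercase, keep letters and spaces only.
    text = text.lower()
    return ' '.join(re.sub(r"[^a-zA-Z ]", "", text, flags=re.UNICODE).split())

def lookup_mp3(mapper, question):
    # Return the .mp3 file name for a question, or None if it is unknown.
    return mapper.get(clean_question(question))

# Toy stand-in for mp3_files_by_question.pkl; the real file covers the full dataset.
toy_mapper = {'is this a modern interior': 'complementary_0000010.mp3'}

print(lookup_mp3(toy_mapper, 'Is this a modern interior?'))  # complementary_0000010.mp3
print(lookup_mp3(toy_mapper, 'Never-seen question?'))        # None
```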

Image files

These links were taken from the VQA Website

Precomputed features

  • BLIP features (train2014 images): Download
  • BLIP features (val2014 images): Download
  • Speech features of the whole SBVQA 2.0 dataset (Joanna only): Download
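The on-disk format of these precomputed features isn't documented here. Purely as an assumption for illustration, if each download were a pickle mapping an image or audio file name to its feature array, a loader might look like:

```python
import pickle

def load_feature_dict(path):
    # Hypothetical loader: assumes the file is a pickle whose keys are
    # image/audio file names and whose values are precomputed feature arrays.
    # The actual layout of the released files may differ.
    with open(path, 'rb') as f:
        return pickle.load(f)

# features = load_feature_dict('blip_features_train2014.pkl')  # hypothetical file name
```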

Pretrained Models

Coming soon!

Authors

License

This project is licensed under the MIT License. See the LICENSE file for details.

Resources

ToDo

  • speech feature extraction script (NeMo Conformer)
  • noise injection script
  • inference script
  • visual feature extraction script (BLIP ViT)
  • main model training scripts
  • upload find_the_best_speech_encoder.py script
  • our SBVQA 1.0 implementation scripts
  • visualization scripts (GradCAM + attention maps)
  • upload SBVQA 2.0 dataset
  • upload precomputed visual and speech features
  • upload our pretrained models

Citation

@article{alasmary2023sbvqa,
	author={Alasmary, Faris and Al-Ahmadi, Saad},
	journal={IEEE Access},
	title={SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions},
	year={2023},
	volume={11},
	number={},
	pages={140967-140980},
	doi={10.1109/ACCESS.2023.3339537}
}
