
SBVQA 2.0 Official Implementation

This is the official implementation of our paper:

SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions

How to run?

Coming soon!

Data

Audio files

SBVQA 2.0 dataset = SBVQA 1.0 dataset + the complementary spoken questions

  • SBVQA 1.0 original data (identical copy of the data from zted/sbvqa repo): Download
  • The complementary spoken questions: Download

Also, you can download mp3_files_by_question.pkl, a mapper whose keys are the cleaned textual questions and whose values are the corresponding .mp3 file names, from this link.

To load the mapper, use the following code snippet:

import re
import pickle

def clean_question(text):
    # Lowercase the question, strip every character except letters and spaces,
    # then collapse repeated whitespace into single spaces.
    text = text.lower()
    return ' '.join(re.sub(r"[^a-zA-Z ]", "", text, flags=re.UNICODE).split())

with open('mp3_files_by_question.pkl', 'rb') as f:
    mp3_files_by_question_mapper = pickle.load(f)

textual_question = 'Is this a modern interior?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000010.mp3'

textual_question = 'Where can milk be obtained?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000011.mp3'

textual_question = 'What are the payment method of the parking meter?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000012.mp3'
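Note that the mapper is keyed by the cleaned question text, so indexing it with a raw question string raises a KeyError whenever the cleaned form is absent. A minimal sketch of a tolerant lookup (the lookup_mp3 helper is illustrative, not part of this repo; the toy mapper below stands in for the real pickle):

```python
import re

def clean_question(text):
    # Same normalization as above: lowercase, keep letters and spaces only.
    text = text.lower()
    return ' '.join(re.sub(r"[^a-zA-Z ]", "", text, flags=re.UNICODE).split())

def lookup_mp3(mapper, question):
    # Return the .mp3 file name for a question, or None if it is unknown.
    return mapper.get(clean_question(question))

# Toy stand-in for mp3_files_by_question.pkl; the real file covers the full dataset.
toy_mapper = {'is this a modern interior': 'complementary_0000010.mp3'}

print(lookup_mp3(toy_mapper, 'Is this a modern interior?'))  # complementary_0000010.mp3
print(lookup_mp3(toy_mapper, 'Never-seen question?'))        # None
```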

Image files

These links were taken from the VQA Website

Precomputed features

  • BLIP features (train2014 images): Download
  • BLIP features (val2014 images): Download
  • Speech features of the whole SBVQA 2.0 dataset (Joanna only): Download
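The on-disk format of these precomputed features isn't documented here. Purely as an assumption for illustration, if each download were a pickle mapping an image or audio file name to its feature array, a loader might look like:

```python
import pickle

def load_feature_dict(path):
    # Hypothetical loader: assumes the file is a pickle whose keys are
    # image/audio file names and whose values are precomputed feature arrays.
    # The actual layout of the released files may differ.
    with open(path, 'rb') as f:
        return pickle.load(f)

# features = load_feature_dict('blip_features_train2014.pkl')  # hypothetical file name
```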

Pretrained Models

Coming soon!

Authors

License

This project is licensed under the MIT License. See the LICENSE file for details.

Resources

ToDo

  • speech feature extraction script (NeMo Conformer)
  • noise injection script
  • inference script
  • visual feature extraction script (BLIP ViT)
  • main model training scripts
  • upload find_the_best_speech_encoder.py script
  • our SBVQA 1.0 implementation scripts
  • visualization scripts (GradCAM + attention maps)
  • upload SBVQA 2.0 dataset
  • upload precomputed visual and speech features
  • upload our pretrained models

Citation

@article{alasmary2023sbvqa,
	author={Alasmary, Faris and Al-Ahmadi, Saad},
	journal={IEEE Access},
	title={SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions},
	year={2023},
	volume={11},
	number={},
	pages={140967-140980},
	doi={10.1109/ACCESS.2023.3339537}
}
