
VQA in Persian

PyTorch code for my master's thesis, entitled "Visual Question Answering in Persian Language".

Abstract

Visual Question Answering (VQA) is a challenging, recently introduced task that has received increasing attention from the computer vision and natural language processing communities. VQA aims to answer questions about given images. Most VQA progress has focused on resource-rich languages such as English. Furthermore, widely used vision-and-language datasets directly adopt images representative of American or European cultures, resulting in bias. Hence, we introduce ParsVQA-Caps, a Persian benchmark for the Visual Question Answering and Image Captioning tasks. We collect data for each task in two ways: human-based and template-based for VQA, and human-based and web-based for image captioning. In addition, present VQA models are limited to classification-style answers and cannot answer reasoning questions. In this work, we introduce an encoder-decoder model built on pretrained vision-and-language embeddings, which generates multi-word sentences as answers. We utilise the LXMERT, VisualBERT and CLIPfa embedding spaces with three different generative decoder heads: RNNs, attention RNNs and Transformers. Our generative VQA model reveals that even when the model's answer is correct, the accompanying description of the answer can show that the model has misunderstood the question. Finally, we present a mobile application that uses our VQA model as an assistant for the blind, and propose a method to reduce the model's parameters so that it is feasible to run on resource-limited devices.
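
The model described above is an encoder-decoder: a pretrained vision-and-language encoder (LXMERT, VisualBERT or CLIPfa) embeds the image-question pair, and a generative head decodes a multi-word answer token by token. The snippet below is a minimal PyTorch sketch of the Transformer-head variant, not the thesis code itself; the class name, dimensions and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GenerativeVQAHead(nn.Module):
    """Transformer decoder head over frozen vision-and-language features."""

    def __init__(self, vocab_size, d_model=768, n_heads=8, n_layers=3):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, answer_tokens, encoder_states):
        # answer_tokens:  (batch, ans_len) ids of the answer generated so far
        # encoder_states: (batch, src_len, d_model) joint image-question
        #                 features from LXMERT / VisualBERT / CLIPfa
        tgt = self.token_emb(answer_tokens)
        t = tgt.size(1)
        # causal mask so each position only attends to earlier answer tokens
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, encoder_states, tgt_mask=causal)
        return self.lm_head(out)  # (batch, ans_len, vocab_size) logits

# Smoke test with random features standing in for real encoder output.
head = GenerativeVQAHead(vocab_size=30000)
features = torch.randn(2, 56, 768)      # e.g. 36 regions + 20 question tokens
answers = torch.randint(0, 30000, (2, 7))
print(head(answers, features).shape)    # torch.Size([2, 7, 30000])
```

The RNN and attention-RNN heads mentioned in the abstract would follow the same interface, swapping the Transformer decoder for an nn.GRU or nn.LSTM, optionally with attention over encoder_states.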


