Visual Question Answering Using CLIP + LSTM

CLIP + LSTM architecture.

The visual question-answering problem can be described as "asking our computer to reply to questions about a particular image." In this project, a CLIP + LSTM architecture is used to tackle the problem. The image and text encoders of CLIP encode the given image and question, respectively. The concatenated image-text representation from CLIP is combined with the vectorized answer text via the Hadamard product before being fed to the LSTM. The answer to the question is then generated autoregressively. The VizWiz-VQA dataset is utilized to train, validate, and test the model: its training set is split 99:1 for the training and validation phases, and its validation set is employed for testing. The SQuAD and BLEU metrics are utilized to gauge the performance of the model quantitatively. At inference time, the test set of VizWiz-VQA is leveraged.
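To make the image-question-answer flow concrete, here is a minimal PyTorch sketch of the idea, assuming the openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers (512-dimensional projections); the class name, the frozen CLIP weights, the hidden size, and the answer vocabulary handling are illustrative assumptions rather than the notebook's exact implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel


class CLIPLSTMVQA(nn.Module):
    """Sketch of a CLIP + LSTM VQA model (illustrative, not the repo's exact code)."""

    def __init__(self, vocab_size, clip_name="openai/clip-vit-base-patch32", hidden_size=512):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():
            p.requires_grad = False  # assumption: CLIP is kept frozen
        fused_dim = 2 * self.clip.config.projection_dim  # [image ; question] concatenation
        # Answer tokens are embedded to the same width as the fused CLIP features
        # so that the Hadamard (element-wise) product is well defined.
        self.answer_embedding = nn.Embedding(vocab_size, fused_dim)
        self.lstm = nn.LSTM(fused_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, vocab_size)

    def forward(self, pixel_values, question_ids, question_mask, answer_ids):
        # 1) Encode the image and the question with CLIP's projection heads.
        img = self.clip.get_image_features(pixel_values=pixel_values)       # (B, 512)
        txt = self.clip.get_text_features(input_ids=question_ids,
                                          attention_mask=question_mask)     # (B, 512)
        fused = torch.cat([img, txt], dim=-1)                                # (B, 1024)

        # 2) Hadamard product between the fused representation and every embedded
        #    answer token, then decode the sequence with the LSTM.
        ans = self.answer_embedding(answer_ids)                              # (B, T, 1024)
        ans = ans * fused.unsqueeze(1)                                       # broadcast over T
        out, _ = self.lstm(ans)                                              # (B, T, hidden)
        return self.classifier(out)                                          # next-token logits
```

At inference time the same path would be run token by token: start from a begin-of-answer token, pick the next token from the logits, and feed it back in until an end token appears, which is the autoregressive generation described above.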

Experiment

Give yourself a delightful excursion through the experiment's code provided in this notebook.

Result

Quantitative Result

Here are the evaluation metric results of the model.

Metric              Score
BLEU (1-gram)       44.67%
SQuAD Exact Match   44.43%
SQuAD F1-score      44.83%
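For reference, scores of this kind can be computed with the Hugging Face evaluate library; the snippet below is only an illustration on made-up predictions and references, and the notebook itself may compute the metrics with a different package.

```python
import evaluate

# Hypothetical model answers and ground-truth answers (not real outputs).
predictions = ["yes", "a red coffee mug"]
references = ["yes", "a red mug"]

# SQuAD metric: exact match and token-level F1.
squad = evaluate.load("squad")
squad_preds = [{"id": str(i), "prediction_text": p} for i, p in enumerate(predictions)]
squad_refs = [{"id": str(i), "answers": {"text": [r], "answer_start": [0]}}
              for i, r in enumerate(references)]
print(squad.compute(predictions=squad_preds, references=squad_refs))
# -> {'exact_match': 50.0, 'f1': ...}

# BLEU restricted to unigrams (1-gram), matching the table above.
bleu = evaluate.load("bleu")
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references],
                   max_order=1))
```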

Loss Curve

Loss curves of the CLIP + LSTM model on the train and validation sets.

Qualitative Result

The following image exhibits the collated results of the VQA model.

A collection of qualitative results containing question-answer-image triads.

Credit
