LinWeizheDragon / Retrieval-Augmented-Visual-Question-Answering

This is the official repository for Retrieval Augmented Visual Question Answering

About image features

yao-jz opened this issue · comments

commented

Hello!

I am wondering whether you have tried using processed image features for this task before. If so, do you know how the model performs with image features?

Thank you very much!

Hi,
One contribution of this work is to use textual features in place of image features. This is because, in order to leverage image features, one would need to align the image features from another vision model (such as ResNet) with the latent space of the language model. In this RA-VQA framework, combining ResNet features with textual features on the encoder side does not lead to any performance boost. As for other approaches, e.g. mapping networks, they typically require pre-training for the alignment.
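For concreteness, the kind of encoder-side combination I am referring to looks roughly like the sketch below. This is illustrative only, not code from this repository; the projection layer and feature dimensions are assumptions. Without pre-training, the projected ResNet features live in a different space from the T5 token embeddings, which is the alignment problem mentioned above.

```python
import torch
import torch.nn as nn
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# 2048 = pooled ResNet-50 feature dimension (assumption); d_model = 768 for t5-base
img_proj = nn.Linear(2048, model.config.d_model)

def encode_with_image(question: str, resnet_feat: torch.Tensor):
    """resnet_feat: (1, 2048) pooled features from a frozen ResNet."""
    tokens = tokenizer(question, return_tensors="pt")
    text_emb = model.get_input_embeddings()(tokens.input_ids)   # (1, T, d_model)
    img_emb = img_proj(resnet_feat).unsqueeze(1)                 # (1, 1, d_model)
    inputs_embeds = torch.cat([img_emb, text_emb], dim=1)        # prepend an image "token"
    attention_mask = torch.cat(
        [torch.ones(img_emb.shape[:2], dtype=tokens.attention_mask.dtype),
         tokens.attention_mask],
        dim=1,
    )
    # The encoder now sees unaligned image embeddings next to text embeddings.
    return model.encoder(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```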
Hope this helps!

commented

Thanks for your reply!

So do you mean that if I use an image encoder and a text encoder to get the image and text embeddings separately, and then use a cross-attention/self-attention network to fuse them, that attention network should be pre-trained (maybe on other tasks) for the alignment?

If I just concatenate them and send them to the decoder side (without fusion), will it be difficult for the decoder to learn the alignment?

Thanks!

Hi,

If you concatenate the combined embeddings (image embedding + text embedding) and pass them to the encoder, pre-training is needed to align the image embeddings with the text embeddings.
If you concatenate the output hidden states of the vision model and the text encoder, you will need to pre-train the text decoder so that it understands the input from the vision model. Otherwise the performance will not improve (it may even lead to collapse in training).
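To illustrate the second option, here is a simplified sketch (module names and dimensions are assumptions, not the actual implementation): the vision model's hidden states are concatenated with the text encoder's hidden states, and the T5 decoder cross-attends over both. Since the decoder has never seen vision hidden states during its pre-training, it needs to be pre-trained on this fused input.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, ViTModel
from transformers.modeling_outputs import BaseModelOutput

t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
# Project ViT hidden states into the T5 hidden size (dimensions are assumptions)
vision_proj = nn.Linear(vit.config.hidden_size, t5.config.d_model)

def forward_with_fused_states(input_ids, attention_mask, pixel_values, labels):
    text_h = t5.encoder(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state
    vis_h = vision_proj(vit(pixel_values=pixel_values).last_hidden_state)
    fused = torch.cat([vis_h, text_h], dim=1)                      # (B, P+T, d_model)
    fused_mask = torch.cat(
        [torch.ones(vis_h.shape[:2], dtype=attention_mask.dtype), attention_mask],
        dim=1,
    )
    # The decoder cross-attends over both vision and text hidden states.
    return t5(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
              attention_mask=fused_mask, labels=labels)
```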

commented

Thanks for your explanation!

Have you tried designing pretraining tasks and adding the image features for answer generation or document retrieval?

commented

Hi,

I added visual features to the model, designed two pretraining tasks to learn the alignment and finetuned the model on OK-VQA. It turned out that it could improve the performance a little.

Hi,

Sorry, I think I accidentally missed your last message. Thanks for letting me know. Yes, as I noted, it is possible to add image features to the query encoder by learning an alignment. Just curious: what is the source of your image features (the VinVL features, or features from another vision model)? Did you combine the image features with the original DPR query hidden states before the cross product? How much does it improve performance?
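(Just to be clear about what I mean by combining image features with the DPR query hidden states before the cross product, here is a rough sketch; all module names and dimensions are placeholders, not code from this repository.)

```python
import torch
import torch.nn as nn
from transformers import DPRQuestionEncoder, DPRContextEncoder

q_encoder = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
img_proj = nn.Linear(768, 768)     # e.g. ViT CLS feature -> DPR hidden size (assumed)
fuse = nn.Linear(768 * 2, 768)     # simple concatenate-and-project fusion (assumed)

def retrieval_scores(query_ids, query_mask, image_feat, doc_ids, doc_mask):
    q = q_encoder(input_ids=query_ids, attention_mask=query_mask).pooler_output  # (B, 768)
    q = fuse(torch.cat([q, img_proj(image_feat)], dim=-1))     # fuse image into the query
    d = ctx_encoder(input_ids=doc_ids, attention_mask=doc_mask).pooler_output    # (N, 768)
    return q @ d.T   # in-batch dot-product scores, as in standard DPR training
```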
Of course, feel free to keep it as your research secret. I am asking since I am curious about the pretraining you applied to make it work.

Best,
Weizhe

commented

I use ViT image features for answer generation, but not for retrieval. I use 150K examples for each pretraining task, just to validate the pretraining. I will add more pretraining data and try to pretrain the DPR later.

Thanks for sharing! That's very interesting. Did you use a mapping network?
This sounds very reasonable. I believe your result suggests that ViT features are complementary to the image-to-text transforms!

commented

I use an image encoder, a text encoder, two additional cross-attention networks, and one self-attention network for modality combination. The decoder is just a T5 decoder.
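Roughly, the fusion looks like the following sketch (simplified; the layer sizes and head counts here are just placeholders, not my exact configuration):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        # text-to-image and image-to-text cross-attention, then self-attention
        self.txt2img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, text_h, image_h):
        t, _ = self.txt2img(query=text_h, key=image_h, value=image_h)  # text attends to image
        i, _ = self.img2txt(query=image_h, key=text_h, value=text_h)   # image attends to text
        fused = torch.cat([t, i], dim=1)                               # (B, T+P, d_model)
        return self.self_attn(fused)                                   # self-attention over fused sequence

# The fused sequence is then passed to a standard T5 decoder.
```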

But I didn't use the ViT features for Image-to-Text transform. What do you mean? Thanks!

No worries. I simply mean that adding the image encoder and the additional attention networks improves over the original framework, which only uses the image-to-text transform. Great work!