LinWeizheDragon / Retrieval-Augmented-Visual-Question-Answering

This is the official repository for Retrieval Augmented Visual Question Answering

About image features

yao-jz opened this issue · comments

commented

Hello!

I am wondering whether you have tried using processed image features for this task before. If so, do you know how the model performs with image features?

Thank you very much!

Hi,
One contribution of this work is to use textual features in place of image features. This is because, in order to leverage image features, one would need to align the image features from another vision model (such as ResNet) with the latent space of the language model. In this RA-VQA framework, combining ResNet features with textual features on the encoder side does not lead to any performance boost. As for other approaches, e.g. mapping networks, they typically require pre-training for the alignment.
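For concreteness, the kind of encoder-side combination I am referring to looks roughly like the sketch below. This is illustrative only, not code from this repository; the projection layer and feature dimensions are assumptions. Without pre-training, the projected ResNet features live in a different space from the T5 token embeddings, which is the alignment problem mentioned above.

```python
import torch
import torch.nn as nn
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# 2048 = pooled ResNet-50 feature dimension (assumption); d_model = 768 for t5-base
img_proj = nn.Linear(2048, model.config.d_model)

def encode_with_image(question: str, resnet_feat: torch.Tensor):
    """resnet_feat: (1, 2048) pooled features from a frozen ResNet."""
    tokens = tokenizer(question, return_tensors="pt")
    text_emb = model.get_input_embeddings()(tokens.input_ids)   # (1, T, d_model)
    img_emb = img_proj(resnet_feat).unsqueeze(1)                 # (1, 1, d_model)
    inputs_embeds = torch.cat([img_emb, text_emb], dim=1)        # prepend an image "token"
    attention_mask = torch.cat(
        [torch.ones(img_emb.shape[:2], dtype=tokens.attention_mask.dtype),
         tokens.attention_mask],
        dim=1,
    )
    # The encoder now sees unaligned image embeddings next to text embeddings.
    return model.encoder(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```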
Hope this helps!

commented

Thanks for your reply!

So do you mean that if I use an image encoder and a text encoder to get the image and text embeddings separately, and then use a cross-attention/self-attention network to fuse them, that attention network should be pre-trained (maybe on other tasks) for the alignment?

If I just concatenate them and send them to the decoder side (without fusion), will it be difficult for the decoder to learn the alignment?

Thanks!

Hi,

If you concatenate the combined embeddings (image embedding + text embedding) and pass them to the encoder, pre-training is needed to align the image embeddings with the text embeddings.
If you concatenate the output hidden states of the vision model and the text encoder, you will need to pre-train the text decoder so that it understands the input from the vision model. Otherwise the performance will not improve (it may even lead to collapse in training).
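To illustrate the second option, here is a simplified sketch (module names and dimensions are assumptions, not the actual implementation): the vision model's hidden states are concatenated with the text encoder's hidden states, and the T5 decoder cross-attends over both. Since the decoder has never seen vision hidden states during its pre-training, it needs to be pre-trained on this fused input.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, ViTModel
from transformers.modeling_outputs import BaseModelOutput

t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
# Project ViT hidden states into the T5 hidden size (dimensions are assumptions)
vision_proj = nn.Linear(vit.config.hidden_size, t5.config.d_model)

def forward_with_fused_states(input_ids, attention_mask, pixel_values, labels):
    text_h = t5.encoder(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state
    vis_h = vision_proj(vit(pixel_values=pixel_values).last_hidden_state)
    fused = torch.cat([vis_h, text_h], dim=1)                      # (B, P+T, d_model)
    fused_mask = torch.cat(
        [torch.ones(vis_h.shape[:2], dtype=attention_mask.dtype), attention_mask],
        dim=1,
    )
    # The decoder cross-attends over both vision and text hidden states.
    return t5(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
              attention_mask=fused_mask, labels=labels)
```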

commented

Thanks for your explanation!

Have you tried designing pretraining tasks and adding the image features for answer generation or document retrieval?

commented

Hi,

I added visual features to the model, designed two pretraining tasks to learn the alignment and finetuned the model on OK-VQA. It turned out that it could improve the performance a little.

Hi,

Sorry, I think I accidentally missed your last message. Thanks for letting me know. Yes, as I noted, it is possible to add image features to the query encoder by learning an alignment. Just curious: what is the source of your image features (the VinVL features, or features from another vision model)? Did you combine the image features with the original DPR query hidden states before the cross product? How much does it improve performance?
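(Just to be clear about what I mean by combining image features with the DPR query hidden states before the cross product, here is a rough sketch; all module names and dimensions are placeholders, not code from this repository.)

```python
import torch
import torch.nn as nn
from transformers import DPRQuestionEncoder, DPRContextEncoder

q_encoder = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
img_proj = nn.Linear(768, 768)     # e.g. ViT CLS feature -> DPR hidden size (assumed)
fuse = nn.Linear(768 * 2, 768)     # simple concatenate-and-project fusion (assumed)

def retrieval_scores(query_ids, query_mask, image_feat, doc_ids, doc_mask):
    q = q_encoder(input_ids=query_ids, attention_mask=query_mask).pooler_output  # (B, 768)
    q = fuse(torch.cat([q, img_proj(image_feat)], dim=-1))     # fuse image into the query
    d = ctx_encoder(input_ids=doc_ids, attention_mask=doc_mask).pooler_output    # (N, 768)
    return q @ d.T   # in-batch dot-product scores, as in standard DPR training
```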
Of course, feel free to keep it as your research secret. I am asking since I am curious about the pretraining you applied to make it work.

Best,
Weizhe

commented

I use ViT image features for answer generation, but not for retrieval. I use 150K examples for each pretraining task, just to validate the pretraining. I will add more pretraining data and try to pretrain the DPR later.

Thanks for sharing! That's very interesting. Did you use a mapping network?
This sounds very reasonable. I believe your result suggests that ViT features are complementary to the image-to-text transforms!

commented

I use an image encoder, a text encoder, two additional cross-attention networks, and one self-attention network for modality combination. The decoder is just a T5 decoder.
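Roughly, the fusion looks like the following sketch (simplified; the layer sizes and head counts here are just placeholders, not my exact configuration):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        # text-to-image and image-to-text cross-attention, then self-attention
        self.txt2img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, text_h, image_h):
        t, _ = self.txt2img(query=text_h, key=image_h, value=image_h)  # text attends to image
        i, _ = self.img2txt(query=image_h, key=text_h, value=text_h)   # image attends to text
        fused = torch.cat([t, i], dim=1)                               # (B, T+P, d_model)
        return self.self_attn(fused)                                   # self-attention over fused sequence

# The fused sequence is then passed to a standard T5 decoder.
```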

But I didn't use the ViT features for Image-to-Text transform. What do you mean? Thanks!

No worries. I simply mean that adding the image encoder and the additional attention networks improves over the original framework, which only uses the image-to-text transform. Great work!