uta-smile / TCL

Code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

About the XBert

TitleZ99 opened this issue · comments

Hi, thanks for this wonderful work. I am confused about the CrossAttention module. In the XBERT code, when layer_num >= 6 the text_encoder switches to cross-attention: it first does self-attention on text_embeds and then cross-attention between text_embeds and image_embeds. Why does it do self-attention on text_embeds before the cross-attention? Could it do self-attention on image_embeds first and then the cross-attention? Or could it do only the cross-attention? Please help me with this when it is convenient for you. Thank you again!

commented

Hi, thanks for your interest in our paper. The reason is that the text encoder is the first six layers of BERT, while the fusion encoder is the last six layers of BERT. We do self-attention on text_embeds so that the text input goes through a total of 12 layers, just as in BERT.
Q1: "Can it do self-attention on image_embeds first and then do cross-attention?"
A1: I don't think that would be fair, since the image input has already gone through the 12-layer ViT.

Q2: "Can it only do the cross-attention?"
A2: This would hurt MLM learning, since the text input would only go through 6 self-attention layers in total.
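For reference, here is a minimal sketch of what one fusion layer does under this design: self-attention over the text embeddings, followed by cross-attention where the text acts as queries and the ViT image embeddings as keys/values. This is not the actual xbert.py code; `FusionLayer`, its arguments, and the use of `nn.MultiheadAttention` are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Hypothetical sketch of one fusion layer (layers 6-11 of the 12-layer BERT):
    self-attention on text_embeds, then cross-attention with image_embeds."""

    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.norm3 = nn.LayerNorm(hidden_size)

    def forward(self, text_embeds, image_embeds):
        # Self-attention on text_embeds: keeps the text path at 12 BERT layers in total.
        x, _ = self.self_attn(text_embeds, text_embeds, text_embeds)
        text_embeds = self.norm1(text_embeds + x)
        # Cross-attention: text queries attend to the image embeddings, which already
        # come from the 12-layer ViT, so no extra self-attention is applied to them.
        x, _ = self.cross_attn(text_embeds, image_embeds, image_embeds)
        text_embeds = self.norm2(text_embeds + x)
        # Feed-forward block with residual connection.
        text_embeds = self.norm3(text_embeds + self.ffn(text_embeds))
        return text_embeds
```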

Please feel free to let me know if you need any further information. Thanks.

Sorry for the late reply, and thank you so much for your patience. Now I understand the VLP model better. Thanks again! Respect!