About the XBert
TitleZ99 opened this issue · comments
Hi, thanks for this wonderful work. I am confused about the CrossAttention module. In the code of XBERT, when layer_num >= 6 the text_encoder switches to cross-attention; however, it first does self-attention on text_embeds and then does cross-attention between text_embeds and image_embeds. I am confused about why it does self-attention on text_embeds before the cross-attention. Could it do self-attention on image_embeds first and then cross-attention? Or could it do only the cross-attention? Please help me with this problem when it is convenient for you. Thank you again!
Hi, thanks for your interest in our paper. The reason is that the text encoder is the first six layers of BERT, while the fusion encoder is the last six layers of BERT. We do self-attention on text_embeds to ensure the text input goes through a total of 12 layers, just as in BERT.
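To make the layer ordering concrete, here is a minimal PyTorch sketch of one fusion layer, showing self-attention on the text stream followed by cross-attention to the image stream. This is an illustrative simplification, not the actual XBERT code: the class name `FusionLayer`, the use of `nn.MultiheadAttention`, and the omitted feed-forward sublayer are all assumptions for brevity.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Hypothetical sketch of one fusion-encoder layer:
    self-attention on text, then cross-attention text -> image."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text_embeds, image_embeds):
        # 1) Self-attention over the text stream (query, key, value are all text),
        #    so text still gets its BERT-style self-attention in these layers.
        h, _ = self.self_attn(text_embeds, text_embeds, text_embeds)
        text_embeds = self.norm1(text_embeds + h)
        # 2) Cross-attention: text tokens are queries; image tokens are keys/values.
        h, _ = self.cross_attn(text_embeds, image_embeds, image_embeds)
        return self.norm2(text_embeds + h)

text = torch.randn(2, 30, 768)    # (batch, text_len, hidden_dim)
image = torch.randn(2, 197, 768)  # (batch, num_patches + 1, hidden_dim)
out = FusionLayer()(text, image)
print(out.shape)  # torch.Size([2, 30, 768])
```

The output keeps the text sequence length: cross-attention only lets text queries read from the image tokens, while the image stream itself is unchanged (it already went through the full ViT).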
Q1: "Can it do self-attention on image_embeds first and then do cross-attention?".
A1: I don't think that would be fair, since the image input has already gone through the 12-layer ViT.
Q2: "Can it only do the cross-attention?"
A2: This would hurt MLM learning, since the text input would then go through only 6 self-attention layers in total.
Please feel free to let me know if you need any more information. Thanks.
Sorry for the late reply. Thank you so much for your patience. Now I understand the VLP model better. Thanks again! Respect!