uta-smile / TCL

code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022


Why does MLM work?

ksblk2116 opened this issue

Thank you for the excellent work and for releasing the code.
After reading model_pretrain.py, I have a question: you first encode the text and image to get the features for the contrastive loss, and then you mask the features encoded by text_encoder and send them to the fusion layer.
To the best of my knowledge, after the text_encoder each token's features have already attended to every other token, so I think the masking strategy doesn't work.
Could you answer this question, or am I missing some details?

commented

Hi, thanks for your interest.
For ITM, no mask is applied to the encoded text features. Note that attention_mask = text.attention_mask is generated by the tokenizer, so the model does not attend to the padding tokens in each sentence.
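For illustration, here is a minimal sketch (assuming a Hugging Face BERT tokenizer, not code from this repo) of how the tokenizer produces attention_mask with zeros at the padding positions:

```python
from transformers import BertTokenizer

# Hypothetical example: the tokenizer pads the batch and returns attention_mask
# with 1 for real tokens and 0 for padding, so attention skips the padding.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = tokenizer(["a dog runs", "a cat"], padding="longest", return_tensors="pt")

print(text.input_ids)       # padded token ids, both rows the same length
print(text.attention_mask)  # e.g. tensor([[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]])
```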

For MLM, we mask the text input (the token ids) before it goes through the text encoder, as shown in model_pretrain.py:

input_ids, labels = self.mask(input_ids, self.text_encoder.config.vocab_size, image.device, targets=labels, ...)
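
To make the difference concrete, below is a minimal BERT-style masking sketch (my own illustration, not the repo's actual self.mask implementation): the mask is applied to the token ids before the text encoder, so a masked position never sees its own content and the model has to recover it from the surrounding context and the image.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, pad_token_id=0, mlm_prob=0.15):
    """Hypothetical BERT-style masking: choose ~15% of the non-padding tokens,
    replace most of them with [MASK], and compute the MLM loss only there."""
    labels = input_ids.clone()
    prob = torch.full(labels.shape, mlm_prob)
    prob[input_ids == pad_token_id] = 0.0                  # never mask padding
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100                                  # ignore unmasked positions in the loss

    # 80% of masked tokens -> [MASK], 10% -> random token, 10% -> left unchanged
    use_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[use_mask] = mask_token_id
    use_rand = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~use_mask
    input_ids[use_rand] = torch.randint(vocab_size, labels.shape)[use_rand]

    # these masked ids are what get encoded and fused, not the clean features
    return input_ids, labels
```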

Please let me know if you might have further questions or need additional information. Thanks.