uta-smile / TCL

code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022


Why does MLM work?

ksblk2116 opened this issue

Thank you for the excellent work and for releasing the code.
After reading model_pretrain.py, I have a question: you first encode the text and image to get the features for the contrastive loss, and then you mask the features encoded by text_encoder and send them to the fusion layer.
To the best of my knowledge, after the text_encoder each token's features have already attended to every other token, so I think the masking strategy doesn't work.
Could you answer this question, or am I missing some details?

commented

Hi, thanks for your interest.
For ITM, no mask is applied to the encoded text features. Note that attention_mask = text.attention_mask is generated by the tokenizer, so the model does not attend to the padding tokens in each sentence.
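For illustration, here is a minimal sketch (assuming a Hugging Face BERT tokenizer, not code from this repo) of how the tokenizer produces attention_mask with zeros at the padding positions:

```python
from transformers import BertTokenizer

# Hypothetical example: the tokenizer pads the batch and returns attention_mask
# with 1 for real tokens and 0 for padding, so attention skips the padding.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = tokenizer(["a dog runs", "a cat"], padding="longest", return_tensors="pt")

print(text.input_ids)       # padded token ids, both rows the same length
print(text.attention_mask)  # e.g. tensor([[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]])
```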

For MLM, we mask the text input (the token ids) before it goes through the text encoder, as shown in model_pretrain.py:

input_ids, labels = self.mask(input_ids, self.text_encoder.config.vocab_size, image.device, targets=labels, ...)
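
To make the difference concrete, below is a minimal BERT-style masking sketch (my own illustration, not the repo's actual self.mask implementation): the mask is applied to the token ids before the text encoder, so a masked position never sees its own content and the model has to recover it from the surrounding context and the image.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, pad_token_id=0, mlm_prob=0.15):
    """Hypothetical BERT-style masking: choose ~15% of the non-padding tokens,
    replace most of them with [MASK], and compute the MLM loss only there."""
    labels = input_ids.clone()
    prob = torch.full(labels.shape, mlm_prob)
    prob[input_ids == pad_token_id] = 0.0                  # never mask padding
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100                                  # ignore unmasked positions in the loss

    # 80% of masked tokens -> [MASK], 10% -> random token, 10% -> left unchanged
    use_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[use_mask] = mask_token_id
    use_rand = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~use_mask
    input_ids[use_rand] = torch.randint(vocab_size, labels.shape)[use_rand]

    # these masked ids are what get encoded and fused, not the clean features
    return input_ids, labels
```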

Please let me know if you might have further questions or need additional information. Thanks.