uta-smile / TCL

Code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

About the loss_distill

sushizixin opened this issue · comments

commented

Hi, thank you for the excellent work and the release of the code!

I am a little confused about how loss_distill is calculated in line 1429 of xbert.py, shown below:

                  loss_distill = -torch.sum(F.log_softmax(prediction_scores, dim=1)*soft_labels,dim=-1)

I think both prediction_scores and soft_labels have the shape (batch_size, seq_len, vocab_size), and F.softmax is applied over the last dimension for soft_labels in line 237 of model_pretrain.py, as shown below:

                  mlm_output = self.text_encoder(input_ids, 
                                                 attention_mask = text.attention_mask,
                                                 encoder_hidden_states = image_embeds,
                                                 encoder_attention_mask = image_atts,      
                                                 return_dict = True,
                                                 labels = labels,   
                                                 soft_labels = F.softmax(logits_m,dim=-1),
                                                 alpha = alpha
                                                )

Why is F.log_softmax applied over the second dimension (the seq_len dimension) for prediction_scores?
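
For concreteness, here is a minimal standalone sketch (not from the repository; the shapes are the ones quoted above) showing that dim=1 normalizes over seq_len while dim=-1 normalizes over vocab_size, so only the latter matches the soft labels:

    import torch
    import torch.nn.functional as F

    # Hypothetical tensors with the shapes from the issue: (batch_size, seq_len, vocab_size)
    batch_size, seq_len, vocab_size = 2, 11, 30522
    prediction_scores = torch.randn(batch_size, seq_len, vocab_size)
    logits_m = torch.randn(batch_size, seq_len, vocab_size)

    # Soft labels are normalized over the vocabulary dimension, as in model_pretrain.py
    soft_labels = F.softmax(logits_m, dim=-1)

    # dim=1 normalizes over seq_len; dim=-1 normalizes over vocab_size
    log_probs_seq = F.log_softmax(prediction_scores, dim=1)
    log_probs_vocab = F.log_softmax(prediction_scores, dim=-1)

    # Probabilities over the vocabulary sum to 1 only in the dim=-1 case
    print(log_probs_vocab.exp().sum(dim=-1))  # ~1.0 everywhere
    print(log_probs_seq.exp().sum(dim=-1))    # generally not 1.0

    # The two versions of the distillation loss therefore differ
    loss_dim1 = -torch.sum(log_probs_seq * soft_labels, dim=-1)
    loss_dim_last = -torch.sum(log_probs_vocab * soft_labels, dim=-1)
    print(loss_dim1.shape, loss_dim_last.shape)  # both torch.Size([2, 11])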

commented

Hi, thanks for your interest in our work.
Have you printed out the sizes of prediction_scores and soft_labels?

commented

Hi, thanks for your response.
I printed out their sizes after setting batch_size in Pretrain.yaml to 2, and found that they have the same shape, e.g. (2, 11, 30522).

commented

Thanks for pointing out this bug. I have fixed it and merged the fix into the main branch. More details can be found in this commit: 74a3e4f
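
For reference, a sketch of what the corrected line looks like, with log_softmax applied over the vocabulary dimension so that it matches soft_labels (see commit 74a3e4f for the exact change):

    # log_softmax over dim=-1 (vocab_size), matching F.softmax(logits_m, dim=-1)
    loss_distill = -torch.sum(F.log_softmax(prediction_scores, dim=-1) * soft_labels, dim=-1)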