About the loss_distill
sushizixin opened this issue
Hi, thank you for the excellent work and for releasing the code!

I am a little confused about how `loss_distill` is calculated in line 1429 of `xbert.py`:

```python
loss_distill = -torch.sum(F.log_softmax(prediction_scores, dim=1) * soft_labels, dim=-1)
```
I think both `prediction_scores` and `soft_labels` would have size `(batch_size, seq_len, vocab_size)`, and `F.softmax` is applied over the last dimension for `soft_labels` in line 237 of `model_pretrain.py`:

```python
mlm_output = self.text_encoder(input_ids,
                               attention_mask = text.attention_mask,
                               encoder_hidden_states = image_embeds,
                               encoder_attention_mask = image_atts,
                               return_dict = True,
                               labels = labels,
                               soft_labels = F.softmax(logits_m, dim=-1),
                               alpha = alpha
                               )
```
Why is `F.log_softmax` applied over the second dimension (the `seq_len` dimension) for `prediction_scores`?
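To make the question concrete, here is a minimal sketch (with hypothetical toy sizes) of what each choice of `dim` normalizes over:

```python
import torch
import torch.nn.functional as F

# Toy sizes only (hypothetical), matching the shape discussed above.
batch_size, seq_len, vocab_size = 2, 11, 30522
prediction_scores = torch.randn(batch_size, seq_len, vocab_size)

# dim=1 normalizes across the seq_len dimension: for every (batch, vocab)
# pair, the probabilities sum to 1 over the 11 token positions.
over_seq = F.log_softmax(prediction_scores, dim=1)
print(over_seq.exp().sum(dim=1)[0, :5])    # tensor([1., 1., 1., 1., 1.])

# dim=-1 normalizes across the vocabulary, which is what
# F.softmax(logits_m, dim=-1) does for soft_labels.
over_vocab = F.log_softmax(prediction_scores, dim=-1)
print(over_vocab.exp().sum(dim=-1)[0])     # tensor of 11 ones
```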
Hi, thanks for your interest in our work. Have you printed out the sizes of `prediction_scores` and `soft_labels`?
Hi, thanks for your response. I printed out their sizes by setting `batch_size` in `Pretrain.yaml` to 2 and found that they have the same shape, e.g. `(2, 11, 30522)`.
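For reference, this is the variant I had expected, with both softmaxes taken over the vocabulary dimension; this is a sketch of my assumption, not the released code:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy tensors with the shapes printed above: (2, 11, 30522).
prediction_scores = torch.randn(2, 11, 30522)
logits_m = torch.randn(2, 11, 30522)       # momentum-model logits (assumed)
soft_labels = F.softmax(logits_m, dim=-1)

# Expected form (my assumption): cross-entropy between the two vocabulary
# distributions, giving one distillation value per token position.
loss_distill = -torch.sum(F.log_softmax(prediction_scores, dim=-1) * soft_labels, dim=-1)
print(loss_distill.shape)  # torch.Size([2, 11])
```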