The relative weight of the MLM loss compared to the contrastive loss
hyleemindslab opened this issue
In the paper, Equation 7 indicates that both the MLM and contrastive losses are divided by the effective batch size, whose value would be equal to `2 * per_device_train_batch_size * world_size`. But the MLM loss calculation code (line 227) seems to divide the MLM loss by `per_device_train_batch_size * world_size`, since `CoCondenserDataset`'s `__getitem__` method returns two spans belonging to the same document, thereby making the actual batch dimension larger by a factor of 2.
I feel like I am missing something. Could you please help me out?
Lines 219 to 230 in de9c257
Lines 177 to 179 in de9c257
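To make the factor of two concrete, here is a minimal sketch of the two-span batching described above. This is not the repository's actual code; only the class name `CoCondenserDataset` and the two-spans-per-document behavior come from the discussion, while the constructor arguments (`documents`, `sample_spans`) are assumed for illustration:

```python
from torch.utils.data import Dataset

class CoCondenserDataset(Dataset):
    """Sketch: one item per document, two spans returned per item."""

    def __init__(self, documents, sample_spans):
        self.documents = documents        # list of tokenized documents
        self.sample_spans = sample_spans  # callable: document -> (span_a, span_b)

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, idx):
        # Two spans drawn from the same document. After collation, a per-device
        # batch of B documents therefore contains 2 * B spans, so any average
        # taken over the span dimension is over 2 * B items, not B.
        span_a, span_b = self.sample_spans(self.documents[idx])
        return [span_a, span_b]
```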
Line 227 is for gradient accumulation scaling, not for averaging across batch examples. Check out the trainer code:
Lines 161 to 185 in de9c257
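For context, the gradient-accumulation scaling being referred to typically looks like the following. This is only a sketch in the style of a Hugging Face `Trainer.training_step` override, not the repository's exact code:

```python
def training_step(self, model, inputs):
    model.train()
    loss = self.compute_loss(model, inputs)

    # Divide by the number of accumulation steps so that the gradients summed
    # over one accumulation cycle match a single update on the full batch.
    # This is a per-step scaling, not an average over batch examples.
    if self.args.gradient_accumulation_steps > 1:
        loss = loss / self.args.gradient_accumulation_steps

    loss.backward()
    return loss.detach()
```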
Yes, I just expected the MLM loss for a sub-batch to be scaled by (# of spans in the sub-batch / # of spans in the local batch), so that the final gradient is w.r.t. the loss averaged across the spans in the batch, which I thought would be written as `loss = loss * (float(hiddens.size(0)) / (2 * self.train_args.per_device_train_batch_size))`. But I'm starting to think it may not be that important.
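For illustration, the suggested rescaling would behave like this numerically (a standalone sketch; the batch size of 8 and sub-batch size of 4 are arbitrary example values, and `sub_batch_spans` stands in for `hiddens.size(0)`):

```python
# Example: per_device_train_batch_size = 8 documents, so the local batch
# holds 2 * 8 = 16 spans in total.
per_device_train_batch_size = 8

# Suppose this sub-batch contains 4 spans (hiddens.size(0) in the real code).
sub_batch_spans = 4

# Proposed scale = spans in sub-batch / spans in local batch = 4 / 16 = 0.25.
# If each sub-batch loss is already the mean over its own spans, summing the
# scaled per-sub-batch losses yields the mean MLM loss over all 16 spans in
# the local batch, matching the normalization described in Equation 7.
scale = float(sub_batch_spans) / (2 * per_device_train_batch_size)
print(scale)  # 0.25
```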
Right, there's a factor of 2. We didn't actually experiment a lot with how to interpolate; the current code seems to work fine. As training progresses, with momentum stabilizing in the optimizer, I also expect that it won't be super important.