label smoothing mistake
youngsheen opened this issue · comments
When computing the label smoothing loss, the logit_loss is only multiplied by weight_t and is missing the 1/(t+1) factor.
Hi, thanks for your interest!
Technically, both 1/(t+1) and weight_t are associated only with the diffusion ELBO objective, not with the label smoothing loss. It is therefore reasonable to scale the label smoothing loss (which is often used as an auxiliary objective for regularization) with an arbitrary weighting. We conducted various ablations in our preliminary experiments and found that multiplying the label smoothing loss by weight_t alone yields the best performance for translation tasks.
However, it could be true that this choice may not be optimal in all cases and that carefully tuning the weighting in a task-specific manner may lead to better performance.
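To make the distinction concrete, here is a minimal per-token sketch of the two weightings being discussed. The function name, argument names, and the uniform-prior form of the smoothing term are illustrative assumptions, not the repo's actual API:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def smoothed_diffusion_loss(logits, target, weight_t, t, eps=0.1):
    """Hypothetical sketch: the ELBO cross-entropy term is scaled by
    weight_t * 1/(t+1), while the label-smoothing regularizer is scaled
    by weight_t only, as described above."""
    lp = log_softmax(logits)
    nll = -lp[target]                      # cross-entropy with the gold token
    uniform = -sum(lp) / len(lp)           # label-smoothing (uniform) term
    elbo_term = weight_t / (t + 1) * nll   # full ELBO weighting
    ls_term = weight_t * uniform           # weight_t only
    return (1 - eps) * elbo_term + eps * ls_term
```

As t grows, the ELBO term is damped by 1/(t+1) while the smoothing term keeps its weight_t scale, which is exactly why the two weightings can diverge.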
Hope this clears things up xD