How did you deal with exposure bias in your implementation?
blackyang opened this issue · comments
Hi,
First, thanks for sharing your work. I have some questions about how your implementation deals with exposure bias. (1) During training, did you use the predicted label at time-step t-1 as the input at time-step t? (2) If yes, did you use softmax or argmax, and how did you back-propagate in these cases? (3) Did you use a teacher model as described in the paper "Sequence-Level Knowledge Distillation"?
It would be great if you could point me to the code corresponding to these questions; I haven't been able to find it myself yet. Thanks in advance!
We do not deal with the exposure bias issue here.
There have been several attempts to address it:
- Scheduled Sampling, Bengio et al., NIPS 2015: roughly, with some probability p, sample from the model and use that sample as the input to the decoder; p is gradually increased during training. Not sure if there are any official implementations, but it should be trivial to implement.
- Sequence-level training, Ranzato et al., ICLR 2016: use the likelihood-ratio trick to backpropagate a sequence-level objective. https://github.com/facebookresearch/MIXER
- Beam Search Optimization, Wiseman and Rush, EMNLP 2016: use beam search during training. https://github.com/harvardnlp/BSO
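For the first option, here is a minimal sketch of the scheduled-sampling input selection. It is not from this repo: `decode_step` is a hypothetical stand-in for a real decoder step (here it just doubles the token id so the two regimes are distinguishable), and the function only illustrates the coin-flip between teacher forcing and feeding the model's own prediction.

```python
import random

def decode_step(prev_token, state):
    # Hypothetical model step standing in for a real decoder forward pass.
    # Returns (predicted_token, new_state); a real model would take the
    # argmax of (or sample from) the softmax over the vocabulary here.
    return (prev_token * 2) % 10, state

def scheduled_sampling_inputs(gold, p_sample, rng):
    """Choose the decoder input at each time-step (Scheduled Sampling,
    Bengio et al., NIPS 2015): with probability p_sample feed the model's
    own previous prediction, otherwise feed the ground-truth token.
    During training, p_sample is annealed from 0 toward 1."""
    inputs = []
    prev = gold[0]  # start token
    state = None
    for t in range(len(gold) - 1):
        inputs.append(prev)
        pred, state = decode_step(prev, state)
        # Coin flip: model prediction vs. teacher forcing for the next input.
        prev = pred if rng.random() < p_sample else gold[t + 1]
    return inputs

# p_sample = 0.0 is pure teacher forcing; p_sample = 1.0 always feeds
# the model's own predictions back in.
print(scheduled_sampling_inputs([1, 2, 3, 4], 0.0, random.Random(0)))  # [1, 2, 3]
print(scheduled_sampling_inputs([1, 2, 3, 4], 1.0, random.Random(0)))  # [1, 2, 4]
```

Note that gradients are typically not propagated through the discrete sampling choice itself; the sampled token is simply treated as the next input.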
Hope this helps!