How did you deal with exposure bias in your implementation?
blackyang opened this issue · comments
Hi,
First, thanks for sharing your work. I have some questions about how your implementation deals with exposure bias. (1) During training, did you use the predicted label at time-step t-1 as the input at time-step t? (2) If yes, did you use softmax or argmax, and how did you back-propagate in these cases? (3) Did you use a teacher model as described in the paper "Sequence-Level Knowledge Distillation"?
It would be great if you could point me to the code corresponding to these questions; I haven't been able to find it myself yet. Thanks in advance!
We do not deal with the exposure bias issue here.
There have been several attempts to address it:
- Scheduled Sampling, Bengio et al., NIPS 2015: roughly, with some probability p, sample from the model and use that sample as the input to the decoder; p is gradually increased during training. Not sure if there are any official implementations, but it should be trivial to implement.
- Sequence-level training, Ranzato et al., ICLR 2016: use the likelihood-ratio trick to backpropagate a sequence-level objective. https://github.com/facebookresearch/MIXER
- Beam Search Optimization, Wiseman and Rush, EMNLP 2016: use beam search during training. https://github.com/harvardnlp/BSO
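For the first option, here is a minimal sketch of the scheduled-sampling input selection. It is not from this repo: `decode_step` is a hypothetical stand-in for a real decoder step (here it just doubles the token id so the two regimes are distinguishable), and the function only illustrates the coin-flip between teacher forcing and feeding the model's own prediction.

```python
import random

def decode_step(prev_token, state):
    # Hypothetical model step standing in for a real decoder forward pass.
    # Returns (predicted_token, new_state); a real model would take the
    # argmax of (or sample from) the softmax over the vocabulary here.
    return (prev_token * 2) % 10, state

def scheduled_sampling_inputs(gold, p_sample, rng):
    """Choose the decoder input at each time-step (Scheduled Sampling,
    Bengio et al., NIPS 2015): with probability p_sample feed the model's
    own previous prediction, otherwise feed the ground-truth token.
    During training, p_sample is annealed from 0 toward 1."""
    inputs = []
    prev = gold[0]  # start token
    state = None
    for t in range(len(gold) - 1):
        inputs.append(prev)
        pred, state = decode_step(prev, state)
        # Coin flip: model prediction vs. teacher forcing for the next input.
        prev = pred if rng.random() < p_sample else gold[t + 1]
    return inputs

# p_sample = 0.0 is pure teacher forcing; p_sample = 1.0 always feeds
# the model's own predictions back in.
print(scheduled_sampling_inputs([1, 2, 3, 4], 0.0, random.Random(0)))  # [1, 2, 3]
print(scheduled_sampling_inputs([1, 2, 3, 4], 1.0, random.Random(0)))  # [1, 2, 4]
```

Note that gradients are typically not propagated through the discrete sampling choice itself; the sampled token is simply treated as the next input.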
Hope this helps!