harvardnlp / seq2seq-attn

Sequence-to-sequence model with LSTM encoder/decoders and attention

Home Page: http://nlp.seas.harvard.edu/code


How did you deal with exposure bias in your implementation?

blackyang opened this issue · comments

Hi,

First, thanks for sharing your work. I have some questions about how you deal with exposure bias in your implementation. (1) During training, did you use the predicted label at time-step t-1 as the input at time-step t? (2) If yes, did you use softmax or argmax, and how did you do back-prop in that case? (3) Did you use a teacher model as described in the paper "Sequence-Level Knowledge Distillation"?

It would be great if you could point me to the code corresponding to these questions; I haven't been able to find it myself yet. Thanks in advance!

We do not deal with the exposure bias issue here.
There have been several attempts to address it:

  • Scheduled Sampling, Bengio et al. NIPS 2015 (roughly: with some probability p, sample from the model and use that sample as the input to the decoder; p is gradually increased during training). Not sure if there are any official implementations, but it should be trivial to implement (a rough sketch is given after this list).

  • Sequence-level training, Ranzato et al. ICLR 2016 (use the likelihood ratio trick to backpropagate a sequence-level objective; a sketch of the estimator is also given after this list).
    https://github.com/facebookresearch/MIXER

  • Beam Search Optimization, Wiseman and Rush EMNLP 2016 (use beam search during training).
    https://github.com/harvardnlp/BSO
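
For concreteness, here is a minimal scheduled-sampling sketch of a decoder loop. It is not taken from seq2seq-attn (which is written in Lua/Torch); it uses PyTorch-style Python, and the decoder, embedding, projection, and sampling_prob names are hypothetical, purely for illustration.

```python
# Minimal scheduled-sampling sketch (Bengio et al., 2015). NOT from
# seq2seq-attn; module names and the sampling_prob schedule are
# hypothetical placeholders.
import random
import torch


def decode_with_scheduled_sampling(decoder, embedding, projection,
                                   hidden, targets, sampling_prob):
    """Unroll the decoder, feeding either the gold token (teacher forcing)
    or the model's own prediction as the next input.

    decoder:       an nn.LSTMCell taking (input, (h, c)) and returning (h, c)
    embedding:     an nn.Embedding mapping token ids to vectors
    projection:    an nn.Linear mapping the hidden state to vocabulary logits
    hidden:        initial (h, c) pair, e.g. from the encoder
    targets:       LongTensor of shape (batch, seq_len) with gold token ids
    sampling_prob: probability of feeding the model's own prediction;
                   gradually increased over the course of training
    """
    seq_len = targets.size(1)
    all_logits = []
    input_tokens = targets[:, 0]          # always start from the gold <bos>
    for t in range(1, seq_len):
        h, c = decoder(embedding(input_tokens), hidden)
        hidden = (h, c)
        logits = projection(h)            # (batch, vocab)
        all_logits.append(logits)
        if random.random() < sampling_prob:
            # Feed the model's own prediction (argmax here; one could also
            # sample from softmax(logits)). No gradient flows through this
            # discrete choice, hence the detach().
            input_tokens = logits.argmax(dim=-1).detach()
        else:
            # Standard teacher forcing: feed the gold token.
            input_tokens = targets[:, t]
    return torch.stack(all_logits, dim=1)  # (batch, seq_len - 1, vocab)
```

And here is a rough sketch of the likelihood-ratio (REINFORCE) surrogate loss behind the sequence-level training of Ranzato et al.; again, the inputs (sample_log_probs, reward, baseline) are hypothetical placeholders, not MIXER's actual interface.

```python
# Rough sketch of the likelihood-ratio (REINFORCE) surrogate used for
# sequence-level training (Ranzato et al., 2016). Inputs are hypothetical
# placeholders, not MIXER's actual interface.


def sequence_level_loss(sample_log_probs, reward, baseline):
    """
    sample_log_probs: (batch,) sum over t of log p(y_t | y_<t, x) for a
                      sequence sampled from the model (requires grad)
    reward:           (batch,) sequence-level score, e.g. sentence BLEU
    baseline:         (batch,) estimate of the expected reward, used to
                      reduce the variance of the gradient estimate
    """
    advantage = (reward - baseline).detach()
    # Minimizing this surrogate gives the likelihood-ratio gradient
    # -E[(r - b) * grad log p(y)], i.e. gradient ascent on expected reward.
    return -(advantage * sample_log_probs).mean()
```

MIXER itself anneals from token-level cross-entropy to this kind of sequence-level objective and learns the baseline jointly; see the linked repository for the actual implementation.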

Hope this helps!

@yoonkim thanks for your detailed reply! Yeah, I recently read these papers; I just wanted to see how people actually implement them in their code.