CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning

Student forcing options/roll-out

kylebgorman opened this issue

After #71, we can now control, for a given training batch, whether teacher or student forcing is used. Some recent work suggests that for sequence-to-sequence models there is an advantage to training with student forcing. Other work recommends gradually rolling out student forcing during training. I propose that we:

  • experiment with a flag that simply enables student forcing during training and see if things still converge
  • also experiment with a linear, batchwise rollout of student forcing (sketched after this list); that is:
    • for each batch, we draw a random sample such that with probability p we use teacher forcing and with probability 1 - p we use student forcing
    • we initialize with p = 1 and, after the warmup phase, linearly decrement p so that p = 0 for the last batch
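As a rough sketch of what the rollout schedule could look like (the function names and the warmup_batches/total_batches parameters are hypothetical, not anything currently in yoyodyne):

```python
import random


def teacher_forcing_probability(
    batch_idx: int, warmup_batches: int, total_batches: int
) -> float:
    """Returns p, the probability of teacher forcing for this batch.

    p = 1 throughout the warmup phase; afterwards it decays linearly
    so that the last batch has p = 0.
    """
    if batch_idx < warmup_batches:
        return 1.0
    decay_span = max(1, total_batches - warmup_batches - 1)
    return max(0.0, 1.0 - (batch_idx - warmup_batches) / decay_span)


def use_teacher_forcing(
    batch_idx: int, warmup_batches: int, total_batches: int
) -> bool:
    """Draws the per-batch sample: teacher forcing with probability p."""
    p = teacher_forcing_probability(batch_idx, warmup_batches, total_batches)
    return random.random() < p
```

The draw happens once per batch, so the decoder runs an entire batch under a single regime and no per-step branching is needed.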

Note that the stochastic option (the second one) is somewhat different from what Bengio et al. do: they apply it at the token level. However, that seems harder and slower to implement, so I am suggesting something simpler to start with; a token-level sketch follows for contrast.
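For contrast, a token-level variant in the spirit of Bengio et al. flips the coin at every decoding step, which pushes the sampling into the inner decoder loop. A minimal sketch, where decoder_step, embed, and the calling convention are all placeholders rather than yoyodyne's actual decoder API:

```python
import torch


def decode_with_token_level_sampling(decoder_step, embed, gold, state, p):
    """Decodes one target sequence, sampling the input symbol per step."""
    symbol = gold[0]  # BOS.
    logits_per_step = []
    for t in range(1, gold.size(0)):
        logits, state = decoder_step(embed(symbol), state)
        logits_per_step.append(logits)
        if torch.rand(()).item() < p:
            symbol = gold[t]  # Teacher forcing: gold symbol.
        else:
            symbol = logits.argmax(-1)  # Student forcing: model prediction.
    return torch.stack(logits_per_step), state
```

Because each step's input depends on the previous step's sampled choice, this variant cannot feed the decoder a precomputed gold input tensor for the whole sequence, which is part of why it is slower and harder to implement.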

Both of these can be thought of as hyperparameter-free (beyond the boolean decision of whether or not to use student forcing during training at all). If either works, we can incorporate it into the master branch.