Vanilla Listener (Follower) Training without RL (Student Forcing + Teacher Forcing?)
siddk opened this issue · comments
Hey Hao,
Been working my way through your repo, and when training the listener via the provided script (agent.bash), it seems that the v0 listener model is trained with a hybrid teacher-forcing + sampling approach with RL.
Specifically, you first seem to be doing a Teacher-Forcing update (which makes sense):
Line 805 in c416108
However, then you do a sampling update that computes the RL (A2C) loss as well:
Line 807 in c416108
Line 392 in c416108
If I wanted to train the "best" Listener model without RL, do you have any recommendations? Setting "feedback = argmax" seems to trigger student forcing (which is what's used in the related work), but should I mix that with Teacher Forcing as well?
Any intuition you have is much appreciated. What I'm currently thinking is to compute the Teacher-Forcing loss, weight it by a hyperparameter, and add the Student-Forcing loss. Otherwise, I might just do Student Forcing all the way through...
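The mix I'm imagining would look something like this minimal sketch (the weight `lam` and all function names are hypothetical, not from the repo):

```python
import math

def nll(probs, target):
    # Negative log-likelihood of the gold action under the model's distribution.
    return -math.log(probs[target])

def mixed_loss(tf_probs, sf_probs, gold, lam=0.5):
    # Hypothetical weighted mix of a teacher-forcing and a student-forcing term.
    # Teacher forcing conditions on the gold history; student forcing conditions
    # on the model's own argmax history -- both are scored against gold actions.
    return lam * nll(tf_probs, gold) + (1.0 - lam) * nll(sf_probs, gold)

# Toy step: same gold action, different distributions from the two rollouts.
loss = mixed_loss(tf_probs=[0.6, 0.3, 0.1],
                  sf_probs=[0.4, 0.5, 0.1],
                  gold=0, lam=0.5)
```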
Thanks. I have not tried mixing teacher forcing and student forcing together. What I found before is that teacher forcing works better than student forcing with this code base. Thus I finally use teacher forcing (TF) as the baseline, running experiments with train_ml=1.0
and train_rl=None
. Looking forward to seeing whether TF + SF would win over TF!
Got it - thanks Hao, really appreciate it!