Vanilla Listener (Follower) Training without RL (Student Forcing + Teacher Forcing?)
siddk opened this issue · comments
Hey Hao,
Been working my way through your repo, and when training the listener via the provided script (agent.bash), it seems that the v0 listener model is trained with a hybrid teacher-forcing + sampling approach with RL.
Specifically, you first seem to be doing a Teacher-Forcing update (which makes sense):
Line 805 in c416108
However, then you do a sampling update that computes the RL (A2C) loss as well:
Line 807 in c416108
Line 392 in c416108
If I wanted to train the "best" Listener model without RL, do you have any recommendations? Setting "feedback = argmax" seems to trigger student forcing (which is what's used in the related work), but should I mix that with Teacher Forcing as well?
Any intuition you have is much appreciated. What I'm currently thinking is to compute the Teacher-Forcing loss, weight it by a hyperparameter, and add the Student-Forcing loss. Otherwise, I might just do Student Forcing all the way through...
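The mix I'm imagining would look something like this minimal sketch (the weight `lam` and all function names are hypothetical, not from the repo):

```python
import math

def nll(probs, target):
    # Negative log-likelihood of the gold action under the model's distribution.
    return -math.log(probs[target])

def mixed_loss(tf_probs, sf_probs, gold, lam=0.5):
    # Hypothetical weighted mix of a teacher-forcing and a student-forcing term.
    # Teacher forcing conditions on the gold history; student forcing conditions
    # on the model's own argmax history -- both are scored against gold actions.
    return lam * nll(tf_probs, gold) + (1.0 - lam) * nll(sf_probs, gold)

# Toy step: same gold action, different distributions from the two rollouts.
loss = mixed_loss(tf_probs=[0.6, 0.3, 0.1],
                  sf_probs=[0.4, 0.5, 0.1],
                  gold=0, lam=0.5)
```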
Thanks. I have not tried mixing teacher forcing and student forcing together. What I found before is that teacher forcing works better than student forcing with this code base. Thus I finally use teacher forcing (TF) as the baseline, running experiments with train_ml=1.0
and train_rl=None
. Looking forward to seeing whether TF + SF would win over TF!
Got it - thanks Hao, really appreciate it!