PPO update mistake?
zcaicaros opened this issue:
In line 110:
for n_epi in range(10000):
    s = env.reset()
    done = False
    while not done:
        for t in range(T_horizon):
            prob = model.pi(torch.from_numpy(s).float())
            m = Categorical(prob)
            a = m.sample().item()
            s_prime, r, done, info = env.step(a)
            model.put_data((s, a, r/100.0, s_prime, prob[a].item(), done))
            s = s_prime
            score += r
            if done:
                break
        model.train_net()  # <------- HERE
I think it should be shifted left to align with while not done, i.e. we update the network parameters only after collecting the data of one full episode. I have tested this and it gives stable performance.
for n_epi in range(10000):
    s = env.reset()
    done = False
    while not done:
        for t in range(T_horizon):
            prob = model.pi(torch.from_numpy(s).float())
            m = Categorical(prob)
            a = m.sample().item()
            s_prime, r, done, info = env.step(a)
            model.put_data((s, a, r/100.0, s_prime, prob[a].item(), done))
            s = s_prime
            score += r
            if done:
                break
    model.train_net()  # <------- UPDATED
Either way is fine!
Updating the policy after each episode leads to a larger mini-batch size, but the mini-batch size then changes from update to update because episode lengths vary.
That's why I update the model at a fixed-length interval (T_horizon in this case).
Thank you!
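For illustration, here is a minimal sketch of how the two schedules affect the mini-batch size handed to train_net(); the episode lengths below are made-up values, not taken from the repo:

# Hypothetical episode lengths, chosen only to illustrate the two schedules.
episode_lengths = [37, 112, 9, 254]
T_horizon = 20

# Per-episode schedule: one update per episode, so the mini-batch size
# equals the episode length and varies from update to update.
per_episode_batches = list(episode_lengths)

# Fixed-interval schedule: an update every T_horizon steps; only the final
# chunk of an episode can be shorter, because the inner loop breaks on done.
fixed_interval_batches = []
for length in episode_lengths:
    while length > T_horizon:
        fixed_interval_batches.append(T_horizon)
        length -= T_horizon
    fixed_interval_batches.append(length)

print(per_episode_batches)     # [37, 112, 9, 254]
print(fixed_interval_batches)  # [20, 17, 20, 20, 20, 20, 20, 12, 9, 20, ...]

Every update in the fixed-interval schedule uses at most T_horizon transitions, which keeps the effective batch size roughly constant across updates, whereas the per-episode schedule lets it swing with episode length.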