PPO update mistake?
zcaicaros opened this issue:
In line 110:
for n_epi in range(10000):
    s = env.reset()
    done = False
    while not done:
        for t in range(T_horizon):
            prob = model.pi(torch.from_numpy(s).float())
            m = Categorical(prob)
            a = m.sample().item()
            s_prime, r, done, info = env.step(a)
            model.put_data((s, a, r/100.0, s_prime, prob[a].item(), done))
            s = s_prime
            score += r
            if done:
                break
        model.train_net()  # <------- HERE
I think it should be shifted left to align with while not done, i.e. we update the network parameters only after collecting the data of one full episode. I have tested this and it gives stable performance.
for n_epi in range(10000):
    s = env.reset()
    done = False
    while not done:
        for t in range(T_horizon):
            prob = model.pi(torch.from_numpy(s).float())
            m = Categorical(prob)
            a = m.sample().item()
            s_prime, r, done, info = env.step(a)
            model.put_data((s, a, r/100.0, s_prime, prob[a].item(), done))
            s = s_prime
            score += r
            if done:
                break
    model.train_net()  # <------- UPDATED
Either way is fine!
Updating the policy after each episode leads to a larger mini-batch size, but the mini-batch size then changes from update to update because episode lengths vary.
That's why I update the model at a fixed-length interval (T_horizon in this case).
Thank you!
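For illustration, here is a minimal sketch of how the two schedules affect the mini-batch size handed to train_net(); the episode lengths below are made-up values, not taken from the repo:

# Hypothetical episode lengths, chosen only to illustrate the two schedules.
episode_lengths = [37, 112, 9, 254]
T_horizon = 20

# Per-episode schedule: one update per episode, so the mini-batch size
# equals the episode length and varies from update to update.
per_episode_batches = list(episode_lengths)

# Fixed-interval schedule: an update every T_horizon steps; only the final
# chunk of an episode can be shorter, because the inner loop breaks on done.
fixed_interval_batches = []
for length in episode_lengths:
    while length > T_horizon:
        fixed_interval_batches.append(T_horizon)
        length -= T_horizon
    fixed_interval_batches.append(length)

print(per_episode_batches)     # [37, 112, 9, 254]
print(fixed_interval_batches)  # [20, 17, 20, 20, 20, 20, 20, 12, 9, 20, ...]

Every update in the fixed-interval schedule uses at most T_horizon transitions, which keeps the effective batch size roughly constant across updates, whereas the per-episode schedule lets it swing with episode length.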