danijar / dreamerv2

Mastering Atari with Discrete World Models

Home Page: https://danijar.com/dreamerv2


Have you considered using a PPO actor instead of a normal Actor-Critic?

outdoteth opened this issue · comments

I think a lot of improvement could be made by using a PPO actor.

PPO clips the policy probability ratio so that it can safely take multiple gradient steps on the same batch of collected data, which becomes slightly off-policy after the first update. DreamerV2 uses a world model and can therefore generate an essentially unlimited amount of fresh on-policy data without interacting with the environment, so there is little point in training on the same imagined trajectories multiple times.
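
For reference, here is a minimal sketch of PPO's clipped surrogate objective (the function name and numpy formulation are illustrative, not code from this repo); it shows that the clipping applies to the ratio between the updated policy and the data-collecting policy, which is the mechanism that makes reusing the same batch safe:

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective from PPO (Schulman et al., 2017).

    Clipping the probability ratio limits how far the updated policy can move
    from the policy that collected the data, so several gradient steps can be
    taken on the same (now slightly off-policy) batch.
    """
    ratio = np.exp(log_prob_new - log_prob_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Take the pessimistic (minimum) term and average over the batch.
    return np.mean(np.minimum(unclipped, clipped))
```

Since DreamerV2 can always imagine new trajectories under the current policy, the off-policy correction that this clipping provides buys little, which is the point made above.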