About PPO

Question

LpLegend opened this issue 4 years ago · comments

I don't think this code can solve the problem (Pendulum). Another question: why is the reward computed as 'running_reward * 0.9 + score * 0.1'?
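For context on the second question: the quoted expression has the form of an exponential moving average, which is often used to smooth noisy per-episode scores. The sketch below only illustrates that arithmetic. The function name `update_running_reward`, the starting value, and the episode scores are made up for illustration, and it is an assumption that the repository uses the value for logging or a stopping check rather than for training.

```python
def update_running_reward(running_reward, score, alpha=0.1):
    """Exponential moving average of episode scores.

    Keeps (1 - alpha) of the previous value and blends in alpha of the
    newest score; alpha=0.1 reproduces 'running_reward * 0.9 + score * 0.1'.
    """
    return running_reward * (1 - alpha) + score * alpha


# Hypothetical values, chosen only to show how slowly the average moves.
running_reward = -1000.0
episode_scores = [-900.0, -800.0, -700.0]

history = []
for score in episode_scores:
    running_reward = update_running_reward(running_reward, score)
    history.append(running_reward)
    print(f"score={score:8.1f}  running_reward={running_reward:8.1f}")
```

With a weight of 0.1 on the newest score, a single unusually good or bad episode shifts the running value only slightly, so the curve tracks the trend rather than the noise.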

RuoqueLi · Answer 1 · Wed Jan 20 2021 16:43:37 GMT+0800 (China Standard Time)

I changed the activation function from ReLU to tanh, but there was no improvement.

wangzhenxiong · Answer 2 · Wed Jul 21 2021 16:36:46 GMT+0800 (China Standard Time)

I don't think this code can solve the problem (Pendulum). Another question: why is the reward computed as 'running_reward * 0.9 + score * 0.1'?

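I ran into this problem too. I asked the author of elegantrl, and he said that applying tanh first and then sampling the action through torch.distribution affects the entropy, so it cannot converge. However, I don't like the way elegantrl writes PPO, so I am still looking for someone else's code.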
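One possible reading of that remark, sketched in plain Python rather than PyTorch: if tanh is applied to the mean and the action is then sampled from a Gaussian, the sample is unbounded and can leave the valid range, so the distribution whose entropy is computed differs from the actions the environment effectively receives once they are clipped. A common alternative is to sample first and squash afterwards, correcting the log-probability for the tanh change of variables. The function names, the normalized range [-1, 1], and the numbers are assumptions for illustration. This is not the code from the repository or from elegantrl, and it is not a confirmed diagnosis of the convergence problem.

```python
import math
import random


def normal_log_prob(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def sample_tanh_then_gaussian(raw_mean, std, rng):
    """Squash the mean with tanh, then sample: the action is unbounded."""
    mean = math.tanh(raw_mean)
    action = rng.gauss(mean, std)
    return action, normal_log_prob(action, mean, std)


def sample_gaussian_then_tanh(raw_mean, std, rng):
    """Sample first, then squash: the action stays within [-1, 1]."""
    u = rng.gauss(raw_mean, std)
    action = math.tanh(u)
    # Change of variables: log p(a) = log N(u) - log(1 - tanh(u)^2).
    # The small constant guards against log(0) when |action| is near 1.
    log_prob = normal_log_prob(u, raw_mean, std) - math.log(1.0 - action ** 2 + 1e-6)
    return action, log_prob


rng = random.Random(0)  # fixed seed so the sketch is repeatable
unbounded = [sample_tanh_then_gaussian(2.0, 0.5, rng)[0] for _ in range(1000)]
bounded = [sample_gaussian_then_tanh(2.0, 0.5, rng)[0] for _ in range(1000)]

out_of_range = sum(abs(a) > 1.0 for a in unbounded)
print("tanh-then-sample actions outside [-1, 1]:", out_of_range)
print("sample-then-tanh actions outside [-1, 1]:", sum(abs(a) > 1.0 for a in bounded))
```

For Pendulum, the squashed action would still need to be scaled to the environment's torque range before being passed to `step`.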

Pengbo Zhao · Answer 3 · Wed Aug 25 2021 10:55:19 GMT+0800 (China Standard Time)

Have you found working code yet? Could you share a link? I would really appreciate it!