Performance compared with SB3
qiuruiyu opened this issue · comments
Problem Description
I found that, on my own customized env built on Gymnasium, training converges well with stable-baselines3, with a final reward around 17.9. However, when using CleanRL, the reward is only about 20 and goes no higher than that. This is really confusing, and the difference in reward makes the controller trained with RL perform poorly in my evaluation.
- The hyperparameters for PPO are exactly the same in both training frameworks, using the default settings.
- The env is wrapped only with FlattenObservation and RecordEpisodeStatistics. I have also tried wrapping it with observation normalization and reward normalization, but the reward becomes even lower.
Possible Solution
Is there some problem with the trade-off between exploitation and exploration? Or some problem with the setting of the action std?
update:
When I just save the model during training, the reward progresses normally, but when I add evaluation, the reward gets stuck. It's really confusing.
I am afraid I can't help much there. SB3's PPO is slightly different. See
https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Good luck!