Reward scale
lgvaz opened this issue · comments
Certain reward scale values can generate instabilities, as described in #9 .
To alleviate this issue, wouldn't it be a good idea to divide log_prob by reward_scale instead of multiplying the reward by it? Algorithmically speaking, I think this would have the same effect.
That's right, you can alternatively divide log_prob by reward_scale for the same effect. It can indeed be slightly more stable, especially at the beginning of learning.
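A minimal sketch of why the two are equivalent, assuming a SAC-style soft target of the form reward_scale * r - log_prob (the array values below are purely illustrative):

```python
import numpy as np

reward_scale = 5.0
rewards = np.array([1.0, -0.5, 2.0])       # illustrative rewards
log_prob = np.array([-1.2, 0.3, -0.7])     # illustrative policy log-probabilities

# Option A: multiply the reward by reward_scale.
target_a = reward_scale * rewards - log_prob

# Option B: divide log_prob by reward_scale instead.
target_b = rewards - log_prob / reward_scale

# The two targets differ only by the constant factor reward_scale,
# so they induce the same optimal policy; option B keeps the
# target magnitudes smaller, which can be slightly more stable.
assert np.allclose(target_a, reward_scale * target_b)
```

The overall factor of reward_scale rescales gradients but not the optimum, so either form trades off reward against entropy identically.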