Simple Policy Faulty Loss Function

Question

wert23239 opened this issue 7 years ago · comments

Your loss function for the simple policy doesn't really make sense:

"Loss=-Log(pi)*A"

If you have a weight of .9 and a reward of 1,
your loss is about .046 (taking the log in base 10).

But if you have a weight of .9 and your reward is 3,
your loss increases to about .137.

So the only reason your function works at all is that you only assign a single amount of reward.
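
For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes a base-10 log, which appears to be what the figures above use, and the helper name `policy_loss` is my own for illustration, not something from your code:

```python
import math


def policy_loss(weight, advantage, log=math.log10):
    """Loss = -log(pi) * A for a single chosen action.

    `weight` plays the role of pi and `advantage` the role of A.
    The base-10 default is an assumption; pass math.log for the natural log.
    """
    return -log(weight) * advantage


# Same weight, two different rewards.
loss_r1 = policy_loss(0.9, 1)
loss_r3 = policy_loss(0.9, 3)

print(round(loss_r1, 3))  # 0.046
print(round(loss_r3, 3))  # 0.137
```

At a fixed weight the loss scales linearly with the reward, so tripling the reward triples the loss regardless of which log base is used.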