Exercise 4.8: Avoid numerical instability in policy
mhoehle opened this issue
Suggested replacement:
op_a = np.argmax(np.round(v, decimals=5))
As noted in ShangtongZhang/reinforcement-learning-an-introduction#83, this makes the action choice more deterministic in situations where several actions have the same value in exact arithmetic but differ slightly because of floating-point errors. At least when I added the suggested rounding, the produced figure(s) resembled Fig. 4.3 more closely:
This removes the artefacts seen in the original figure produced by the code (Ex4.9_plotB.jpg in the repo or see below):
P.S. The number of decimals might have to be increased so that it also works for digits=12.
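To illustrate the tie-breaking effect described above, here is a minimal sketch. The helper name `greedy_action` and the toy value array are made up for illustration and are not taken from the repository code. The three entries are meant to be equal, but the middle one carries a tiny floating-point error, so a plain `np.argmax` picks it, while rounding first restores the tie and `np.argmax` falls back to the first index.

```python
import numpy as np


def greedy_action(values, decimals=None):
    """Return the index of the first maximal value.

    Hypothetical helper for illustration: if `decimals` is given, the
    values are rounded before the argmax, so that values differing only
    by floating-point noise count as ties.
    """
    values = np.asarray(values, dtype=float)
    if decimals is not None:
        values = np.round(values, decimals=decimals)
    # np.argmax returns the first index among equal maxima.
    return int(np.argmax(values))


# Toy action values: all three "should" be 0.3, but 0.1 + 0.2 evaluates
# to a float slightly larger than 0.3.
v = np.array([0.3, 0.1 + 0.2, 0.3])

raw_choice = greedy_action(v)                  # affected by the noise
rounded_choice = greedy_action(v, decimals=5)  # noise rounded away

print(raw_choice, rounded_choice)  # prints: 1 0
```

The number of decimals is a trade-off: it has to be coarse enough to absorb the accumulated floating-point error, yet fine enough that genuinely different action values are not merged into a tie.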