PacktPublishing / Deep-Reinforcement-Learning-Hands-On

Hands-on Deep Reinforcement Learning, published by Packt

Chapter08: exploration in the validation procedure, is it an issue?

domixit opened this issue · comments

Hello Max,
great work that allows us to dig into the RL world!

In the Chapter 8 code:

In the validation.py procedure, I noticed that epsilon is kept non-zero (it defaults to 0.2), which means the policy is not greedy but rather epsilon-greedy. This means that 2 out of 10 actions are random!

RL theory says that validation should be purely greedy (epsilon = 0).
Is this an error, or was it done deliberately?
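For reference, a minimal sketch of the distinction being discussed (the function name and NumPy-based implementation are illustrative, not taken from the book's code): with epsilon = 0 the selection below is purely greedy, while any non-zero epsilon makes it epsilon-greedy.

```python
import numpy as np

def select_action(q_values, epsilon=0.0, rng=None):
    """Epsilon-greedy selection: with probability epsilon take a
    uniformly random action, otherwise take the greedy (argmax) action.
    With epsilon=0.0 this degenerates to a purely greedy policy."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```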

Hi!

Good question! It wasn't stated explicitly in the book; I will add a couple of sentences about this in the 2nd edition.

The reason behind the non-zero eps is to test the robustness of our policy by injecting noise into the testing sequence. The main motivation is that we don't want the network to just memorize and replay some best sequence of actions (which could easily be the only sequence in a deterministic environment). We want the network to be robust and able to recover from random perturbations. So, we inject random actions and run several tests. In fact, 0.2 could be a bit too high, probably a leftover from some experiment. I normally use 2-5%.
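To see concretely how much noise each setting injects, here is a small simulation (not part of the book's code) of how many validation steps would take a random action under a given epsilon:

```python
import numpy as np

def count_random_actions(epsilon, n_steps, seed=0):
    """Simulate the epsilon-greedy coin flips over a validation run and
    count how many steps would take a random action instead of the
    greedy one. With epsilon=0.2 that is roughly 1 step in 5; with the
    2-5% range mentioned above, only a handful of perturbations."""
    rng = np.random.default_rng(seed)
    return int((rng.random(n_steps) < epsilon).sum())
```

Over a long run the fraction of random actions concentrates around epsilon, which is why lowering it from 0.2 to 0.02 turns one-in-five random moves into one-in-fifty.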