Ex 6.13

Question

burmecia opened this issue 4 years ago

I think the update equation for Double Expected Sarsa with an epsilon-greedy target policy can be:

Q_{1}(S_{t},A_{t})\leftarrow Q_{1}(S_{t},A_{t}) + \alpha\left[R_{t+1}+\gamma\sum_a\pi(a|S_{t+1})Q_{2}(S_{t+1},a)-Q_{1}(S_{t},A_{t})\right]

where

\pi(a|s)=\begin{cases}1-\epsilon+\frac{\epsilon}{|A(s)|}, & \text{if } a=\arg\max_{a'}\left(Q_{1}(s,a')+Q_{2}(s,a')\right)\\\frac{\epsilon}{|A(s)|}, & \text{otherwise}\end{cases}
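
As a quick sanity check on the policy definition, here is a minimal sketch of the action probabilities at a single state. The function name `epsilon_greedy_probs` and the list-based inputs are my own assumptions for illustration, and ties in Q_1 + Q_2 are broken by taking the first maximal action, which the formula itself does not specify.

```python
def epsilon_greedy_probs(q1_s, q2_s, epsilon):
    """Return pi(.|s) for the epsilon-greedy policy on Q1 + Q2 at one state.

    q1_s, q2_s: action values Q1(s, .) and Q2(s, .) as equal-length lists.
    Hypothetical helper; ties go to the first maximal action (an assumption).
    """
    q_sum = [x + y for x, y in zip(q1_s, q2_s)]
    n_actions = len(q_sum)
    # Greedy action with respect to the sum of both estimates.
    greedy = max(range(n_actions), key=q_sum.__getitem__)
    # Every action gets epsilon / |A(s)|; the greedy one gets 1 - epsilon on top.
    probs = [epsilon / n_actions] * n_actions
    probs[greedy] += 1.0 - epsilon
    return probs


probs = epsilon_greedy_probs([1.0, 0.0, 0.5], [0.2, 0.1, 0.9], epsilon=0.1)
```

The probabilities sum to 1, and the greedy action is picked from the summed estimates rather than from either table alone.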

YIFAN WANG · Answer 1 · Wed May 06 2020 06:39:58 GMT+0800 (China Standard Time)

Looks valid.
I will add it to 6.13 and credit you.

Dominik Veselý · Answer 2 · Wed May 12 2021 21:54:22 GMT+0800 (China Standard Time)

I think it should be made clear that the roles of Q_1 and Q_2 need to be swapped with probability 0.5 at each step of the episode.
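
To make the swap concrete, here is a rough sketch of one tabular update step. It assumes Q_1 and Q_2 are stored as nested lists indexed `[state][action]`, that terminal states keep all-zero values, and that ties go to the first maximal action; the function name and signature are hypothetical, not taken from the book.

```python
import random


def double_expected_sarsa_step(Q1, Q2, s, a, r, s_next,
                               alpha, gamma, epsilon, rng=random):
    """Apply one Double Expected Sarsa update in place (illustrative sketch)."""
    # Coin flip: decide which table is updated and which one evaluates the target.
    if rng.random() < 0.5:
        Q_upd, Q_eval = Q1, Q2
    else:
        Q_upd, Q_eval = Q2, Q1

    # Epsilon-greedy target policy with respect to Q1 + Q2 at the next state.
    q_sum = [x + y for x, y in zip(Q1[s_next], Q2[s_next])]
    n_actions = len(q_sum)
    greedy = max(range(n_actions), key=q_sum.__getitem__)
    pi = [epsilon / n_actions] * n_actions
    pi[greedy] += 1.0 - epsilon

    # Expected value of the next state under pi, using the *other* table.
    expected = sum(p * q for p, q in zip(pi, Q_eval[s_next]))
    Q_upd[s][a] += alpha * (r + gamma * expected - Q_upd[s][a])


# Usage: two states, two actions, both tables start with the same values.
Q1 = [[0.0, 0.0], [1.0, 3.0]]
Q2 = [[0.0, 0.0], [1.0, 3.0]]
double_expected_sarsa_step(Q1, Q2, s=0, a=0, r=1.0, s_next=1,
                           alpha=0.5, gamma=0.9, epsilon=0.1,
                           rng=random.Random(0))
```

Only one of the two tables changes per step, and which one is decided by the coin flip, so over many steps both get updated about equally often.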