LyWangPX / Reinforcement-Learning-2nd-Edition-by-Sutton-Exercise-Solutions

Solutions of Reinforcement Learning, An Introduction

Ex 6.13

burmecia opened this issue

I think the update equations for Double Expected Sarsa with epsilon-greedy target policy can be:

Q_{1}(S_{t},A_{t})\leftarrow Q_{1}(S_{t},A_{t}) + \alpha\left[R_{t+1}+\gamma\sum_a\pi(a|S_{t+1})Q_{2}(S_{t+1},a)-Q_{1}(S_{t},A_{t})\right]

where

\pi(a|s)=\begin{cases}1-\epsilon+\frac{\epsilon}{|A(s)|}, & \text{if } a=\arg\max_{a'}\left(Q_{1}(s,a')+Q_{2}(s,a')\right)\\\frac{\epsilon}{|A(s)|}, & \text{otherwise}\end{cases}
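
For anyone who wants to check the arithmetic, here is a minimal sketch of the two equations above in Python with NumPy. It assumes tabular `Q1` and `Q2` stored as arrays of shape (states, actions), and it breaks ties in the argmax by taking the lowest action index; the function names, the `terminal` flag and the tie-breaking rule are my own assumptions, not part of the proposal.

```python
import numpy as np

def epsilon_greedy_probs(q1_s, q2_s, epsilon):
    """Return pi(.|s), epsilon-greedy with respect to Q1(s, .) + Q2(s, .)."""
    q_sum = np.asarray(q1_s, dtype=float) + np.asarray(q2_s, dtype=float)
    n_actions = q_sum.shape[0]
    # Every action gets epsilon / |A(s)| ...
    probs = np.full(n_actions, epsilon / n_actions)
    # ... and the greedy action gets the remaining 1 - epsilon on top.
    # Assumption: ties are broken by the lowest index (np.argmax behaviour).
    probs[int(np.argmax(q_sum))] += 1.0 - epsilon
    return probs

def update_q1(Q1, Q2, s, a, r, s_next, alpha, gamma, epsilon, terminal=False):
    """Apply one Double Expected Sarsa update to Q1[s, a] in place."""
    if terminal:
        expected = 0.0  # no bootstrapping from a terminal state
    else:
        pi = epsilon_greedy_probs(Q1[s_next], Q2[s_next], epsilon)
        # Expectation over the target policy, evaluated with Q2.
        expected = float(np.dot(pi, Q2[s_next]))
    Q1[s, a] += alpha * (r + gamma * expected - Q1[s, a])
```

Note that the greedy action is chosen from the sum of both tables, while the expectation in the target uses only `Q2`.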

Looks valid.
Will add it to 6.13 and credit you by name.

I think it should be made clear that the roles of Q_1 and Q_2 need to be swapped with probability 0.5 at each step of the episode, so that both estimates get updated.
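
A rough sketch of what that swap could look like for a single step, in Python with NumPy. It assumes tabular `Q1` and `Q2` arrays of shape (states, actions), terminal-state rows that stay at zero, and an `rng` object with a `random()` method such as the one returned by `np.random.default_rng()`; the function name and argument order are hypothetical.

```python
import numpy as np

def double_expected_sarsa_step(Q1, Q2, s, a, r, s_next,
                               alpha, gamma, epsilon, rng):
    """Update one of the two tables in place, chosen by a fair coin flip."""
    # With probability 0.5 update Q1 using Q2 for the expectation,
    # otherwise update Q2 using Q1.
    if rng.random() < 0.5:
        q_update, q_eval = Q1, Q2
    else:
        q_update, q_eval = Q2, Q1

    # Target policy: epsilon-greedy with respect to Q1 + Q2, so it is the
    # same whichever table is being updated.
    n_actions = Q1.shape[1]
    pi = np.full(n_actions, epsilon / n_actions)
    pi[int(np.argmax(Q1[s_next] + Q2[s_next]))] += 1.0 - epsilon

    # Expected value of the next state under pi, using the other table.
    expected = float(np.dot(pi, q_eval[s_next]))
    q_update[s, a] += alpha * (r + gamma * expected - q_update[s, a])
```

Called once per environment step with `rng = np.random.default_rng()`, this updates each table about half of the time.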