Pseudocode for "better" policy evaluation in CEM
dniku opened this issue · comments
Dmitry Nikulin commented
The end of the notebook suggests evaluating the policy in a "theoretically better" way: sample the first action for each initial state uniformly at random, then play with the current policy until the end of the episode. A user on the Coursera forum reports that pseudocode would make the idea clearer.
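Here is a rough sketch of how I read that suggestion, not the notebook's actual code. It assumes a tabular policy stored as an array of shape `[n_states, n_actions]` and an environment whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, done)`. `ChainEnv` and `generate_session_uniform_first_action` are hypothetical names made up for the illustration; the real notebook environment may have a different interface.

```python
import numpy as np


class ChainEnv:
    """Hypothetical toy environment (not from the notebook).

    The agent starts at state 0 and the episode ends at the last state.
    """

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.n_actions = 2  # 0 = step left, 1 = step right
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        # Clip so the agent cannot leave the chain.
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else -0.1
        return self.state, reward, done


def generate_session_uniform_first_action(env, policy, t_max=100, rng=None):
    """Play one episode: uniform first action, current policy afterwards.

    Assumes `policy[s]` is a probability vector over actions for state `s`.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_actions = policy.shape[1]

    states, actions = [], []
    total_reward = 0.0
    s = env.reset()

    for t in range(t_max):
        if t == 0:
            # First step only: ignore the policy, pick an action uniformly.
            a = int(rng.integers(n_actions))
        else:
            # Every later step: sample from the current policy as usual.
            a = int(rng.choice(n_actions, p=policy[s]))

        next_s, r, done = env.step(a)
        states.append(s)
        actions.append(a)
        total_reward += r
        s = next_s
        if done:
            break

    return states, actions, total_reward
```

The only difference from an ordinary session generator is the `t == 0` branch. The returned `total_reward` can then be grouped by the (initial state, first action) pair, so every first action gets evaluated even if the current policy would rarely choose it.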