Pseudocode for "better" policy evaluation in CEM

Question

dniku opened this issue 4 years ago

The end of the notebook suggests evaluating the policy in a "theoretically better" way: for each initial state, sample the initial action uniformly at random, then play with the current policy until the end of the session. A user on the Coursera forum reports that pseudocode would make the idea clearer.
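
For reference, here is a rough sketch of how I read the suggestion, not the notebook's actual code. It assumes a tabular policy stored as a list of per-state action probabilities. `ChainEnv` and `evaluate_with_random_first_action` are hypothetical names made up for illustration, and the toy environment's `step` returns only `(state, reward, done)`, which is simpler than what a real Gym environment returns.

```python
import random


class ChainEnv:
    """Hypothetical toy environment (not from the notebook).

    States 0..4 lie on a line; action 1 moves right, action 0 moves left.
    Reaching the last state ends the session with reward +10; every other
    step costs -1.
    """

    n_states = 5
    n_actions = 2

    def __init__(self, rng):
        self.rng = rng
        self.state = 0

    def reset(self):
        # Start in a random non-terminal state.
        self.state = self.rng.randrange(self.n_states - 1)
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 10.0 if done else -1.0
        return self.state, reward, done


def evaluate_with_random_first_action(env, policy, rng, n_sessions=100, t_max=50):
    """Average return when only the first action of each session is uniform.

    policy[s][a] is assumed to hold the probability of action a in state s.
    """
    n_actions = len(policy[0])
    returns = []
    for _ in range(n_sessions):
        state = env.reset()
        total_reward = 0.0
        for t in range(t_max):
            if t == 0:
                # First step: ignore the policy, pick an action uniformly.
                action = rng.randrange(n_actions)
            else:
                # Every later step: sample from the current policy.
                action = rng.choices(range(n_actions), weights=policy[state])[0]
            state, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        returns.append(total_reward)
    return sum(returns) / len(returns)
```

With a policy that always moves right, e.g. `policy = [[0.0, 1.0]] * 5`, the only randomness left is the initial state and the uniformly sampled first action. In the notebook itself the same change would presumably go into the session-generating function, with the environment and policy array defined there.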