dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.

Home Page: http://www.wildml.com/2016/10/learning-reinforcement-learning/

Gambler's Problem: 0 Stake Allowed?

mparigi opened this issue · comments

In the solution, it says "Your minimum bet is 1". However, the specification says "The actions are stakes, a ∈ {0, 1, . . . , min(s, 100 − s)}", implying a bet of 0 is fine. Which is correct?

A bit late to the discussion, but in this problem it's actually not advisable to use 0 as a stake, because it's an undiscounted MDP.

A stake of 0 gives a reward of 0 and leaves you in the same state. In a discounted MDP that might be acceptable, because the return shrinks with each time step, but in an undiscounted MDP (gamma = 1) the transition has no cost at all: the backed-up value of action 0 is exactly the value of the current state. Value iteration will therefore end up treating 0 as one of the best actions, even though it just loops in place.

If there were a negative reward for the action, or if the problem were discounted, it would be fine.
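A minimal sketch of the one-step lookahead makes the tie explicit (this is not the notebook's exact code; `p_h`, the goal of 100, and the +1 terminal reward are just the standard problem setup):

```python
import numpy as np

def one_step_lookahead(s, V, p_h=0.4, goal=100, gamma=1.0):
    """Backed-up value of every stake from capital s (stake 0 included)."""
    A = np.zeros(goal + 1)
    for a in range(0, min(s, goal - s) + 1):
        reward = 1.0 if s + a == goal else 0.0   # +1 only for reaching the goal
        # win with probability p_h, lose with probability 1 - p_h
        A[a] = p_h * (reward + gamma * V[s + a]) + (1 - p_h) * gamma * V[s - a]
    return A

# For a = 0 the expression collapses to p_h * V[s] + (1 - p_h) * V[s] == V[s]:
# with gamma = 1 and zero reward, "doing nothing" is exactly as good as the
# state itself, so at convergence it ties with the true best stake.
# With gamma < 1 (or a negative per-step reward) the a = 0 entry would be
# strictly worse and the problem would disappear.
```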

To put it in perspective, consider the following cases for a capital of 99 (where only stakes of 0 and 1 are allowed):

  • You stake 1 (you either win, or end up with a capital of 98): this has some value, determined by the return, which depends on the probabilities of each outcome (p_h and 1 − p_h), the rewards, and the values of the next states; the exact number is not relevant here.
  • You stake 0 and end up with a capital of 99 (the same as before), repeat the 0 stake one million times, still end up with 99, and only then stake 1: this has exactly the same value as the previous case, because it's an undiscounted MDP and a stake of 0 gives a reward of 0 while staying in the same state, creating a loop (see the small numerical check after this list).
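Here is a tiny numerical check of those two cases; `p_h` and the value assumed for a capital of 98 are made-up numbers, only there to show that the zero-stake prefix contributes nothing to the return when gamma = 1:

```python
# Made-up numbers: p_h and V_98 (value of capital 98) are assumptions,
# not values taken from the repo's solution.
p_h, V_98, gamma = 0.4, 0.2, 1.0

# Case 1: stake 1 immediately from a capital of 99.
value_stake_1 = p_h * 1.0 + (1 - p_h) * V_98

# Case 2: stake 0 for a million steps (reward 0, same state each time),
# then stake 1.
k = 1_000_000
value_stake_0_then_1 = sum(gamma ** t * 0.0 for t in range(k)) + gamma ** k * value_stake_1

# With gamma = 1 the zero-stake loop adds nothing, so both returns are equal.
assert value_stake_0_then_1 == value_stake_1
```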

That said, you might still be able to allow 0 as an action if, when extracting the policy, you always pick the highest stake among the ties (a stake of 0 might be a best action, but it is never the unique best action: there will always be at least one other stake with the same value). I haven't tried doing this, though.
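One way to do that tie-breaking (just a sketch of the idea, not code from the repo) is to pick the largest index among the actions that tie for the maximum, instead of `np.argmax`, which returns the first and therefore the smallest stake:

```python
import numpy as np

def best_stake(action_values, tol=1e-9):
    """Highest stake among all stakes whose value ties with the maximum."""
    best_value = np.max(action_values)
    tied = np.flatnonzero(action_values >= best_value - tol)
    return tied[-1]          # picks 0 only if 0 is the only action within tol of the max

# Hypothetical action values for some state, stakes 0..2: stake 0 ties with stake 1.
vals = np.array([0.5, 0.5, 0.4])
print(best_stake(vals))      # -> 1, whereas np.argmax(vals) would return 0
```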

You can see how the policies differ for the same values, depending on whether the smallest or the highest possible stake is taken, in the other issue about the same exercise: #172