[Bug Report] Calculating rewards for Blackjack toy-text env
peterhungh3 opened this issue · comments
Describe the bug
In blackjack.py: reward = cmp(score(self.player), score(self.dealer))
Current rewards seem wrong for 2 edge cases:
- Player: Blackjack vs. Dealer: 21. Currently this returns a draw (reward = 0), while it should be 1 (because the player wins).
- Player and dealer both bust: Currently this returns a draw (reward = 0), while it should be -1 (the dealer wins).
Code example
In blackjack.py:
Inside function step():
reward = cmp(score(self.player), score(self.dealer))
And definition of score() and cmp():
    def score(hand):  # What is the score of this hand (0 if bust)
        return 0 if is_bust(hand) else sum_hand(hand)

    def cmp(a, b):
        return float(a > b) - float(a < b)
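To make the two reported edge cases concrete, here is a minimal standalone sketch. It does not import the environment: `score()` and `cmp()` are copied from above, while `usable_ace()`, `sum_hand()` and `is_bust()` are local re-implementations written for this illustration and are assumed to behave like the helpers in blackjack.py (an ace counts as 11 when that does not bust the hand).

```python
# Standalone sketch: helpers are local stand-ins, assumed to mirror blackjack.py.

def usable_ace(hand):
    # Assumption: an ace (1) may count as 11 if the hand stays at or below 21.
    return 1 in hand and sum(hand) + 10 <= 21


def sum_hand(hand):
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)


def is_bust(hand):
    return sum_hand(hand) > 21


def score(hand):  # What is the score of this hand (0 if bust)
    return 0 if is_bust(hand) else sum_hand(hand)


def cmp(a, b):
    return float(a > b) - float(a < b)


# Case 1: two-card blackjack vs. a three-card 21. Both score 21.
natural_vs_21 = cmp(score([1, 10]), score([7, 7, 7]))

# Case 2: both hands bust. Both score 0.
both_bust = cmp(score([10, 9, 5]), score([10, 6, 10]))

print(natural_vs_21, both_bust)  # 0.0 0.0
```

In both cases the comparison alone cannot tell the hands apart, because `score()` maps every bust to 0 and every 21 to 21 regardless of how many cards were used.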
System info
- Gymnasium, main branch
- Python 3.9
Additional context
No response
Checklist
- I have checked that there is no similar issue in the repo
I'm surprised that no one has noticed this issue with the reward function, though admittedly it only affects edge cases.
What are your suggested changes to the reward in terms of code?
Could you provide some testing code for this behaviour? (You can use known seeds and actions to test particular outcomes.)
For case 1: player 21 vs. dealer BJ, and vice versa. Currently, the returned reward is 0 (draw).
Example code to test player = BJ and dealer = 21:

    from gymnasium.envs.toy_text.blackjack import BlackjackEnv, sum_hand

    env = BlackjackEnv(natural=True)
    while True:
        obs, _ = env.reset()
        if obs[0] == 21:  # Player BJ
            action = 0  # stick, so the dealer plays out their hand
            next_obs, reward, terminated, truncated, info = env.step(action)
            if sum_hand(env.dealer) == 21 and len(env.dealer) > 2:  # dealer 21, not BJ
                print(f"Player: {env.player} & dealer = {env.dealer}, "
                      f"reward = {reward}")
                assert reward == 1.5, reward
                break  # stop once a matching hand has been checked
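Example code to test player = 21 and dealer = BJ:

    env = BlackjackEnv(natural=True)
    while True:
        obs, _ = env.reset()
        if sum_hand(env.player) == 21:  # ignore player BJ, we want to find 21
            continue
        next_obs, reward, terminated, truncated, info = env.step(1)  # hit
        if terminated or sum_hand(env.player) != 21:
            continue
        # A hit that does not bust leaves the hand open with reward 0,
        # so stick to let the dealer play and settle the final reward.
        next_obs, reward, terminated, truncated, info = env.step(0)
        if (sum_hand(env.player) == 21 and len(env.player) > 2 and  # player 21
                sum_hand(env.dealer) == 21 and len(env.dealer) == 2):  # dealer BJ
            print(f"Player: {env.player} & dealer = {env.dealer}, "
                  f"reward = {reward}")
            assert reward == -1, reward
            break  # stop once a matching hand has been checked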
For case 2 (both bust): I've rechecked, and the current code actually handles this correctly.
This was my mistake. I was trying to extend the env to support double-down, and in my extension that case happened to reach this line:

    reward = cmp(score(self.player), score(self.dealer))

which produces a reward of 0, which is incorrect. The current code already handles a busted player in a separate path, before this line is reached.
Nevertheless, that line still seems a bit "dangerous", because it looks as though it handles every case when it does not.
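One possible shape for a more explicit comparison is sketched below. This is not the library's implementation and not a proposed patch: `explicit_reward()` and its `natural_bonus` parameter are hypothetical names, the hand helpers are local stand-ins assumed to mirror blackjack.py, and the rule that two naturals draw is an assumption taken from standard casino rules rather than from this thread.

```python
# Hypothetical sketch of an explicit reward; names and rules are assumptions.

def usable_ace(hand):
    # Assumption: an ace (1) may count as 11 if the hand stays at or below 21.
    return 1 in hand and sum(hand) + 10 <= 21


def sum_hand(hand):
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)


def is_bust(hand):
    return sum_hand(hand) > 21


def is_natural(hand):
    # A natural (blackjack) is exactly an ace plus a ten-valued card.
    return sorted(hand) == [1, 10]


def explicit_reward(player, dealer, natural_bonus=True):
    """Settle a finished hand, checking each outcome in priority order."""
    if is_bust(player):
        return -1.0  # a busted player loses, even if the dealer also busts
    player_nat, dealer_nat = is_natural(player), is_natural(dealer)
    if player_nat and dealer_nat:
        return 0.0  # assumed rule: two naturals draw
    if player_nat:
        return 1.5 if natural_bonus else 1.0  # natural beats any other 21
    if dealer_nat:
        return -1.0  # dealer natural beats a multi-card 21
    if is_bust(dealer):
        return 1.0
    p, d = sum_hand(player), sum_hand(dealer)
    return float(p > d) - float(p < d)


print(explicit_reward([1, 10], [7, 7, 7]))       # player BJ vs. dealer 21
print(explicit_reward([7, 7, 7], [1, 10]))       # player 21 vs. dealer BJ
print(explicit_reward([10, 9, 5], [10, 6, 10]))  # both bust
```

Ordering the checks this way means no single expression has to look as if it covers every outcome, which is the concern raised above.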
Could you make a PR with the suggested changes and tests for the relevant rules?