Farama-Foundation / Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)

Home Page: https://gymnasium.farama.org

[Bug Report] Calculating rewards for Blackjack toy-text env

peterhungh3 opened this issue

Describe the bug

In blackjack.py: reward = cmp(score(self.player), score(self.dealer))

Current rewards appear to be wrong in two edge cases:

  1. Player: Blackjack vs. Dealer: 21. Currently this returns a draw (reward = 0), while it should be 1 (because the player wins).
  2. Player and dealer both bust. Currently this returns a draw (reward = 0), while it should be -1 (the dealer wins).

Code example

In blackjack.py, inside the step() function:
reward = cmp(score(self.player), score(self.dealer))

And the definitions of score() and cmp():

def score(hand):  # What is the score of this hand (0 if bust)
    return 0 if is_bust(hand) else sum_hand(hand)

def cmp(a, b):
    return float(a > b) - float(a < b)
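
To see why this line treats both edge cases as a draw, here is a minimal, self-contained sketch (the extra helpers mirror the corresponding definitions in blackjack.py; the example hands are hypothetical, with cards listed as the env stores them, ace = 1):

def cmp(a, b):
    return float(a > b) - float(a < b)

def usable_ace(hand):  # does this hand have a usable ace?
    return 1 in hand and sum(hand) + 10 <= 21

def sum_hand(hand):  # hand total, counting a usable ace as 11
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)

def is_bust(hand):
    return sum_hand(hand) > 21

def score(hand):
    return 0 if is_bust(hand) else sum_hand(hand)

# Edge case 1: a natural blackjack [ace, 10] vs a dealer who draws to 21
print(cmp(score([1, 10]), score([7, 7, 7])))       # 0.0: scored as a draw, not a player win

# Edge case 2: both hands bust, so both score() calls return 0
print(cmp(score([10, 9, 5]), score([10, 6, 10])))  # 0.0: this line alone calls it a draw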

System info

  • Gymnasium, main branch
  • Python 3.9

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo

I'm surprised that no one has noticed this issue with the reward function, though admittedly it only shows up in edge cases.

What are your suggested changes to the reward, in terms of code?
Could you provide some testing code for this behaviour? (You can use known seeds and actions to test for particular outcomes.)
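
(For illustration, a minimal sketch of one way to search for a reproducible seed, assuming the standard reset(seed=...) API; the search bound and printout are arbitrary:)

from gymnasium.envs.toy_text.blackjack import BlackjackEnv

env = BlackjackEnv(natural=True)
for seed in range(10_000):  # arbitrary search bound
    obs, _ = env.reset(seed=seed)
    if obs[0] == 21:  # a two-card 21 straight off the deal: a player natural
        print(f"seed {seed} deals the player a natural: {env.player}")
        break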

@pseudo-rnd-thoughts

For case 1 (player 21 vs. dealer BJ, and vice versa): currently, the returned reward is 0 (a draw).

Example code to test player = BJ vs. dealer = 21:

from gymnasium.envs.toy_text.blackjack import BlackjackEnv, sum_hand

env = BlackjackEnv(natural=True)
while True:
    obs, _ = env.reset()
    if obs[0] == 21:  # player dealt a natural blackjack
        action = 0  # stick
        next_obs, reward, terminated, truncated, info = env.step(action)
        if sum_hand(env.dealer) == 21 and len(env.dealer) > 2:  # dealer drew to 21
            print(f"Player: {env.player} & dealer = {env.dealer}, "
                  f"reward = {reward}")
            assert reward == 1.5, reward
            break

Example code to test player = 21 vs. dealer = BJ:

env = BlackjackEnv(natural=True)
while True:
    obs, _ = env.reset()
    if sum_hand(env.player) == 21:  # skip a player natural; we want a drawn 21
        continue

    action = 1  # hit
    next_obs, reward, terminated, truncated, info = env.step(action)
    if (sum_hand(env.player) == 21 and len(env.player) > 2  # player drew to 21
            and sum_hand(env.dealer) == 21 and len(env.dealer) == 2):  # dealer BJ
        # stick so the hand is settled and a terminal reward is returned
        next_obs, reward, terminated, truncated, info = env.step(0)
        print(f"Player: {env.player} & dealer = {env.dealer}, "
              f"reward = {reward}")
        assert reward == -1, reward
        break

For case 2 (both the player and the dealer bust): I've rechecked, and the current code actually handles this.
This was my mistake: I was trying to extend the env to support double-down, and that case happened to enter this code path:
reward = cmp(score(self.player), score(self.dealer))
which would produce a reward of 0, which is incorrect. But the current code already handles the busted-player case in another path.
Nevertheless, the line above still seems a bit "dangerous", since it reads as if it could handle all cases on its own.
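
One possible shape for a more explicit settlement, as a sketch only (settle and natural_bonus are hypothetical names, not actual Gymnasium code; is_bust, sum_hand, and cmp are the env's helpers, and is_natural mirrors the two-card-21 check already used for the natural bonus):

def is_natural(hand):  # two-card 21
    return sorted(hand) == [1, 10]

def settle(player, dealer, natural_bonus=True):
    if is_bust(player):
        return -1.0  # a busted player loses, even if the dealer also busts
    if is_bust(dealer):
        return 1.0
    if is_natural(player) and not is_natural(dealer):
        return 1.5 if natural_bonus else 1.0  # a natural beats a drawn 21
    if is_natural(dealer) and not is_natural(player):
        return -1.0  # a dealer natural beats a drawn 21
    return cmp(sum_hand(player), sum_hand(dealer))  # natural vs natural is a draw here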

Could you make a PR with the suggested changes and tests for the relevant rules?