Farama-Foundation / Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)

Home Page: https://gymnasium.farama.org

[Bug Report] Calculating rewards for Blackjack toy-text env

peterhungh3 opened this issue

Describe the bug

In blackjack.py: reward = cmp(score(self.player), score(self.dealer))

Current rewards appear to be wrong in two edge cases:

  1. Player: Blackjack vs. Dealer: 21. Currently this returns a draw (reward = 0), while it should be 1 (because the player wins).
  2. Player and dealer both bust. Currently this returns a draw (reward = 0), while it should be -1 (the dealer wins).

Code example

In blackjack.py, inside the step() function:
reward = cmp(score(self.player), score(self.dealer))

And the definitions of score() and cmp():

def score(hand):  # What is the score of this hand (0 if bust)
    return 0 if is_bust(hand) else sum_hand(hand)

def cmp(a, b):
    return float(a > b) - float(a < b)
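
To see why this line treats both edge cases as a draw, here is a minimal, self-contained sketch (the extra helpers mirror the corresponding definitions in blackjack.py; the example hands are hypothetical, with cards listed as the env stores them, ace = 1):

def cmp(a, b):
    return float(a > b) - float(a < b)

def usable_ace(hand):  # does this hand have a usable ace?
    return 1 in hand and sum(hand) + 10 <= 21

def sum_hand(hand):  # hand total, counting a usable ace as 11
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)

def is_bust(hand):
    return sum_hand(hand) > 21

def score(hand):
    return 0 if is_bust(hand) else sum_hand(hand)

# Edge case 1: a natural blackjack [ace, 10] vs a dealer who draws to 21
print(cmp(score([1, 10]), score([7, 7, 7])))       # 0.0: scored as a draw, not a player win

# Edge case 2: both hands bust, so both score() calls return 0
print(cmp(score([10, 9, 5]), score([10, 6, 10])))  # 0.0: this line alone calls it a draw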

System info

  • Gymnasium, main branch
  • Python 3.9

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo

I'm surprised that no one has noticed this issue with the reward function, though admittedly it only shows up in edge cases.

What are your suggested changes to the reward, in terms of code?
Could you provide some testing code for this behaviour? (You can use known seeds and actions to test for particular outcomes.)
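
(For illustration, a minimal sketch of one way to search for a reproducible seed, assuming the standard reset(seed=...) API; the search bound and printout are arbitrary:)

from gymnasium.envs.toy_text.blackjack import BlackjackEnv

env = BlackjackEnv(natural=True)
for seed in range(10_000):  # arbitrary search bound
    obs, _ = env.reset(seed=seed)
    if obs[0] == 21:  # a two-card 21 straight off the deal: a player natural
        print(f"seed {seed} deals the player a natural: {env.player}")
        break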

@pseudo-rnd-thoughts

For case 1 (player 21 vs. dealer BJ, and vice versa): currently, the returned reward is 0 (a draw).

Example code to test player = BJ vs. dealer = 21:

from gymnasium.envs.toy_text.blackjack import BlackjackEnv, sum_hand

env = BlackjackEnv(natural=True)
while True:
    obs, _ = env.reset()
    if obs[0] == 21:  # player dealt a natural blackjack
        action = 0  # stick
        next_obs, reward, terminated, truncated, info = env.step(action)
        if sum_hand(env.dealer) == 21 and len(env.dealer) > 2:  # dealer drew to 21
            print(f"Player: {env.player} & dealer = {env.dealer}, "
                  f"reward = {reward}")
            assert reward == 1.5, reward
            break

Example code to test player = 21 vs. dealer = BJ:

env = BlackjackEnv(natural=True)
while True:
    obs, _ = env.reset()
    if sum_hand(env.player) == 21:  # skip a player natural; we want a drawn 21
        continue

    action = 1  # hit
    next_obs, reward, terminated, truncated, info = env.step(action)
    if (sum_hand(env.player) == 21 and len(env.player) > 2  # player drew to 21
            and sum_hand(env.dealer) == 21 and len(env.dealer) == 2):  # dealer BJ
        # stick so the hand is settled and a terminal reward is returned
        next_obs, reward, terminated, truncated, info = env.step(0)
        print(f"Player: {env.player} & dealer = {env.dealer}, "
              f"reward = {reward}")
        assert reward == -1, reward
        break

For case 2 (both the player and the dealer bust): I've rechecked, and the current code actually handles this.
This was my mistake: I was trying to extend the env to support double-down, and that case happened to enter this code path:
reward = cmp(score(self.player), score(self.dealer))
which would produce a reward of 0, which is incorrect. But the current code already handles the busted-player case in another path.
Nevertheless, the line above still seems a bit "dangerous", since it reads as if it could handle all cases on its own.
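
One possible shape for a more explicit settlement, as a sketch only (settle and natural_bonus are hypothetical names, not actual Gymnasium code; is_bust, sum_hand, and cmp are the env's helpers, and is_natural mirrors the two-card-21 check already used for the natural bonus):

def is_natural(hand):  # two-card 21
    return sorted(hand) == [1, 10]

def settle(player, dealer, natural_bonus=True):
    if is_bust(player):
        return -1.0  # a busted player loses, even if the dealer also busts
    if is_bust(dealer):
        return 1.0
    if is_natural(player) and not is_natural(dealer):
        return 1.5 if natural_bonus else 1.0  # a natural beats a drawn 21
    if is_natural(dealer) and not is_natural(player):
        return -1.0  # a dealer natural beats a drawn 21
    return cmp(sum_hand(player), sum_hand(dealer))  # natural vs natural is a draw here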

Could you make a PR with the suggested changes and tests for the relevant rules?