indylab / nxdo

Deep RL Code for XDO: A Double Oracle Algorithm for Extensive-Form Games

Home Page: https://arxiv.org/abs/2103.06426

Porting to No Limit Texas Hold 'em

TensorHusker opened this issue · comments

Hello. I was toying with this code recently and have found its performance impressive. I was curious about how it might perform on No Limit Texas Hold 'em. Could this code tractably scale from Leduc and Kuhn to this much larger game? If so, how difficult would such a port be from the codebase as written?

Hey there! We haven't attempted it yet, but our guess is that NXDO should scale well to No-Limit Texas Hold'em compared to other model-free methods.

Because of the large stack sizes, there is a huge number of relatively similar pure strategies. If we can get away with mixing among actions taken by a small subset of pure strategies throughout the game tree, NXDO should be able to create an easier-to-solve abstraction of the game with its extensive-form restricted game.

The high-level steps to set up NXDO with a new environment would be the following. Making some new debugging/hyperparam search scripts would likely be necessary:

  • Make a new RLlib MultiAgentEnv for No-Limit Texas Hold'em like the ones present in our codebase (a minimal interface sketch follows this list).

  • Use RLlib/Tune to find a good RL best-response algorithm and parameters for it (the choice of opponent for tuning, random or somewhat competent, may matter).
    My guess is that using PPO or SAC with a continuous action space, which the environment could then map to legal discrete actions, would do well since there are so many similar betting amounts (see the bet-mapping sketch after this list).
    You may also want a custom RLlib model to input the observations in a special way or to mask out invalid actions, etc.

  • Create a fixed population of policies to make a non-changing restricted game for debugging, potentially built with PSRO or multiple self-play runs. Use this debugging restricted game to find parameters for the extensive-form restricted game solver (NFSP) that let it converge, quickly if possible.
    You can also try out new metasolver params by just running the actual NXDO algorithm, but a hyperparam search done that way will be inefficient on a large game.

  • Create a new NXDOScenario instance to define an NXDO experiment: your env, RL BR algorithm choice/params, restricted game metasolver algorithm choice/params, and stopping conditions for the BRs and metasolvers.
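As a concrete starting point for the environment bullet, here's a minimal, hedged sketch of what a two-player hold'em-style MultiAgentEnv could look like under the older gym-style RLlib API this codebase targets. The class name, observation sizes, and the stubbed "game" (each player acts once and the hand ends) are placeholders, not the real nxdo environments; the point is only the dict-per-agent interface and the action_mask field a custom masking model would consume.

```python
# Hypothetical skeleton, NOT the real nxdo poker env: shows the gym-style
# RLlib MultiAgentEnv interface shape with an action-mask observation.
import numpy as np
from gym import spaces
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class NLHoldemMultiAgentEnv(MultiAgentEnv):
    def __init__(self, env_config=None):
        self.players = ["player_0", "player_1"]
        self.num_actions = 5  # e.g. fold, call, and a few bet sizes (placeholder)
        self.observation_space = spaces.Dict({
            "obs": spaces.Box(-1.0, 1.0, shape=(64,), dtype=np.float32),
            "action_mask": spaces.Box(0.0, 1.0, shape=(self.num_actions,), dtype=np.float32),
        })
        self.action_space = spaces.Discrete(self.num_actions)
        self._turn = 0

    def _obs_for(self, player):
        # Placeholder features; a real env would encode hole/board cards, pot,
        # stacks, and betting history, and zero out currently illegal actions.
        return {
            "obs": np.zeros(64, dtype=np.float32),
            "action_mask": np.ones(self.num_actions, dtype=np.float32),
        }

    def reset(self):
        self._turn = 0
        # Only the player currently to act receives an observation each step.
        return {self.players[0]: self._obs_for(self.players[0])}

    def step(self, action_dict):
        acting = self.players[self._turn % 2]
        self._turn += 1
        done = self._turn >= 2  # toy rule: the hand ends after both players act once
        rewards = {p: 0.0 for p in self.players} if done else {acting: 0.0}
        next_player = self.players[self._turn % 2]
        obs = {} if done else {next_player: self._obs_for(next_player)}
        return obs, rewards, {"__all__": done}, {}
```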
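And for the continuous-action idea in the second bullet, a tiny illustration: the policy outputs a value in [-1, 1] (a Box action space for PPO/SAC) and the environment snaps it to the nearest currently legal bet size. The helper name and the example bet list are made up for illustration.

```python
# Hypothetical helper: map a continuous policy output onto a legal discrete bet.
import numpy as np


def continuous_to_bet(action, legal_bets):
    """legal_bets: sorted chip amounts that are legal at this decision point."""
    legal_bets = np.asarray(legal_bets, dtype=np.float32)
    # Rescale the action from [-1, 1] onto the span of legal bet sizes...
    target = np.interp(action, [-1.0, 1.0], [legal_bets[0], legal_bets[-1]])
    # ...then snap it to the closest legal amount.
    return float(legal_bets[np.argmin(np.abs(legal_bets - target))])


# e.g. continuous_to_bet(0.2, [0, 100, 250, 600, 2000]) -> 600.0
```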

Sounds great. I may fork it and start doing this. A couple of things I was wondering about:

  • Seeing as this method scales much better than previous methods, would you say that training an agent for 6-9 players would be tractable? I believe this is the case, but I wanted to confirm.
  • Regarding player counts, I am a relative novice to RL, so I was wondering about something pertaining to multiplayer RL more generally. Can a trained model be applied to any number of players, or would the different state spaces prevent this? If that is the case, could a solution be to train it for an "upper bound" number of players and merely "ignore" missing players in the state space? I ask mainly because I was wondering how adaptable a single model would be, considering the variable (but capped) number of players in poker.
  • This last question is more related to the research than the code. This appears to be SOTA in equilibrium, mixed-strategy optimal play for poker. However, there is another line of research (code here) concerning opponent exploitation, which naturally deviates from minimally exploitable strategies to maximize win rate against flawed opponents. To what extent could such a system be incorporated into equilibrium-based approaches? Might there be ways to both retain an essentially unexploitable strategy while also dynamically deviating when opportunities arise?

Sorry for the late reply again. Addressing each point:

  • NXDO should scale well in games with large or continuous action spaces, where compressing (or abstracting) the game to only a subset of the available actions at each infostate still allows a strong mixed behavioral strategy in the full game. It's designed to find a Nash Equilibrium in two-player zero-sum games; solving general-sum, n-player games is currently out of scope for our work. Often in these n-player games there are multiple NE with no clear best one to choose, and finding a one-size-fits-all solution concept is an open research question. At the very least, you would have to swap out our default metasolver, NFSP, for something that arrives at a solution concept you like in the >2-player game. For PSRO, this work (Marris et al., 2021) https://arxiv.org/pdf/2106.09435.pdf addresses games with more than 2 players by optimizing for coarse correlated equilibrium, although that solution concept is only one of multiple plausible objectives.

  • In terms of a neural network input being compatible with a variable number of players, that's a question of how you format your environment's observation space. Specifically for poker, assuming you want to observe the complete game history, training with an upper-bound number of players might be a good option (a small sketch of that idea follows).
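A minimal sketch of the "upper bound" formatting idea: reserve observation slots for the maximum player count and flag empty seats, so the network input has a fixed shape regardless of how many players are actually dealt in. The constants and per-player feature layout here are placeholders, not a real poker encoding.

```python
# Hypothetical observation padding for a variable (but capped) player count.
import numpy as np

MAX_PLAYERS = 9
FEATS_PER_PLAYER = 4  # e.g. stack, current bet, folded?, all-in? (placeholder)


def pad_player_features(per_player_feats):
    """per_player_feats: one length-FEATS_PER_PLAYER array per seated player."""
    out = np.zeros((MAX_PLAYERS, FEATS_PER_PLAYER + 1), dtype=np.float32)
    for i, feats in enumerate(per_player_feats):
        out[i, 0] = 1.0      # "seat occupied" flag
        out[i, 1:] = feats   # that player's actual features
    return out.ravel()       # fixed-length flat vector regardless of player count
```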

  • NXDO should scale favorably for solving approximate Nash Equilibrium compared to other current model-free methods in certain high-dimensional games (see the first bullet point). Looking only at poker, the SOTA is probably model-based; I think it's along the lines of Brown and/or Sandholm's work, although there may be more recent work than this: https://arxiv.org/pdf/2007.13544.pdf, https://par.nsf.gov/servlets/purl/10119653 . On deviating from NE to better exploit an opponent: you can only safely do so if you have a prior belief that the opponent is suboptimal, since deviating from an NE strategy opens you up to being exploited yourself. For opponent modeling along those lines, this work and this work seem relevant.

Interesting. Thanks for the reply again. I still plan on experimenting with the repo some more, but I was wondering: could you tell me more about n-player games? I'm strong on the RL/AI side, but I'm still brushing up on my game theory. I remember Pluribus from a couple of years ago, and I found this paper by tracing advancements and uses of DeepCFR. At first I was hoping this paper could be used to improve it, but now I wonder how that could be done.

  • What kinds of methods are used in n-player imperfect information games? In particular, the types of games that relate to multiplayer Texas Hold 'em.
  • What kind of equilibria exist for n-player games in contrast to zero-sum two-player games?
  • Do you have any more literature on this area I could peruse?
  • Are there potential avenues for this work to be used to improve upon n-player imperfect information game learning? EDIT: I realize now that you discussed this in your reply. I suppose a better way of phrasing it is what does this work, overall, aim to create? My own read-through suggested to me that this is an overall framework that can be used to find certain equilibria more efficiently on larger games (particularly with larger action spaces). What it optimizes for is dictated by the solvers it is used with. Is this an overall correct way of describing the work at a bird's eye view? If so, I find this a helpful definition to help me delineate it from, say, the subgame solvers themselves, which helps refine my search and prototyping.
  • Combining JPSRO and their meta-game solver with NXDO could be interesting.

Hey! Been a bit. Getting back to you:

- What kind of equilibria exist for n-player games in contrast to zero-sum two-player games?

It's very much an open problem to find a universal solution concept for general-sum n-player games.

Nash Equilibria still exist for n-player games. However, unlike in 2-player zero-sum games, Nash Equilibrium strategies in n-player games aren't interchangeable in terms of expected payoff. This means that if a group of players who independently calculated their own NE strategies were to match up, their joint strategy would not necessarily be an NE, and performance could be arbitrarily bad. If you want to learn more, this is called the equilibrium selection problem.

Alternative, more general concepts that you might want to compute for an n-player game include Correlated Equilibrium and Coarse Correlated Equilibrium, in which a randomized device suggests which action profile players should adhere to each episode (a small illustration follows). The JPSRO work we mentioned solves for these; it's a useful solution concept, though not one that fits every use case.
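To make the "randomized device" concrete, here's a standard textbook illustration (not from this thread): checking that a mediator's distribution over joint actions in the game of Chicken is a correlated equilibrium, i.e. that no player prefers to deviate from their private recommendation.

```python
# Classic Chicken example: the mediator commits to a public distribution over
# joint actions and privately tells each player only their own recommendation.
A = ("dare", "swerve")
# payoff[(row_action, col_action)] = (row payoff, col payoff)
payoff = {("dare", "dare"): (0, 0), ("dare", "swerve"): (7, 2),
          ("swerve", "dare"): (2, 7), ("swerve", "swerve"): (6, 6)}
# The mediator's device: never recommend (dare, dare).
device = {("dare", "swerve"): 1 / 3, ("swerve", "dare"): 1 / 3,
          ("swerve", "swerve"): 1 / 3}


def is_correlated_eq(device, payoff):
    for player in (0, 1):
        for rec in A:
            # Joint outcomes in which this player is told to play `rec`.
            mass = [(prof, p) for prof, p in device.items() if prof[player] == rec]
            if not mass:
                continue
            # Compare obeying vs. deviating in (unnormalized) conditional
            # expectation; both sides share P(rec), so normalization cancels.
            follow = sum(p * payoff[prof][player] for prof, p in mass)
            for dev in A:
                deviate = sum(
                    p * payoff[tuple(dev if k == player else prof[k] for k in (0, 1))][player]
                    for prof, p in mass)
                if deviate > follow + 1e-9:
                    return False
    return True


print(is_correlated_eq(device, payoff))  # -> True
```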

- What kinds of methods are used in n-player imperfect information games? In particular, the types of games that relate to multiplayer Texas Hold 'em.

So aside from the heuristically well-performing CFR method that Pluribus used (all 6 players run counterfactual regret minimization) and (C)CE solvers, there aren't very many general methods that have useful guarantees for larger games. Self-play has no guarantees but can heuristically do well depending on the game.

- Do you have any more literature on this area I could peruse?

There's not a huge amount of literature that I'm aware of, as most recent progress has been limited to 2-player zero-sum games.
Another DeepMind work used a Fictitious Play-based approach to solve for an approximate coarse correlated equilibrium in 7-player No-Press Diplomacy (Anthony & Eccles et al., 2020) https://arxiv.org/pdf/2006.04635.pdf.

Two great game theory reference books that I'd recommend are:
Multiagent Systems, Shoham & Leyton-Brown
Game Theory, Fudenberg & Tirole

- Are there potential avenues for this work to be used to improve upon n-player imperfect information game learning? EDIT: I realize now that you discussed this in your reply. I suppose a better way of phrasing it is what does this work, overall, aim to create? My own read-through suggested to me that this is an overall framework that can be used to find certain equilibria more efficiently on larger games (particularly with larger action spaces). What it optimizes for is dictated by the solvers it is used with. Is this an overall correct way of describing the work at a bird's eye view? If so, I find this a helpful definition to help me delineate it from, say, the subgame solvers themselves, which helps refine my search and prototyping.

The main idea of our work is that PSRO (generally used for 2-player games) has a worst-case time complexity that's exponential in the number of game infostates. We fix this by using an extensive-form restricted game and metasolver instead of a normal-form one, which brings the worst-case time complexity down to linear in the number of infostates. This matters for large games, since we should scale better in the worst case. Technically you could use metasolvers that solve for something other than Nash Equilibrium, but that would be future work.

- Combining JPSRO and their meta-game solver with NXDO could be interesting.

Agreed, although you'd need an extensive-form (C)CE metasolver for XDO/NXDO. Might be a neat future direction!

I'm going to go ahead and close this issue assuming that's ok.