pbsinclair42 / MCTS

A simple package to allow users to run Monte Carlo Tree Search on any perfect information domain

Incorrectly exploring worst possible move for player 2?

esparano opened this issue

Hi! I am having some issues running this library, and I would like some clarification about how/why this implementation works.

It appears that, in this MCTS implementation, the getReward() function (https://github.com/pbsinclair42/MCTS/blob/master/naughtsandcrosses.py) does not vary based on the "perspective" of the player (i.e., which player is currently selecting a move). This means a reward of 1 always means a first-player win, while a reward of -1 means a second-player win. The backpropagation step sums this reward to get the totalReward for a node.

However, if you look at getBestChild() (https://github.com/pbsinclair42/MCTS/blob/master/mcts.py), the following lines are worrying to me:

    nodeValue = child.totalReward / child.numVisits + explorationValue * math.sqrt(
        2 * math.log(node.numVisits) / child.numVisits)
    if nodeValue > bestValue:

Doesn't this mean that, regardless of who the current player is, this function is choosing to explore the child node that is best for player 1 only? Where is the minimax aspect of this algorithm, where the reward is "flipped/inverted" based on which player is playing?

Usually, MCTS implementations will choose one of two strategies:

  1. Make the reward relative to the current player (so that +1 means the current player wins, not specifically player 1 or player 2), so that at each node the rollout's reward is added or subtracted when accumulating that node's totalReward, depending on which agent the node represents (sketched below).
  2. Keep the reward relative to player 1 (so +1 is always good for player 1), but let player 2 minimize that value instead of maximizing it.

I think that some people choose 1) because it's more general and can apply to games with 1, 2, 3+ players, whereas 2) is limited to two-player adversarial games.
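
For concreteness, here is a minimal sketch of strategy 1, assuming nodes store totalReward, numVisits, and a parent pointer as in this library (illustrative only, not the library's actual backpropagation code):

    # The simulation result is accumulated with an alternating sign, so each
    # node's totalReward ends up relative to the side that node represents.
    def backpropagate(node, reward):
        while node is not None:
            node.numVisits += 1
            node.totalReward += reward
            reward = -reward  # the parent node belongs to the other player
            node = node.parent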

I think that in this implementation everything is relative to the root-node player/action selector, and in particular to the final evaluation function. It assumes the final evaluation is able to perform all of the min/max's up to the point of final evaluation, and the top-level nodeValues will converge to the expected average final reward for that action for player 1, whatever their evaluation function might be. The eventual eval function could be looking for max/max (win/win) over multiple players instead of min/max (win/lose). So it is more of a general action-selection optimizer than something specifically for min/max games (though it can be used for them), and that happens to fit my use case.
One possible "solution" would be to embed a "player eval variable" in the state: either a toggle applied at each level (a float multiplied by -1 for each switch in sides, and made fractional for friends or friends of enemies), or a list of evaluation operations to apply to the final static evaluation to get something closer to your desired value.
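
As a very rough sketch of that toggle idea (all names here are illustrative, not the library's current API):

    # The state carries a perspective multiplier that flips on every switch in
    # sides; the final static evaluation is scaled by it before being returned.
    class ToggledEvalState:
        def __init__(self, currentPlayer=1, perspective=1.0):
            self.currentPlayer = currentPlayer
            self.perspective = perspective  # -1 per switch in sides, fractional for allies

        def nextTurn(self):
            return ToggledEvalState(-self.currentPlayer, -self.perspective)

        def getReward(self, staticEvaluation):
            return self.perspective * staticEvaluation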

That would make sense as a possible implementation (#1 in my examples), but I don't think that's what this code actually does. If you look at the backpropagate function, a single reward is added to every node, regardless of which agent is at play.

I agree with @esparano on this; there is definitely something wrong with the way it handles the reward and the current player. As it is, it always seems to optimize for player 1, so it plays correctly if it has the first move but intentionally tries to lose if it goes second.

Right now I'm looking into a fix for this, either by editing mcts.py or the getReward function in the naughts and crosses example. I think there's also a related issue with the default value of the exploration constant, which causes it to assume the opponent is going to make bad moves. For example, in the naughts and crosses game, the AI will often ignore blocking an opponent who is about to win in favor of setting up its own win later.

I'm thinking through some fixes for these and will likely create an issue soon. I'd also like to think about how this could be expanded to handle non-perfect-information games, like simple card games. Any help at that point would be appreciated.

Thanks all for the feedback; you're quite correct that there's an issue here as identified. The easy fix is the one in @harrdb12's pull request, although that does limit the usage to minimax games only. I've merged it for now to fix the immediate bug, but I'll also look into adapting the library to handle n-player games as well.
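
For reference, a two-player fix of that sort is roughly shaped like the following; this is a sketch assuming players are encoded as 1 and -1 (as in the naughts and crosses example), not necessarily the exact change in the pull request:

    import math

    # Scale the exploitation term by the player to move at the parent node, so
    # selection maximizes from that player's point of view; the exploration
    # term is unchanged.
    def uctValue(node, child, explorationValue):
        playerSign = node.state.currentPlayer  # assumed to be 1 or -1
        exploit = playerSign * child.totalReward / child.numVisits
        explore = explorationValue * math.sqrt(2 * math.log(node.numVisits) / child.numVisits)
        return exploit + explore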

@pbsinclair42 @harrdb12 As mentioned, the suggested fix only works for 2-player adversarial games. The more general approach for n-player games is to have a function (agent, node) -> reward. That way the agent could not only return a positive reward when agent.id == node.state.currentPlayer (or a negative one when they differ), but could also apply more advanced logic such as agent.team == node.state.currentPlayer.team, or even just always return a positive value for 1-player games.

Basically, it's up to the agent itself to determine whether the value of a node is "good" or "bad".
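
To make that concrete, the reward handed to backpropagation could be produced by something like the following (a sketch only; the agent fields and the sign logic are placeholders, not part of the library):

    # Convert a raw terminal reward into the given agent's perspective.
    # A team check (agent.team == node.state.currentPlayer.team) or an
    # always-positive rule for 1-player games would slot in the same place.
    def agentReward(agent, node, rawReward):
        if agent.id == node.state.currentPlayer:
            return rawReward
        return -rawReward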

I think a simple fix would be to modify

    def getReward(self):

to something like:

    def getReward(self, agentPerspective):

Edit: I have to think a bit more about the actual implementation, but this is generally how I see it done.
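
In the naughts and crosses example that could look something like this, assuming players are encoded as 1 and -1 and the state keeps its grid in self.board; winnerOf() is a hypothetical stand-in for the example's win-checking logic:

    def getReward(self, agentPerspective):
        winningPlayer = winnerOf(self.board)  # hypothetical: returns 1, -1, or 0 for a draw
        if winningPlayer == 0:
            return 0
        return 1 if winningPlayer == agentPerspective else -1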

Hey @esparano, thanks for the feedback. I'm tinkering with AI for another game right now (in my spare time), but I plan to come back and make some modifications to this repo once I've come up with some good ideas for improvements. I'll be sure to look into your suggestions for making it more general when I do.