sanderland / katrain

Improve your Baduk skills by training with KataGo!

calibration of current and new AIs

sanderland opened this issue · comments

Make an AI that mimics human play and is not just policy-based.

Hi, I have played against the Calibrated Rank AI in KaTrain version 1.2 six times now, at various strengths (10, 11, 12, and 14 kyu), and it seems quite good. I used to play against the "P-Pick" settings in version 1.1, but those would always outplay me in the middlegame and then go insane in the endgame. The Calibrated Rank AI is definitely more consistent: less dominant in the middlegame, less crazy in the endgame. More like a human.

The only thing that still feels funny or "artificial" to me now (but I am a weak player) is that during the fuseki the AI tenukis a whole lot. I know players like me have the opposite problem, failing to tenuki enough, but the AI really jumps around.

I am sure there will always be minor details and fine-tuning, but you have achieved a whole lot here. The Calibrated Rank AI represents, as nearly as I can tell, both a strong step forward and a pretty good approximation of what a "real" go game ought to feel like.

Yes, this is due to the pick-based algorithm inherently being a bit tenuki-happy, more so when there are fewer moves to choose from. I've played the 0-4 kyu ones a bit and they feel more balanced in that respect. Nevertheless, the AI sees the opening as having a large number of decent moves, as seen by it, e.g., opening at tengen in a teaching game.
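
For readers following along, a minimal sketch of the pick idea (illustrative, not KaTrain's actual ai.py): sample a limited number of board points uniformly each turn and play the policy's favourite among them. A fresh uniform sample every move is exactly what makes low-pick bots jump around the board.

```python
import numpy as np

# Sketch of a pick-based move selector (illustrative, not KaTrain's ai.py):
def pick_move(policy, n_picks, rng):
    """policy: 1D array of policy priors per point, -1 for illegal moves."""
    legal = np.flatnonzero(policy >= 0)          # KataGo marks illegal moves as -1
    picks = rng.choice(legal, size=min(n_picks, legal.size), replace=False)
    return picks[int(np.argmax(policy[picks]))]  # play the policy's favourite pick

# e.g. pick_move(policy, n_picks=8, rng=np.random.default_rng())
```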

Hi, could you point me to where in the program the code is that calls the "save an SGF file" operations? I would like to see if I can make the Save command write two SGF files -- one is the normal one, and one would be a specific one with just the details that I want, formatted in a particular way, that I can read into a different program.

@SimonLewis7407

  • game.py has write_sgf, which sets up some options/filenames and calls
  • sgf in sgf_parser.py, which has the actual parser and generator, and hooks into
  • sgf_properties in game_node.py, which modifies the comment to include/exclude info based on settings (a rough sketch of hooking in a second output file follows below).
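
A rough sketch of the two-file save being asked about. Only write_sgf in game.py comes from the list above; the node walk and the attribute names children, player, move, points_lost and score are assumptions for illustration, not verified KaTrain API:

```python
# Hypothetical wrapper: do the normal SGF save, then dump a second,
# pipe-delimited report of the main line. Attribute names are assumed.
def save_sgf_and_report(game, report_path):
    game.write_sgf()  # the normal SGF save (game.py, as described above)
    with open(report_path, "w") as f:
        f.write("move_nr|player|move|points_lost|score\n")
        node, move_nr = game.root, 0
        while node.children:  # follow the main line only
            node, move_nr = node.children[0], move_nr + 1
            f.write(f"{move_nr}|{node.player}|{node.move}|"
                    f"{node.points_lost}|{node.score}\n")
```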

I set it to the latest 20b for a day to see what happens; the stronger bots run away with their rank a bit. SGFs are in the sgf_20b directory. @bale-go

- 15b model
katrain-10k[11k]
katrain-14k[14k]
katrain-18k[17k]
katrain-2d[3d]
katrain-2k[1d]
katrain-6k[6k]

- 20b model, after one day
katrain-10k[10k]
katrain-14k[11k]
katrain-18k[17k]
katrain-2d[5d]
katrain-2k[2d]
katrain-6k[6k]

Interesting. It seems that as you get closer to the rank of the pure policy network, the correctness of the move rank estimation becomes really important.
If you settle on a larger NN model for KaTrain, I can recalibrate the calibrated rank bot.

With the 15b, bots with ranks weaker than 18k weren't stable at all, if I remember correctly. Maybe now it would be possible to have consistent bots for beginners, using the 20b net?

I do not think much would change in the weaker rank region.
The 15b model is already strong enough to be a perfect player there.

I think at higher strengths you have a lot more chance to hit the top move / near the top move several times in a row and play out a consistent strategy. The lower ones pick like 8 moves? You're kind of stuck choosing the 'least bad' move then, which the policy was definitely not trained for.

True, that's why I still think that if the bot was able to see the whole board with 1 visit per candidate move and chose one that would fit a given rank, that would be ideal.
But you said that wouldn't be practical, and since having stable bots starting from 17kyu or so is already awesome, I'll gladly forget about it and move on ^_^

I'm having a great time playing against Calibrated Rank. I modified the three files that you named, so that when I save an SGF, the normal one is saved and also a pipe-delimited text file that I can read into Excel. The text file shows the moves, points lost, score, etc. A couple of questions about that:

(a) In a game that I won, my average points lost per move was 0.79, and in another game (that I lost) it was 1.10. Don't those seem low? (My opponent was Calibrated Rank at 13 kyu, and we seem evenly matched currently. Shouldn't we be losing more points per move on average?)

(b) Another thing is just housekeeping. Within the text files, when you take a given score and add/subtract the impact of a player's move, you should get the next printed score, and so on. But the arithmetic on that is only approximate in the generated file; it fluctuates. And we're not just talking about rounding very small differences. Do you know what might be making the arithmetic from move to move less than exact?

yeah, mean point loss seems too low for those ranks. i'm not sure what you mean by (b)

Looking over the game where my mean point loss was just 0.79, it was a pretty "smooth" game. Sometimes at my level, though, there will be a big move that neither side notices for a long time, causing every move to lose several points until finally somebody plays the big move. In a case like that our mean point loss per move would be way higher. Maybe the 0.79 is partly justified because neither player had any large mess-ups like that.

Concerning (b), I didn't express it right, and now I see more clearly what's up. In the file game_node.py you have the SGF file show "Estimated point loss" for a move whenever "if sgf and points_lost > 0.5" is true. I kept that, but for my second output file I wanted an "Estimated point loss" no matter what, so I changed that condition to "if sgf and points_lost != 999.000".
There are two results of that. One is that the mean point loss will be understated compared to regular KaTrain, since there are some negative point losses thrown in. The other is that the arithmetic from move to move (current estimated score, minus the current move's point loss, should always equal the next estimated score) will no longer be perfect, just close.
I found that around 20% of moves had a small "negative" point loss.
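
For reference, the two variants side by side (the first is quoted from game_node.py above; the sentinel comparison in the second is effectively always true):

```python
# Original condition in game_node.py (quoted above): only annotate moves
# that clearly lose points.
if sgf and points_lost > 0.5:
    ...  # append "Estimated point loss" to the SGF comment

# The modified condition compares points_lost against a sentinel that never
# occurs, so it is effectively unconditional; the same effect is simply:
if sgf:
    ...  # always append the point loss, small negative values included
```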

Not sure if any of this is of interest or helpful, but I'm mentioning it just in case there's something useful there.

I recalibrated the calibrated rank bot to the final 20b model. At least 30 games were played at each rank.
At ranks weaker than 3 kyu, the number of moves seen by KataGo did not change between the 15b and 20b models.
At ranks stronger than 3 kyu, fewer moves were sufficient for the stronger policy net to play an even game with various pachi bots.

Even games were reached against different bots at the following total numbers of moves seen by the KataGo 20b policy net:
GnuGo 3.8 at level 10 (8 kyu): 30
pachi -t =5000 --nodcnn (3 kyu): 66
pachi -t =5000:15000 (3 dan): 100
pachi -t =5000:15000 with 2 handicap stones and 0.5 komi (ca. 5 dan): 118

Blue line/points are 15b, magenta line/points are 20b.
[image: 15b_20b_model]

The calibration for 20b can be divided into two regions:

kyu rank > 1k: int(round(10 ** (-0.05737 * kyu_rank + 1.9482)))
kyu rank < 1k: int(round(10 ** (-0.03585 * kyu_rank + 1.9284)))
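
In code (a sketch; the dan-to-kyu mapping kyu_rank = 1 - dan is inferred from the pachi anchors above, not stated explicitly in the thread):

```python
def moves_seen_20b(kyu_rank):
    """20b calibration: number of moves the policy net gets to see.
    kyu_rank goes negative for dan ranks (inferred: 1d = 0, 3d = -2, 5d = -4).
    """
    if kyu_rank > 1:
        return int(round(10 ** (-0.05737 * kyu_rank + 1.9482)))
    return int(round(10 ** (-0.03585 * kyu_rank + 1.9284)))

# Sanity check against the even-game anchors above (the fit smooths the data):
# moves_seen_20b(8)  -> 31   (GnuGo level 10, ~8 kyu: 30)
# moves_seen_20b(3)  -> 60   (pachi --nodcnn, ~3 kyu: 66)
# moves_seen_20b(-2) -> 100  (pachi, ~3 dan: 100)
# moves_seen_20b(-4) -> 118  (pachi + 2 stones, ~5 dan: 118)
```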

I seriously doubt you can divide a fit on 4 points of noisy data into 2 regions. Happy to see it reasonably consistent though.
Could you try some runs down to 15k, or does GnuGo not go that low?

I assumed that at lower ranks the calibration stays the same as for 15b.
GnuGo does not get much weaker than that, unless we use large handicaps.

But there is a more pressing concern.
I plotted the outlier-free mean of move ranks vs. the number of legal moves on the board, for the calibrated rank bot and for users.
It seems that what works at lower ranks might not work for stronger players.
Namely, at stronger levels the outlier-free mean of move ranks does not decrease over the game as much (or at all).
[image: move_rank_bots_vs_users]

I'm working on a model to take this into account, both for the next calibrated rank bot and the rank estimation.

I used the user data to estimate the change in the outlier-free move rank over the game.
I used the regression that had the lowest AIC value.
[image: move_rank_bots_vs_users_new_calib]

The new bot uses the overall kyu rank estimation from the tested calibrated rank bot (the first part of the equation), but I modified the shape of the curve to mimic human play better (the factor starting with 0.3116):
(0.0630149 + 0.762399 * board_size/(10**(-0.05737*kyu_rank+1.9482))) * (0.31164467+0.55726218*(n_legal_moves/board_size)*np.exp(-1*(3.0308747*(n_legal_moves/board_size)*(n_legal_moves/board_size)-(n_legal_moves/board_size)-0.045792218*kyu_rank-0.31164467)**2)-0.0064860256*kyu_rank)
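
The same expression with helper variables, purely as a readability aid (names are mine, not from ai.py; board_size is taken to be the number of intersections, 361 on 19x19):

```python
import numpy as np

def target_mean_move_rank(kyu_rank, n_legal_moves, board_size=361):
    """Restatement of the expression above with helper names (illustrative)."""
    # Overall strength: moves seen by the policy net, from the 20b calibration.
    n_moves_seen = 10 ** (-0.05737 * kyu_rank + 1.9482)
    overall = 0.0630149 + 0.762399 * board_size / n_moves_seen
    # Shape over the course of the game, fitted to user data.
    x = n_legal_moves / board_size  # fraction of the board still open
    bump = np.exp(-(3.0308747 * x * x - x - 0.045792218 * kyu_rank
                    - 0.31164467) ** 2)
    shape = 0.31164467 + 0.55726218 * x * bump - 0.0064860256 * kyu_rank
    return overall * shape
```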

The equation should scale to various board sizes, although it was only tested in 19x19.
The new bot should be relatively stronger in the opening, and it does not become superhuman by the endgame (the previous calibrated rank bot at 2 kyu did not make any mistakes with 100 legal moves left).
I played against it a few times and it feels much more balanced, but I'm afraid I am biased :)
I created a PR with the user data based AI.

How do you relate the outlier-free mean to the n_moves + override that you give the bot?

Outlier-free mean: OFM
Number of moves seen by katago: NMSK
Number of legal moves on board: NLMB

OFM = 0.0630149 + 0.762399 * NLMB/NMSK
or
NMSK = NLMB/(1.31165*OFM - 0.08265)

If you are interested, I can upload the data (100000 runs for each NMSK, NLMB pairs to get the OFM) I used for the symbolic regression.
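
Or as code (a sketch; the function name is mine):

```python
def n_moves_seen(ofm, n_legal_moves):
    # Inverts OFM = 0.0630149 + 0.762399 * NLMB / NMSK for NMSK;
    # 1.31165 = 1 / 0.762399 and 0.08265 = 0.0630149 / 0.762399.
    return n_legal_moves / (1.31165 * ofm - 0.08265)
```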

I don't understand NMSK, but as long as you took this into account, let's give it a spin on OGS

May I ask what "outlier free mean" means? Does it mean the mean, ignoring a certain arbitrary percent of the highest and lowest values? Or does it ignore any values that are more than, say, 3 standard deviations from the mean? Something like that?

@sanderland In other words, the p:pick algorithm sees only NMSK moves out of the total NLMB.
For example, an 8k p:pick sees only 30 moves (NMSK = 30);
a 3k p:pick sees ca. 60 moves (NMSK = 60).

@SimonLewis7407 It ignores the best and worst 20%
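
That is, a 20% trimmed mean; in code (a sketch using scipy, not the KaTrain source):

```python
import numpy as np
from scipy import stats

move_ranks = np.array([1, 1, 2, 3, 4, 8, 40])  # example move ranks
# Drop the lowest and highest 20% of values, then average the rest:
ofm = stats.trim_mean(move_ranks, 0.2)
```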

@SimonLewis7407 it's some compromise between mean and median that @bale-go likes to use. I think just using the median could be better, insofar as it's not a new invention and is very close to this anyway.
@bale-go The equations you are generating are getting a bit too long and complex for my liking; it's very hard to see what the asymptotic behaviour is, or whether there's a divide-by-zero waiting to happen. I'd appreciate an attempt to simplify the equation to be more human-readable/understandable. If that's hard, at least introduce a helper var or two and cut some insignificant digits.
Regardless, the last equation is on OGS now, we'll see what it does :)

Yes, median was my first choice too. The problem with it is that for integer move ranks it can only take integer (or half-integer) values. At higher ranks there is a significant difference between a mean move rank of 2.6 and 3.4, but the median would give 3.0 for both.

The complexity has definitely increased. I checked the equations with random sampling of the input variables and they behaved well.
The main issue I see now is with the rank estimation at the end of the game. As you predicted earlier:

yep, and in endgame the number of moves is small and the best one can be really obvious, so calling someone a 9d over doing hane-connect a few times is also tricky.

A lot of games end with >5d rank estimates due to that.

I suggest we cap our rank estimate at 4d and just show a '4d+' label if the top end is that.
also my usual "black -l 120" (the Black formatter at line length 120) did some fun things to your new equation in ai.py ;)

I made the equation cleaner.
Also, I had to remove the obvious moves from moves.csv since they distorted the mean move rank.
[image: move_rank_bots_vs_users2]

That looks great!

A self-play tournament shows the pure policy net is a lot stronger than the 4d calibrated rank bot, probably since bots can't exploit each other's lack of reading the way humans can.
[image: kyu rank vs. Elo; the first point is the pure policy net, the others are calibrated rank bots]

Nice!
The linearity of the plot (except for the first point) is really convincing.

All released - the strength models for all the AIs are a bit rushed, so probably there's quite a bit of room for improvement.

@sanderland where in the code are you able to prevent the calibrated bot from making illegal moves for the selected rule set. The mobile version is great, but being unable to exclude illegal moves for non tromp taylor rulesets from the pool of possible moves might be making it just a hair weaker than your version. Thanks!

the policy should be -1 for illegal moves.
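
So restricting any candidate pool to legal moves is a single mask (a sketch assuming the policy arrives as a 1D numpy array over board points):

```python
import numpy as np

# policy: assumed 1D array of KataGo policy values per point (-1 = illegal)
def legal_candidates(policy):
    return np.flatnonzero(policy >= 0)
```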

simplified still needs this, but compute is too low to do it