sanderland / katrain

Improve your Baduk skills by training with KataGo!

calibration of current and new AIs

sanderland opened this issue · comments

Make an AI that mimics human play and is not just policy-based.

Hi, I have played against the Calibrated Rank AI in KaTrain version 1.2 six times now, at various strengths (10, 11, 12, and 14 kyu), and it seems quite good. I used to play against the "P-Pick" settings in version 1.1, but those would always outplay me in the middlegame and then go insane in the endgame. The Calibrated Rank AI is definitely more consistent: less dominant in the middlegame, less crazy in the endgame. More like a human.

The only thing that still feels funny or "artificial" to me now (but I am a weak player) is that during the fuseki the AI tenukis a whole lot. I know players like me have the opposite problem, failing to tenuki enough, but the AI really jumps around.

I am sure there will always be minor details and fine-tuning, but you have achieved a whole lot here. The Calibrated Rank AI represents, as nearly as I can tell, both a strong step forward and a pretty good approximation of what a "real" go game ought to feel like.

Yes, this is due to the pick-based algorithm inherently being a bit tenuki-happy, more so when there are fewer moves to choose from. I've played the 0-4 kyu ones a bit and they feel more balanced in that respect. Nevertheless, the AI sees the opening as having a large number of decent moves, as seen by it, e.g., opening at tengen in a teaching game.
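
For readers following along, a minimal sketch of the pick idea (illustrative, not KaTrain's actual ai.py): sample a limited number of board points uniformly each turn and play the policy's favourite among them. A fresh uniform sample every move is exactly what makes low-pick bots jump around the board.

```python
import numpy as np

# Sketch of a pick-based move selector (illustrative, not KaTrain's ai.py):
def pick_move(policy, n_picks, rng):
    """policy: 1D array of policy priors per point, -1 for illegal moves."""
    legal = np.flatnonzero(policy >= 0)          # KataGo marks illegal moves as -1
    picks = rng.choice(legal, size=min(n_picks, legal.size), replace=False)
    return picks[int(np.argmax(policy[picks]))]  # play the policy's favourite pick

# e.g. pick_move(policy, n_picks=8, rng=np.random.default_rng())
```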

Hi, could you point me to where in the program the code is that calls the "save an SGF file" operations? I would like to see if I can make the Save command write two SGF files -- one is the normal one, and one would be a specific one with just the details that I want, formatted in a particular way, that I can read into a different program.

@SimonLewis7407

  • game.py has write_sgf, which sets up some options/filenames and calls
  • sgf in sgf_parser.py, which has the actual parser and generator, and hooks into
  • sgf_properties in game_node.py, which modifies the comment to include/exclude info based on settings (a rough sketch of hooking in a second output file follows below).
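
A rough sketch of the two-file save being asked about. Only write_sgf in game.py comes from the list above; the node walk and the attribute names children, player, move, points_lost and score are assumptions for illustration, not verified KaTrain API:

```python
# Hypothetical wrapper: do the normal SGF save, then dump a second,
# pipe-delimited report of the main line. Attribute names are assumed.
def save_sgf_and_report(game, report_path):
    game.write_sgf()  # the normal SGF save (game.py, as described above)
    with open(report_path, "w") as f:
        f.write("move_nr|player|move|points_lost|score\n")
        node, move_nr = game.root, 0
        while node.children:  # follow the main line only
            node, move_nr = node.children[0], move_nr + 1
            f.write(f"{move_nr}|{node.player}|{node.move}|"
                    f"{node.points_lost}|{node.score}\n")
```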

I set it to the latest 20b for a day to see what happens; the stronger bots run away with their rank a bit. SGFs are in the sgf_20b directory. @bale-go

- 15b model
katrain-10k[11k]
katrain-14k[14k]
katrain-18k[17k]
katrain-2d[3d]
katrain-2k[1d]
katrain-6k[6k]

- 20b model, after one day
katrain-10k[10k]
katrain-14k[11k]
katrain-18k[17k]
katrain-2d[5d]
katrain-2k[2d]
katrain-6k[6k]

Interesting. It seems that as you get closer to the rank of the pure policy network, the correctness of the move rank estimation becomes really important.
If you settle on a larger NN model for KaTrain, I can recalibrate the calibrated rank bot.

With the 15b, bots with ranks weaker than 18k weren't stable at all, if I remember correctly. Maybe now it would be possible to have consistent bots for beginners, using the 20b net?

I do not think much would change in the weaker rank region.
The 15b model is already strong enough to be a perfect player there.

I think at higher strengths you have a lot more chance to hit the top move / near the top move several times in a row and play out a consistent strategy. The lower ones pick like 8 moves? You're kind of stuck choosing the 'least bad' move then, which the policy was definitely not trained for.

True, that's why I still think that if the bot was able to see the whole board with 1 visit per candidate move and chose one that would fit a given rank, that would be ideal.
But you said that wouldn't be practical, and since having stable bots starting from 17kyu or so is already awesome, I'll gladly forget about it and move on ^_^

I'm having a great time playing against Calibrated Rank. I modified the three files that you named, so that when I save an SGF, the normal one is saved and also a pipe-delimited text file that I can read into Excel. The text file shows the moves, points lost, score, etc. A couple of questions about that:

(a) In a game that I won, my average points lost per move was 0.79, and in another game (that I lost) it was 1.10. Don't those seem low? (My opponent was Calibrated Rank at 13 kyu, and we seem evenly matched currently. Shouldn't we be losing more points per move on average?)

(b) Another thing is just housekeeping. Within the text files, when you take a given score and add/subtract the impact of a player's move, you should get the next printed score, and so on. But the arithmetic on that is only approximate in the generated file; it fluctuates. And we're not just talking about rounding very small differences. Do you know what might be making the arithmetic from move to move less than exact?

yeah, mean point loss seems too low for those ranks. i'm not sure what you mean by (b)

Looking over the game where my mean point loss was just 0.79, it was a pretty "smooth" game. Sometimes at my level, though, there will be a big move that neither side notices for a long time, causing every move to lose several points until finally somebody plays the big move. In a case like that our mean point loss per move would be way higher. Maybe the 0.79 is partly justified because neither player had any large mess-ups like that.

Concerning (b), I didn't express it right, and now I see more clearly what's up. In the file game_node.py you have the SGF file show "Estimated point loss" for a move whenever "if sgf and points_lost > 0.5" is true. I kept that, but for my second output file I wanted an "Estimated point loss" no matter what, so I changed that condition to "if sgf and points_lost != 999.000".
There are two results of that. One is that the mean point loss will be understated compared to regular KaTrain, since there are some negative point losses thrown in. The other is that the arithmetic from move to move (current estimated score, minus the current move's point loss, should always equal the next estimated score) will no longer be perfect, just close.
I found that around 20% of moves had a small "negative" point loss.
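
For reference, the two variants side by side (the first is quoted from game_node.py above; the sentinel comparison in the second is effectively always true):

```python
# Original condition in game_node.py (quoted above): only annotate moves
# that clearly lose points.
if sgf and points_lost > 0.5:
    ...  # append "Estimated point loss" to the SGF comment

# The modified condition compares points_lost against a sentinel that never
# occurs, so it is effectively unconditional; the same effect is simply:
if sgf:
    ...  # always append the point loss, small negative values included
```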

Not sure if any of this is of interest or helpful, but I'm mentioning it just in case there's something useful there.

I recalibrated the calibrated rank bot to the final 20b model. At least 30 games were played at each rank.
At ranks weaker than 3 kyu, the number of moves seen by KataGo did not change between the 15b and 20b models.
At ranks stronger than 3 kyu, fewer moves were sufficient for the stronger policy net to play an even game with various pachi bots.

Even games were reached against different bots at the following total numbers of moves seen by the KataGo 20b policy net:
GnuGo 3.8 at level 10 (8 kyu): 30
pachi -t =5000 --nodcnn (3 kyu): 66
pachi -t =5000:15000 (3 dan): 100
pachi -t =5000:15000 with 2 handicap stones and 0.5 komi (ca. 5 dan): 118

Blue line/points are 15b, magenta line/points are 20b.
[image: 15b_20b_model]

The calibration for 20b can be divided into two regions:

kyu rank > 1k: int(round(10 ** (-0.05737 * kyu_rank + 1.9482)))
kyu rank < 1k: int(round(10 ** (-0.03585 * kyu_rank + 1.9284)))
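
In code (a sketch; the dan-to-kyu mapping kyu_rank = 1 - dan is inferred from the pachi anchors above, not stated explicitly in the thread):

```python
def moves_seen_20b(kyu_rank):
    """20b calibration: number of moves the policy net gets to see.
    kyu_rank goes negative for dan ranks (inferred: 1d = 0, 3d = -2, 5d = -4).
    """
    if kyu_rank > 1:
        return int(round(10 ** (-0.05737 * kyu_rank + 1.9482)))
    return int(round(10 ** (-0.03585 * kyu_rank + 1.9284)))

# Sanity check against the even-game anchors above (the fit smooths the data):
# moves_seen_20b(8)  -> 31   (GnuGo level 10, ~8 kyu: 30)
# moves_seen_20b(3)  -> 60   (pachi --nodcnn, ~3 kyu: 66)
# moves_seen_20b(-2) -> 100  (pachi, ~3 dan: 100)
# moves_seen_20b(-4) -> 118  (pachi + 2 stones, ~5 dan: 118)
```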

I seriously doubt you can divide a fit on 4 points of noisy data into 2 regions. Happy to see it reasonably consistent though.
Could you try some runs down to 15k, or does GnuGo not go that low?

I assumed that at lower ranks the calibration stays the same as for 15b.
GnuGo does not get much weaker than that, unless we use large handicaps.

But there is a more pressing concern.
I plotted the outlier-free mean of move ranks vs. the number of legal moves on the board, for the calibrated rank bot and for users.
It seems that what works at lower ranks might not work for stronger players.
Namely, at stronger levels the outlier-free mean of move ranks does not decrease over the game as much (or at all).
[image: move_rank_bots_vs_users]

I'm working on a model to take this into account, both for the next calibrated rank bot and the rank estimation.

I used the user data to estimate the change in the outlier-free move rank over the game.
I used the regression that had the lowest AIC value.
[image: move_rank_bots_vs_users_new_calib]

The new bot uses the overall kyu rank estimation from the tested calibrated rank bot (the first part of the equation), but I modified the shape of the curve to mimic human play better (the factor starting with 0.3116):
(0.0630149 + 0.762399 * board_size/(10**(-0.05737*kyu_rank+1.9482))) * (0.31164467+0.55726218*(n_legal_moves/board_size)*np.exp(-1*(3.0308747*(n_legal_moves/board_size)*(n_legal_moves/board_size)-(n_legal_moves/board_size)-0.045792218*kyu_rank-0.31164467)**2)-0.0064860256*kyu_rank)
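
The same expression with helper variables, purely as a readability aid (names are mine, not from ai.py; board_size is taken to be the number of intersections, 361 on 19x19):

```python
import numpy as np

def target_mean_move_rank(kyu_rank, n_legal_moves, board_size=361):
    """Restatement of the expression above with helper names (illustrative)."""
    # Overall strength: moves seen by the policy net, from the 20b calibration.
    n_moves_seen = 10 ** (-0.05737 * kyu_rank + 1.9482)
    overall = 0.0630149 + 0.762399 * board_size / n_moves_seen
    # Shape over the course of the game, fitted to user data.
    x = n_legal_moves / board_size  # fraction of the board still open
    bump = np.exp(-(3.0308747 * x * x - x - 0.045792218 * kyu_rank
                    - 0.31164467) ** 2)
    shape = 0.31164467 + 0.55726218 * x * bump - 0.0064860256 * kyu_rank
    return overall * shape
```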

The equation should scale to various board sizes, although it was only tested in 19x19.
The new bot should be relatively stronger in the opening, and it does not become superhuman by the endgame (the previous calibrated rank bot at 2 kyu did not make any mistakes with 100 legal moves left).
I played against it a few times and it feels much more balanced, but I'm afraid I am biased :)
I created a PR with the user data based AI.

How do you relate the outlier-free mean to the n_moves + override that you give the bot?

Outlier-free mean: OFM
Number of moves seen by katago: NMSK
Number of legal moves on board: NLMB

OFM = 0.0630149 + 0.762399 * NLMB/NMSK
or
NMSK = NLMB/(1.31165*OFM - 0.08265)

If you are interested, I can upload the data (100000 runs for each NMSK, NLMB pairs to get the OFM) I used for the symbolic regression.
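
Or as code (a sketch; the function name is mine):

```python
def n_moves_seen(ofm, n_legal_moves):
    # Inverts OFM = 0.0630149 + 0.762399 * NLMB / NMSK for NMSK;
    # 1.31165 = 1 / 0.762399 and 0.08265 = 0.0630149 / 0.762399.
    return n_legal_moves / (1.31165 * ofm - 0.08265)
```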

I don't understand NMSK, but as long as you took this into account, let's give it a spin on OGS

May I ask what "outlier free mean" means? Does it mean the mean, ignoring a certain arbitrary percent of the highest and lowest values? Or does it ignore any values that are more than, say, 3 standard deviations from the mean? Something like that?

@sanderland In other words, the p:pick algorithm sees only NMSK moves out of the total NLMB.
For example, an 8k p:pick sees only 30 moves (NMSK = 30);
a 3k p:pick sees ca. 60 moves (NMSK = 60).

@SimonLewis7407 It ignores the best and worst 20%
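
That is, a 20% trimmed mean; in code (a sketch using scipy, not the KaTrain source):

```python
import numpy as np
from scipy import stats

move_ranks = np.array([1, 1, 2, 3, 4, 8, 40])  # example move ranks
# Drop the lowest and highest 20% of values, then average the rest:
ofm = stats.trim_mean(move_ranks, 0.2)
```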

@SimonLewis7407 it's some compromise between mean and median that @bale-go likes to use. I think just using the median could be better, insofar as it's not a new invention and is very close to this anyway.
@bale-go The equations you are generating are getting a bit too long and complex for my liking; it's very hard to see what the asymptotic behaviour is, or whether there's a divide-by-zero waiting to happen. I'd appreciate an attempt to simplify the equation to be more human-readable/understandable. If that's hard, at least introduce a helper var or two and cut some insignificant digits.
Regardless, the last equation is on OGS now, we'll see what it does :)

Yes, median was my first choice too. The problem with it is that for integer move ranks it can only take integer (or half-integer) values. At higher ranks there is a significant difference between a mean move rank of 2.6 and 3.4, but the median would give 3.0 for both.

The complexity has definitely increased. I checked the equations with random sampling of the input variables and they behaved well.
The main issue I see now is with the rank estimation at the end of the game. As you predicted earlier:

yep, and in endgame the number of moves is small and the best one can be really obvious, so calling someone a 9d over doing hane-connect a few times is also tricky.

A lot of games end with >5d rank estimates due to that.

I suggest we cap our rank estimate at 4d and just show a '4d+' label if the top end is that.
also my usual "black -l 120" (the Black formatter at line length 120) did some fun things to your new equation in ai.py ;)

I made the equation cleaner.
Also, I had to remove the obvious moves from moves.csv since they distorted the mean move rank.
[image: move_rank_bots_vs_users2]

That looks great!

A self-play tournament shows the pure policy net is a lot stronger than the 4d calibrated rank bot, probably since bots can't exploit each other's lack of reading the way humans can.
[image: kyu rank vs. Elo; the first point is the pure policy net, the others are calibrated rank bots]

Nice!
The linearity of the plot (except for the first point) is really convincing.

All released - the strength models for all the AIs are a bit rushed, so probably there's quite a bit of room for improvement.

@sanderland where in the code are you able to prevent the calibrated bot from making illegal moves for the selected rule set. The mobile version is great, but being unable to exclude illegal moves for non tromp taylor rulesets from the pool of possible moves might be making it just a hair weaker than your version. Thanks!

the policy should be -1 for illegal moves.
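
So restricting any candidate pool to legal moves is a single mask (a sketch assuming the policy arrives as a 1D numpy array over board points):

```python
import numpy as np

# policy: assumed 1D array of KataGo policy values per point (-1 = illegal)
def legal_candidates(policy):
    return np.flatnonzero(policy >= 0)
```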

simplified still needs this, but compute is too low to do it