leela-zero / leela-zero

Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.


A new way to produce stronger weight?

2532796145 opened this issue · comments

We tried to combine two or three strong weight files by simply "adding them together":
We picked 257aeeb8 (the strongest one so far on http://zero.sjeng.org/ ) and some other weight files which won over 40% against 257aeeb8 in SPRT. We made some "hybrid" weight files by linear superposition: 0.5*weight1 + 0.5*weight2; 0.25*weight1 + 0.25*weight2 + 0.5*weight3; and so on. Surprisingly, we got several weight files much stronger than 257aeeb8. Here are two of them. Both of these "hybrid" files win ~70% of their matches against 257aeeb8 (1600 playouts).

weight1.zip
weight3.zip

Now LZ-halfblood-W1-P1600 and LZ-halfblood-W3-p1600 are being tested on CGOS.


interesting

The match between weight1 and 257aeeb8 is ongoing, and weight1 currently leads 27-16

So can it be concluded that the learning rate is too high for now?

Maybe it's related to something like "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" -- https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf

This is a good idea. Can you share the source code for how you do this?
We are working on mixing the weights in a more "scientific" way, in which the weights in each layer will be diagonalized to generate eigenvalues and eigenvectors. We will then generate symmetrized weights in conformity with the inherent symmetry of the game. Next, we will explore the "entanglement" between the layers and see if we can find a better mixing formula.

I think the method is very simple, something like: wf_mix = (wf1 + wf2) / 2

@larrywang30092 I'm sure the code will definitely disappoint you. It's far simpler than you may think.

# Average two Leela Zero weight files line by line.
# The first line is the format version ('1'); the remaining 66 lines
# (for the network size used here) each hold one layer's weights.
w1 = open('oldWeight1.txt', 'rt')
w2 = open('oldWeight2.txt', 'rt')
weight = open('newWeight.txt', 'wt')

n = 0
while n < 67:
    v1 = [float(x) for x in w1.readline().split()]
    v2 = [float(x) for x in w2.readline().split()]
    if n == 0:
        weight.write('1')                 # keep the version line unchanged
    else:
        weight.write('\n')
        for i, x in enumerate(v1):
            weight.write(f'{(x + v2[i]) / 2} ')   # element-wise average
    n += 1
print('Finished.')
w1.close()
w2.close()
weight.close()

Thank you, @MingWR

I think the actual effect of averaging the two weights is reducing the noise. Maybe we introduced too much noise in self play games?

I don't think noise is the problem. I think that each of the current strong weight files has particular weak points, where the policy priors don't include the correct response in some situations, and those situations are different for each weight file. The averaging of two similarly strong networks would make sure that the resulting policy priors include the correct response to each position where one of the networks knows the correct answer. MCTS will then sort out the correct response in each situation, as long as the policy net ensures it is searched. In this way, the combination can reasonably be greater than the sum of its parts.

@gcp: If the procedure introduced here is a type of regularisation (and is shown to produce stronger weight files), would it not make sense to try adjusting the regularisation term in tfprocess.py?

Interesting. I already introduced that concept a few days ago ( #794 ).
It's easy to write the code, even if you're not familiar with deep learning technologies,
because every network weight file is a plain-text file, and all of its contents are network weights except for the first value (which is the version number '1').

But I've never tried to check its strength the way you did.
I only checked that it seems to hold up reasonably well against Zen7 9 dan.
Very interesting experiment result.

And please note that you can't merge different-sized (e.g. 5-block vs 6-block) network weights.
I also tried using net2net to make a network with the same number of blocks and merge a new one, but no luck; it didn't work correctly. IMHO that's because its weights came from a differently structured network.
So you can't merge different-sized network weights, even if you use net2net.

Instead of validating via games, you could validate by testing prediction accuracy on pro games. That should be a lot faster for determining which networks are 'good' vs 'bad'. Then only do play testing on the good networks.

@wpstmxhs When I started merging two weights, I was just thinking that maybe the learning rate is too high, so "pulling the new weight back a little" might work. Therefore I only tried averaging the new weights trained after 257 with 257 itself. I never thought it would really work until my friends helped test the strength.

For fun I just averaged the last 20 networks that didn't pass with each other, by simply running the above script 20 times. I don't know if the network I created is better. The only thing I notice is that the network I created considers about 5x the number of moves every turn (all of them with 1 visit).

@MingWR I see. I also tried to extrapolate a new stronger future network's weight, from weaker networks of the past. Playing with network weights was really fun.

@LetterRip I don't think so. IMHO, since Zero's networks are not trained from human games, validating their accuracy on pro games doesn't make sense.

As you know, Leela Zero networks developed some new technical moves, like the early 3-3 point invasion, and the attachment right after the star point (4-4) or komoku (4-3). Therefore it's normal for them not to fit human moves.

Also, this project's goal is to make a strong zero-based Go AI, not a human-pro-like AI.

The network I created from the last 20 networks that didn't pass won 4 out of 5 games against the current best with 400 playouts. Even though I used several networks that scored below 10%, it doesn't seem to be any worse, and it might even be better.

@MartinDevelopment Can you continue this a bit more? If you can show that you got a much stronger net in this way with statistical significance, this may be big...

How about promoting the new mixed network weights to best network and letting many people use them for making self-play games?

I would like to listen to @gcp 's thought.

So is the effect of this just averaging out the changes of the networks and keeping what's identical between them?

I just added another 10 networks. If anyone would like to test it out you can download it here. https://drive.google.com/file/d/1t6TG4hGdZqkIbNf_FBQyHEtchc9ztCaj

Running

./leelaz.exe --gpu=1 -g -p 1600 --noponder -t 1 -q -d -r 1 -w newWeight.txt
vs
./leelaz.exe --gpu=1 -g -p 1600 --noponder -t 1 -q -d -r 1 -w 257aeeb863dc51bfc598838361225459257377a4b2c9abd3e1ac6cdba1fcc88f

@zediir It should smear the policy priors, and probably make the search tree wider but less deep in this way. It may also reduce errors in the value calculation, but I'm unsure about that part. Are you using FPU reduction code for your test?

yes. I'm running on next-branch.

@zediir I think averaging weights between networks weakens some over-fitted weight values and emphasizes the more common legitimate values. That made the network stronger.

It seems fun to try an automated system like the zero.sjeng.org server:

  1. We have network candidates, and we mix the strongest networks and let the result be a new candidate.
  2. We evaluate each network's strength by playing games, like the zero server does.
  Loop steps 1-2.

Maybe it makes a super-duper stronger new network, without more distributed effort needed.

Averaging weights from nearby networks decreases noise from mini-batch gradient calculations. If the network is very near the optimum then the weights don't change much during the training and we can think that the network weights are the optimal weights plus some noise caused by the stochastic gradient calculation. Averaging the weights decreases the amount of noise in weights and brings the network closer to the optimum.

Same effect can be had if learning rate is decreased.
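
For what it's worth, the noise argument in symbols (my notation, not anything from the training code): if each network snapshot is the optimum plus independent zero-mean noise of variance sigma^2, then

w_k = w* + e_k
w_avg = (w_1 + ... + w_N) / N = w* + (e_1 + ... + e_N) / N
Var[w_avg - w*] = sigma^2 / N

so averaging N nearby snapshots shrinks the noise in the weights by a factor of N (in variance).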

I noticed that the weight files produced by this script balloon in size because the averaged values are written out in full double precision instead of the short rounded form. It shouldn't make any difference to the outcome to use single-precision rounding like in regular LeelaZ weight files...

@Ttl Would increasing mini-batch size have a similar effect?

@jkiliani You can fix that: use

'%g ' % (value)

instead of

f'{value} '

For your reference:

%e: Scientific notation (mantissa/exponent), lowercase (ex: 3.9265e+2)
%f: Decimal floating point, lowercase (ex: 392.65)
%g: Use the shortest representation: %e or %f
copied from http://www.cplusplus.com/reference/cstdio/printf/
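
A quick illustration of the difference (Python 3.6+, same as the script above); since the original weights only carry single precision anyway, the extra digits written by the f-string are pure file bloat:

x = 0.1 + 0.2
print(f'{x} ')     # writes '0.30000000000000004 ' (full double precision)
print('%g ' % x)   # writes '0.3 ' (shortest of %e / %f)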

@jkiliani If learning rate is kept constant then yes. Increasing batch size while keeping learning rate fixed is pretty much equivalent to dropping learning rate and keeping batch size constant.
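
A rough sketch of why (just my reading of the linear scaling rule from the Facebook paper linked earlier, not anything project-specific): with the loss averaged over the minibatch, the update is

delta_w = -eta * (1/B) * sum_i grad_i = -(eta/B) * sum_i grad_i

so doubling B at fixed eta scales the per-example step the same way as halving eta at fixed B.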

In principle, a mix of two weights, w1 and w2, can be done by introducing a mixing parameter r:
w_mix = r*w1 + (1-r)*w2
We can empirically test values of r and see if the resulting w_mix is stronger than both w1 and w2. Can we run back-propagation on the matches between w1 and w2 so as to determine the mixing parameter? In fact, we might use a different mixing parameter for each layer.
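
A minimal sketch of that r / (1-r) mix, assuming the same plain-text weight format as the averaging script earlier in the thread (a version line '1', then one line of floats per layer); the file names and the function name here are just placeholders:

def mix_weights(path1, path2, out_path, r=0.5):
    # Write r*w1 + (1-r)*w2 to out_path, line by line.
    with open(path1) as f1, open(path2) as f2, open(out_path, 'w') as out:
        f1.readline()        # skip both version lines...
        f2.readline()
        out.write('1\n')     # ...and write a fresh one
        for line1, line2 in zip(f1, f2):
            v1 = [float(x) for x in line1.split()]
            v2 = [float(x) for x in line2.split()]
            mixed = (r * a + (1 - r) * b for a, b in zip(v1, v2))
            out.write(' '.join('%g' % m for m in mixed) + '\n')

# e.g. a 70/30 mix of two hypothetical weight files:
# mix_weights('weight_a.txt', 'weight_b.txt', 'weight_mix.txt', r=0.7)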

@larrywang30092 That sounds not simple at all. I suggest just doing it with a generate-and-test method (a random process), roughly as sketched below.

Pick the strongest network A and the second strongest network B (or sometimes choose another random network), choose a random rate number, generate a new network, and test it. If the new network is stronger, promote it to strongest network. Repeat this process.
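
Roughly, in pseudocode (everything here is a hypothetical placeholder, not project code; `mix` could be the r-mixing sketch above and `is_stronger` an SPRT match):

import random

def generate_and_test(best, others, mix, is_stronger, rounds=100):
    # best/others are networks; mix(a, b, r) builds a hybrid,
    # is_stronger(a, b) is a match test such as SPRT.
    for _ in range(rounds):
        other = random.choice(others)   # second best, or any random network
        r = random.random()             # random mixing rate in [0, 1]
        candidate = mix(best, other, r)
        if is_stronger(candidate, best):
            others.append(best)         # old champion stays in the pool
            best = candidate            # promote the hybrid
    return best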

I feel like the only reason this works quite well is that there is something fundamentally wrong with how we have done the training.

@CheckersGuy Not so sure about that. @Ttl's explanation that this method comes down to similar results as lowering the learning rate sounds plausible to me. So far we haven't done that yet since the bootstrap.

@jkiliani OK, maybe not fundamentally wrong, but if it comes down to simply changing the learning rate to achieve the same result as interpolating weight files, then there isn't much to it.

I think the progress is spectacular. Soon the only stronger bot on KGS will be 8 dan Golois6.

Only professionals are rated higher.

Has anyone actually posted some results on this yet?
As in minimum 200 game match at 1600 playouts with winrates etc?
The first post says that the two networks get 70% winrate, but calculated over how many games?

On CGOS these two networks W1 and W3 seem to win and lose against exactly the same bots as all the other strong LZ networks.

W3 has lost twice to Leela 0.11 at 1600 playouts, and Leela 0.11 is only about 100 Elo stronger than the best LZ networks; the next branch basically already plays even with Leela 0.11.
W1 has lost twice to Zen 11.4, which is at the same Elo as Leela 0.11, and also lost two games to low-Elo bots. It is of course still early going for them on CGOS.

I'm at 12 wins, 12 losses with @MartinDevelopment's net. I calculated that it'll take me about 7 hours to run 300 matches.

LeelaZeroT uses combined net weight1.txt from weight1.zip above
against 6 dan HiraBot42 on KGS

I'm testing the original weights1.txt from the beginning of this thread against 257aeeb8, at 100 playouts so I'm getting some results quickly. Current standings:

LZ-257aeeb8-p100 v LZ-weight1-p100 (51/100 games)
board size: 19   komi: 7.5
                   wins              black         white       avg cpu
LZ-257aeeb8-p100     18 35.29%       9  34.62%     9  36.00%    289.36
LZ-weight1-p100      33 64.71%       16 64.00%     17 65.38%    291.56
                                     25 49.02%     26 50.98%

Certainly looks to me like a substantial improvement to 257aeeb8, but it's still a bit early to say, and this is also not 1600 playouts. Which networks was weights1.txt averaged over?

A combined net from just 257aeeb8 and 63498669 should very likely be stronger than either net as well.

@jkiliani If I recall correctly, weights1 is made from 257aeeb8 and fc1c7273 with a 1:1 ratio.

@MingWR Thanks. Amazing how well those nets perform, do you have ringmaster test results you could post as well?

@roy7 If there are no objections in principle to using this in the training pipeline, how about queueing some of these composite nets as matches?

The result of weight1 vs 257 under 1600 po. Although the idea sounds crazy, it seems valid.

83 wins, 47 losses
The first net is better than the second
weight1 v 257  ( 130 games)
              wins        black       white
weight1  83 63.85%   40 64.52%   43 63.24%
257      47 36.15%   22 35.48%   25 36.76%
                     62 47.69%   68 52.31%

Here is the result of weight3 vs 257

102 wins, 64 losses
The first net is better than the second
weight3 v 257  ( 166 games)
              wins        black       white
weight3   102 61.45%   49 62.03%   53 60.92%
257        64 38.55%   30 37.97%   34 39.08%
                       79 47.59%   87 52.41%

What about 63498669 vs those weights?

The first net is better than the second
newWeigh v 257aeeb8 ( 216 games)
              wins        black       white
newWeigh  128 59.26%   64 59.26%   64 59.26%
257aeeb8   88 40.74%   44 40.74%   44 40.74%
                      108 50.00%  108 50.00%

http://www.yss-aya.com/cgos/19x19/cross/LZ-634986-t1-p1600.html

Opponent Rating Result Percent
leela-0.11.0-p1600 2958 0 / 2 0.00
LZ-halfblood-W3 2811 0 / 3 0.00
LZ-halfblood-W1 2722 2 / 3 66.67

What is this LZ-halfblood-W8? Seems to be strong.
http://www.yss-aya.com/cgos/19x19/cross/LZ-halfblood-W8.html

LZ-halfblood-W8 is a mix of 257, fc1, 278 and 27a. In a quick test it is not stronger than W3; I guess it was running without the playouts option on CGOS.

I agree with the assessment that what you are seeing is exactly what you would expect to get after a learning rate drop. (A one time jump in performance)

I would prefer the name "LZ-hybridblood-xx".

This sounds very promising. I think in the next few days we should concentrate on verifying this theory and, if it works, then optimize it. It could be much more useful to spend all the computing capacity on this than on squeezing a few more Elo out of the current process.

Luckily all previous networks are at hand, so we can do a full backtest to verify this and also see whether it always applies. Maybe we could do the following:

1, Start with a method that seems to work, like the one mentioned in the original post.
2, Backtest X (for ex. 10) points in time from day 1 of LZ up till now:
  • choose the best network at a given time, train a hybrid one using that and N previous networks
  • play matches between the two networks and check the results
3, Check if this method always has some benefit:
  • maybe this method does not work well in the very beginning
  • maybe this won't add extra benefits compared to the steep climb in December
4, If yes, let's try to optimize:
  • how many networks should be merged, or how many previous networks should be candidates for merging
  • what the minimum win rate for each network to be included should be
  • what the proportion of each network should be

Even if this is the same as adjusting the learning rate this may be better because the adjustment of the learning rate is now manual and is based on personal opinions. Automating this concept and adding to the LZ algorithm would not be that hard.

Also, even if the resulting hybrid networks are not stronger, they will still generate training games which is good for diversity and will not touch the learning rate which can still be altered manually.

Basically all we need is to merge a few existing networks and then play some games using the distributed network. Not too big an effort for a potentially big benefit.

To test whether this is caused by the learning rate, I think we could test it on the 5b network: see if a hybrid 5b network is better than the strongest 5b network, because the learning rate for the last 5b weights was very low.

@pangafu Yes, the match is ongoing.

The result of 5b hybrid vs c83:

223 wins, 180 losses
403 games played.
Status: 0 LLR 2.29102 Lower Bound -2.94444 Upper Bound 2.94444

@gcp

Do you think there could be any benefit to adopting a network such as these, and then continue training as per usual?

As you said, these networks are equivalent to lowering the learning rate, and the worry is that doing so too soon will lead to slower overall progress, as it becomes harder for the network to jump to different local optima in strength.

But if you take the 'one-time boost' and then continue training with the old learning rate, could it be that this will help, since you start generating higher-quality training data?
Or is the fear still that it becomes harder to move away from the supposed local optimum?

I would disagree with such an approach, since it is only a small gain but it breaks the whole process.

@dzhurak this could be something to try right before ending this run though, right?

To try means to do it in parallel with current run. Not to replace this run with a new one. How would you compare the results?

I would guess a new run would be made once this one has stalled. So we could instead just try this method, and if the winrate rises again we could see if it is a short term rise equivalent to a lower learning rate or if it works on a longer term. Even if we can't really tell the difference we would still learn more than if we don't try at all.

Do you think there could be any benefit to adopting a network such as these, and then continue training as per usual?

No, not at all. There is about zero chance I'd use this, unless there is some indication the working is fundamentally different from lowering the learning rate.

Or is the fear still that it becomes harder to move away from the supposed local optimum?

Correct.

I've already explained at length in other threads why dropping the learning rate too early risks completely undermining all progress, and what needs to be done to determine a more optimal one, and said several times that a one time jump as is seen here is exactly what that would produce.

the adjustment of the learning rate is now manual and is based on personal opinion

And this is only so because everyone comments or proposes even worse alternatives and nobody bothers running the required tests.

There is no "opinion" involved in setting the current learning rate: it stays at the current schedule until the learning stalls (and we seem to be close to this, which is why the results here are even less surprising), and is then dropped.

It may be possible to find a more optimal rate by doing a series of tests, and a ton of people put work into #747 to make that possible, but nobody has actually run the test.

@gcp "nobody bothers running the required tests."

I lack hardware and skills, but I did run weight1.zip net on KGS (as LeelaZeroT) and it did not seem stronger

Now I am running weight3.zip net against 6dan HiraBot43

No, not at all. There is about zero chance I'd use this, unless there is some indication the working is fundamentally different from lowering the learning rate.

I think there is a chance it is different. There is precedent for this kind of thing in genetic algorithms. They have similar problems with local optima as we have, and yet an operator combining 2 or more genetic codes into one to make a higher-fitness child is a widely used technique (crossover). Now, even though the AlphaGo Zero algorithm has parts in common with genetic algorithms, it has substantial differences too, which makes me unsure. But if this actually is analogous to a crossover operator, it might not get stuck in a local optimum as long as we don't lower the learning rate.

"The advantage of recombination is that it breaks down random linkage disequilibrium generated by genetic drift. "

https://www.ncbi.nlm.nih.gov/pubmed/4448362

In biology, there is hybrid vigour, where a hybrid between two strains is better than the base variants. This vigour can usually not be maintained when you try to keep breeding those hybrids, or they're even sterile.

@odeint "hybrid between two strains"

You are quite confused. Hybrids are not between two strains. They are between different species. You get as a result for example infertile mules.

Between two strains, normal recombinations happen and are very beneficial and in most cases necessary.
Even if the difference between strains is significant, like in interracial crossbreed, the descendants are usually fine, although the professional breeders might be unhappy about getting a mongrel ;)

I'm quite sure hybrid corn or hybrid chickens aren't between different species, but just different strains/breeds of the same plant/animal species. You can have hybrids at different levels - in agriculture, hybrids between species is rather rare.

The definition of what a species actually is is pretty difficult, but one of the criteria is usually if interbreeding is possible or not (doesn't work perfectly because of things like geographical ring species).

"I'm quite sure hybrid corn or hybrid chickens aren't between different species, but just different strains/breeds of the same plant/animal species. "

By such definition mulattoes like Obama or mestizo would be called hybrids. Strange.

I'm not so sure the "hybridisation" is 100% equivalent to simply lowering the learning rate. At least in cases where one of the networks is somewhat older, it also should mean an effective increase of the window size, since the respective training windows of such a hybrid overlap only partially. The boost available from this technique may not be so easily attainable without also increasing the size of the training window.

They have similar problems with local optima as we have, and yet an operator combining 2 or more genetic codes into one to make a higher-fitness child is a widely used technique (crossover).

This would be a valid argument if the networks had evolved independently. But they are actually slight variations of each other. (And from the evidence, they're slightly noisy variants around an optimum. The whole question is whether this is the global optimum, in which case lowering the learning rate would be fine, or a local optimum, in which case it would be "fatal" in terms of optimal convergence.)

Has anyone tried matching the new "hybrid" network not against its parents but against the next best? That way you can decide whether this method brings more than just training with new games.
And by "match" I mean running the validation program and verifying the result with SPRT (all the other tests are not statistically significant).

@gcp You could also consider promoting a "not best" network every 10 or 20 to generate more diversity in the self-play games, as a means to escape local minima. AlphaZero does that for each network, so it could be a compromise.

For more diversity, I think it's also worth considering that, in addition to only doing self-play with the current best network, when there is a new network within about 35 Elo of the current best one (i.e. a 45-50% winning rate against it), we could get training games from matches between them.

@gcp You could also consider promoting a "not best" network every 10 or 20 to generate more diversity in the self-play games, as a means to escape local minima

We have already discussed testing that but not for this run. (I would just always use the latest network from training)

For more diversity, I think it's also worth considering that, in addition to only doing self-play with the current best network, when there is a new network within about 35 Elo of the current best one (i.e. a 45-50% winning rate against it), we could get training games from matches between them.

There is already considerable noise injected into the networks. There's no reason to believe networks that are close in strength add any diversity. So this seems like a far worse suggestion than the previous one.

@gcp 6349 is very close to 257a, but they play in very different ways after the opening (from a high-dan amateur player's view), so I think it would be worth doing.

See comments by me and @jkiliani starting here: #780 (comment)

Further comment: The use of GAs and evolutionary methods is well-established. Even without crossover, they can help escape overfitting (local optima). But it appears that the r, (1-r) method with a random r (perhaps only in [0,1], but perhaps one could also try [-1,2]) would be a reasonable approximation of crossover.

However, if you try these techniques and only keep 'the champ', then you effectively defeat the purpose of the GA, which is to rely upon a population's variability/variation/diversity to guard against over-fitting.

Therefore, if this project or another were to adopt a GA/evolutionary approach, I would highly recommend slightly restructuring the 'match' process by using a sizable pool of potential candidate networks. The general consensus is to use at least a population size of 30 at a bare minimum. For a project like this, however, which is highly distributed and with a large 'genome', I would recommend going higher, at least 100, maybe 200, maybe more.

At first glance, using such a large population size may seem very wasteful. However, due to natural selection, the 'losers' get weeded out quite quickly, and you end up with a robust, healthy population of 'pretty good' genomes, the best of which, the 'champ' so to speak, becomes the representative of the progress of the project.

By competing against a variety of competitors, and using some form of mutation (which already exists, as the training process on recently played games), the 'champ' will very likely become a much more robust 'solution', i.e. it will play Go better against a variety of other bots and humans.

The crossover operation is optional, but as this issue seems to indicate, could be easily implemented as the r, (1-r) linear combination, with r chosen randomly as a kind of proxy for the 'crossover points'. If r is allowed to range outside of [0,1], such as [-1, 2], then this allows crossover not just to produce convex (i.e. internal) weighted averages, but will actually allow it to occasionally produce exterior weighted 'averages'.

When performing such crossover operations, I would recommend using two parents (obviously) to produce two complementary offspring. One using [r, (1-r)] as parameters, and the other using [(1-r), r]. These parameters can be applied in a one-shot way to all the weights, or you could randomly assign the r or (1-r) to each weight, and in the complementary offspring, you would use the complementary (1-r) or r, respectively.
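
As a concrete sketch of those two complementary offspring (my own illustration on flat weight vectors, not project code):

import random

def crossover(parent_a, parent_b, per_weight=False, r_lo=0.0, r_hi=1.0):
    # Return two complementary children: one mixed with r, the other with (1 - r).
    # With per_weight=True a fresh r is drawn for every single weight.
    if per_weight:
        rs = [random.uniform(r_lo, r_hi) for _ in parent_a]
    else:
        rs = [random.uniform(r_lo, r_hi)] * len(parent_a)
    child1 = [r * a + (1 - r) * b for r, a, b in zip(rs, parent_a, parent_b)]
    child2 = [(1 - r) * a + r * b for r, a, b in zip(rs, parent_a, parent_b)]
    return child1, child2

Setting r_lo=-1.0, r_hi=2.0 gives the 'exterior average' variant mentioned above, and drawing r from {0, 1} per weight degenerates to plain substitution crossover.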

Training (mutation) could be applied before crossover or after. Arguments could be made for either case, but simply testing both methods would probably answer the question. Alternatively, you could randomly choose to apply mutation before or after crossover for each mating.

This GA/evolutionary strategy could be used in conjunction with a learning parameter just fine. It could also be considered a simple extension of the AlphaZero technique of not bothering with match validation, and only keeping the most recent mutated offspring of the previous 'champ'. In other words, AlphaZero is already doing this technique, but without crossover, and with a population size of exactly 1. (And no 'elitism' as defined in the GA literature.) The LeelaZero project could adopt this strategy also, and simply increase the population size to some number N, and introduce a linear combo crossover operation.

If you're worried that crossover will cause problems, just introduce a pCrossover parameter and set it to some low value like 0.05 or 0.1. Otherwise, you can introduce an elitism parameter (usually set to 1, rarely 2 or more) which simply means that the current 'champ' (i.e. the one with the current best overall score, presumably Elo or some other proxy for skill) can never go extinct accidentally (by losing a tournament against a weaker competitor by chance).

'Tournaments' could be set to consist only of 1 game (the AlphaZero approach, I'm guessing; not sure), or of some number of games, presumably up to the current parameter for matches of 400 games. You could even keep the SPRT feature to end obvious tournaments early.

Instead of separating between self-play and match games, you would just include all games, even between different nets, into the training set (unless you wanted to generate some extra games by some method for the purposes of having a 'validation data set'). Of course, there's always the chance/likelihood that a net may randomly be matched up against itself for a tournament, in which case it's equivalent to generating self-play games.

This overall approach could be adopted only a little bit or go full bore, and could be adopted in steps gradually from the current code base.

The main point I'd like to get across is that it's a well-established meta-optimization technique, and not something out-of-the-blue or fundamentally a mistaken idea. It's just a question of whether you want to incorporate it or not, how much or how little to incorporate it, and what parts to implement first.

As to the last question, I personally would recommend implementing the (populationSize > 1) part first, since that's where you'll get the most tried-and-true benefit from the technique. Then, obviously, implementing the linear crossover operator would be an easy next step. Then you could start exploring how to implement the tournament method, as there are a wide variety of different ways to do it (although the simple method of just randomly selecting networks from the population to compete would be easy and is already very effective).

I think the hardest part from a technical standpoint would be that the generation/mutation step (i.e. training from previous games to produce a new candidate network) is time consuming and data intensive. (Usually in GAs, this is not the case; usually generating new genomes is cheap, but evaluation/selection/tournaments take the most time.) Still, this should not be a complete barrier to trying the GA/evolutionary approach; it just means you'll need to adapt it a wee bit to accommodate slow generation/mutation, and the resultant population turnover from reproduction and death/extinction.

Since the current self-play game generation and 'current champ' network training technique is already parallelized and decoupled, I don't think this will be a difficult technical consideration actually. Just modify the currently working method and adapt it so that old networks don't immediately go 'extinct' if they fail to beat the 'champ' in a 400 game match. Instead, allow them to survive and participate randomly in the self-play (now competitive-play) game generating process with all the other N networks in the overall population of 'current contenders'. Whenever a new network (or two, in the case of crossover, as suggested) is added to the population, just remove/kill-off/extinguish either a) the current worst of the contenders, or b) a randomly selected individual contender, with probabilities weighted to favour killing off the weaker contenders.

But the method used here is nothing like the crossover used in GAs. In GA crossover you substitute part of the parameters with the parameters of another individual; you don't take some sort of weighted average. So if you want to do a proper crossover, you have to take a bunch of weights from one network and put them into another, but you need to substitute them, not "fuse" them together.

I would rather vote for trying more training steps. With 5x64 we were training until 256k steps, but now only 128k. I understand that it requires more time, but all recent best networks tend to be picked after 64k steps. Also, since 6x128 has many more parameters, it needs to be trained more.

Are there still 10x128 networks being trained on the newest data, like the ones that were tested on Jan 20 (there are 5 on the server from that day)? Once they surpass 6x128 on the same training data, it can be concluded that 6x128 is close to saturation.

I would rather vote for trying more training steps

This is nothing you can "vote" about. You can test it and demonstrate that more training steps give better convergence:

step 16000, policy=2.38733 training accuracy=50.4373%, mse=0.161347
....
step 104000, policy=2.38264 training accuracy=50.6428%, mse=0.16285

So no, right now more training steps won't help convergence. Lowering the learning rate would (with usual caveats) or waiting for more new games in the training window. Everything else is just the random shuffling occasionally ending up at a higher/better point.

It's possible that with more games in the window, the curve looks better for longer training - I don't have the one that promoted 63498669 in my scrollback - but given that the later networks there also were bad I doubt so.

But the method used here is nothing like the crossover used in GAs. In GA crossover you substitute part of the parameters with the parameters of another individual; you don't take some sort of weighted average.

It's been shown that you needn't (and in fact shouldn't) become too attached to the 'traditional' definition of crossover. E.g. E Falkenauer showed long ago (in Grouping Genetic Algorithms ) that you can interpret 'crossover' abstractly as an operator that simply takes two (or more, but usually two) candidate solutions/genomes and produces one or more (usually two, but not necessarily) solutions/genomes that in some way are more likely than simple uniform random mutation to combine/mix in a more-optimal-than-either-parent way. In his examples, he used genomes that somehow encoded information about groups (classic example would be the knapsack problem), which are notoriously hard to encode in a straightforward, 'traditional' GA bitstring -- or even GP tree-style -- genome (that can simultaneously produce sensible results from generic traditional crossover). Though his abstract grouping-crossover operator was anything but traditional, he managed to achieve excellent results regardless.

By using a linear combination of two sort-of-optimal vectors in a linear space, the vector difference between the two parent vectors, e.g. C = B - A, represents a line running through the two parent vectors (considered as points in the parameter space), and that line is more likely to pass nearby a more-optimal-than-either-parent part of the parameter space than a random line from candidate A in a randomly chosen direction.

This method is already used fairly extensively in various kinds of MCMC (Markov Chain Monte Carlo) techniques to produce more-likely-optimal solution directions for exploration in high-dimensional spaces (which our networks definitely live in). So even if you want to say that it's not 'traditional' crossover, it's still a tried-and-tested method for producing new better-than-random-direction candidate solutions from existing ones.

Of course, neural nets are non-linear, and earlier I suspected that this technique would not likely produce useful results in the general case -- which it probably still wouldn't, generally -- but it seems I was overly pessimistic in these particular circumstances of quite-closely-related networks. For the purposes of an instance of a Falkenauer-type abstract crossover operator, provided the networks are not too dissimilar, it appears that it already works just fine. It doesn't have to be 'perfect' or 'exactly traditional' for it to be 'good enough'. That's one of the nice things about GAs/evolutionary techniques: they are pretty robust to using imperfect operators. 'Good enough' is generally good enough.

So if you want to do a proper crossover, you have to take a bunch of weights from one network and put them into another, but you need to substitute them, not "fuse" them together.

If you imagine that the mixing parameter r is randomly chosen from the two-element set {0, 1} for each weight, then you could achieve your ideal of 'substitution' using exactly the same [r, (1-r)] linear combination as I proposed earlier, with the complement [(1-r), r] chosen instead uniformly randomly, 50/50 for each weight. So, it appears that this conception of 'traditional' crossover is just a subset of linear combination with r chosen from a continuous range which includes 0 and 1, rather than strictly in a binary fashion.

@gcp

There is already considerable noise injected into the networks. There's no reason to believe networks that are close in strength add any diversity. So this seems like a far worse suggestion than the previous one.

That is certainly true in the case when the proposed populationSize parameter is kept at 1. However, you can begin harnessing diversity/variance if you introduce a population size bigger than 1. As a very rough rule of thumb, a population size of around 30 or thereabouts begins to contain enough useful variation that natural selection can begin working on it while maintaining enough diversity to keep the entire population from getting stuck in local optima. And you can always maintain an elitism parameter of 1 just to be sure you never lose the current-best champ from the population by accidental defeat by a weaker candidate.

Even just introducing a small population of say 5 top contenders might improve the current situation. Heck, even 2 might do something. Could always start small and see.

Personally, I have a hunch the current networks are stuck in a local rut, though I only have intuition (from seeing many recent match and self-play games) to back that up. I'm guessing that allowing for even a tiny bit of additional variation/diversity in the form of a pop size greater than 1 would shake things up enough to perhaps begin working itself out of the rut. Just my guess. Aside from that guess, I'm much more certain that a larger population size would definitely help escape local ruts in the long term. Whether or not it's worth it to change is a whole 'nother question. Perhaps another project, or a fork of this one, would be more appropriate to try this out on, as an experiment.

Of course, neural nets are non-linear, and earlier I suspected that this technique would not likely produce useful results in the general case -- which it probably still wouldn't, generally -- but it seems I was overly pessimistic in these particular circumstances of quite-closely-related networks. For the purposes of an instance of a Falkenauer-type abstract crossover operator, provided the networks are not too dissimilar, it appears that it already works just fine.

Yeah I was surprised too. Is it something analogous to different species not being able to breed together while same species can ?

@wctgit Put it how you want: crossover is built to combine two individuals that are exploring different places in the multidimensional space, and the resulting child will explore yet another part of that space. What is tried here is to take two individuals that are almost the same and hope that something completely different comes out. The offspring of our networks will be a network working in the same region as the parents, because they are really close. The first one is a bit better, but do it again and you'll see it gets worse, since you just move around that optimum. If you took a really different network, like an old one that was discarded, then you could introduce something new and start moving through the multi-space, but as in GAs you will find that you throw away something like 95% of the offspring because they bring no improvement, so you will lose all that time testing bad networks.

@zediir @godmoves , how do you run the games between two weights automatically?

@optimistman I use the validation project to test between two networks or two binaries. (https://github.com/gcp/leela-zero/tree/master/validation)

Put it how you want: crossover is built to combine two individuals that are exploring different places in the multidimensional space, and the resulting child will explore yet another part of that space. What is tried here is to take two individuals that are almost the same and hope that something completely different comes out.

In a high-dimensional space, two 'almost the same' vectors can nevertheless be quite 'distant' in the space, especially if you're exploring in the neighbourhood stochastically (i.e. wandering in some more-or-less drunken walk). This is especially troublesome if there are complex dependencies/correlations between changes in one dimension and changes in another. With so many possible directions to wander in, you're likely to stray off the path of increasing optimality.

It's like two people wandering in a multi-dimensional fog, trying to find each other at a middle-ground meeting place. If only they had a string held tightly between them, to guide them toward each other, they would already know which direction is a middle-ground.

By taking a vector difference, C = B - A, you automatically get that 'string', which represents a direction which you know contains two 'pretty good' points on it. Maybe somewhere in between them, or even perhaps 'behind' one or the other a little bit, you will find an even more optimal location. There's no guarantee, of course! It's just more likely than the more-or-less drunken walk (stochastic gradient descent).

So, with this kind of linear crossover of not-too-distant networks, you're more likely to get an offspring that is better than both parents. Not guaranteed, just more likely.

Crossover alone will not help with escaping local optima. Crossover is only so good as the population variance it has to play with. Therefore, you need several points/vectors/individuals/networks, and a source of stochastic variance between them, to really benefit from this kind of linear combo crossover, IMHO.

The most important factor in avoiding local optima traps is generating a healthy level of variation -- which the current training methods already presumably do (otherwise this project is doomed, anyway; but we know that AGZ and AZ have already worked, so probably not doomed after all) -- and maintaining that variation, which increasing the population size above 1 would do.

In evolutionary terms, cross-over is great for shuffling existing genetic information, but ultimately you need a source of random mutation and a big-enough population to avoid permanent bottle-necking and inbreeding.

Note: Some of the above is based on actual experience, but much of it is merely my educated opinion, based on my previous learning/research and others' reports of what works. I could easily be way off, forgetting some crucial assumption, or missing some important detail, but if I was just talking entirely out of my ass, I wouldn't bother to put forth these ideas in the first place. I'd keep it to myself instead, to avoid saying something foolish. Again, could be wrong, but pretty sure I'm safely in the ballpark.

If you took a really different network, like an old one that was discarded, then you could introduce something new and start moving through the multi-space,

Again, that's why you need a decent sized population, so you don't throw out too many candidates. But even if you happen to throw out too many candidates at some point, that's why you have some form of continual mutation rate, so that you have the opportunity to wander outside the local optimum again. That's why GAs work as well as they do. Not just selection, but mutation, too. Exploration and exploitation.

Yeah I was surprised too. Is it something analogous to different species not being able to breed together while same species can ?

Perhaps that's a good analogy, I'm not sure. For speciation, I would imagine that the important consideration there would be different environmental niches; a multi-modal fitness landscape. Probably don't need to get that complicated for learning optimal Go. (In fact, AGZ proves we don't, at least within the range of human play. It can beat everybody, not just some limited styles of players.)

And here is an article that introduces the application of genetic algorithms to deep learning, by Uber: deep-neuroevolution (https://eng.uber.com/deep-neuroevolution/). Of course, this algorithm needs a huge amount of computation, and I can't implement it on a single machine. So sad.

Well, whaddayaknow! That's roughly the same idea I was talking about. Looks like it actually works. Cool! Glad to see I wasn't too far off track. :-D

@zediir How do you build the validation code? I cannot find the project file to build.

It's a qt project. Build process is the same as building autogtp.

The hybrid of two weights is, to a first-order approximation, equivalent to a reduction of the learning rate. However, this exercise brings up an interesting point: we should implement the symmetry of the weights using this type of hybrid approach. In the AGZ paper, the symmetry is implemented explicitly. This translates to a faster update of the weights, since for a typical point (not on the center or the diagonal lines), the update of one point amounts to propagating to 8 equivalent points. In this way, the learning could be drastically improved.

@larrywang30092 I don't quite understand why this is equivalent to a reduction of the learning rate to first order. Could you explain more? I'd prefer some mathematical insight here.

@zediir I am downloading the qt now, is open source version of Qt good enough?

@bood When solving coupled equations, one standard approach is to linearize the changes and take only a fraction of the predicted change into the next iteration. The latter step ensures steady, fast convergence and is commonly used in steepest-descent optimization.
In this case, the update can be considered the linear prediction of the optimal move, and the learning rate determines how big a fraction of the predicted update gets added to the new weights. As such, the hybrid of the weights can be viewed as a reduction of the learning rate.
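
A one-line version of that argument (my own sketch, assuming the newer network is roughly one accumulated gradient step g away from the older one):

w_new = w_old - eta * g
(w_old + w_new) / 2 = w_old - (eta / 2) * g

i.e. to first order, the 50/50 hybrid is the same step taken with half the learning rate.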

Hello everyone, I created an automatic hybrid-weights-and-test program for Leela Zero on GitHub:
https://github.com/pangafu/Hybrid_LeelaZero

Please feel free to hybridize whichever weights you like and test them.

Has anyone tried rotating and transposing the weights and then merging them in this way? I believe there are 8 symmetries, so it'd be 8 weight files at 1/8 weight each. I'd be very curious how that does vs the original weight file used to test this.
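
For what it's worth, here is a rough numpy sketch of the eight symmetries applied to a single 3x3 convolution filter (just an illustration of the idea; turning this into a transform of a whole weight file, including the policy and value heads, is exactly the part nobody has demonstrated yet):

import numpy as np

def eight_symmetries(f):
    # All 8 dihedral transforms of a square filter: 4 rotations plus
    # the reflection (transpose) of each rotation.
    variants = []
    for k in range(4):
        r = np.rot90(f, k)
        variants.append(r)
        variants.append(r.T)
    return variants

def symmetrize_filter(f):
    # Average a single (3, 3) filter over its 8 symmetries, 1/8 weight each.
    return np.mean(eight_symmetries(f), axis=0)

# Example on a toy filter:
# f = np.arange(9, dtype=float).reshape(3, 3)
# print(symmetrize_filter(f))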