jonathan-laurent / AlphaZero.jl

A generic, simple and fast implementation of DeepMind's AlphaZero algorithm.

Home Page: https://jonathan-laurent.github.io/AlphaZero.jl/stable/

Is it worth continuing training after the memory buffer has been filled with samples?

smart-fr opened this issue

To avoid an ERROR: Out of GPU memory, I had to reduce my mem_buffer_size quite drastically (even though there is not yet a logical explanation for why this works).

As a result, the memory buffer is filled with samples after only a few training iterations.
I am wondering whether subsequent training iterations truly improve my agent or are just a waste of computing time, since I am observing the following after the memory buffer was filled (using netparams taken from the connect-four example, which might not be optimal for my game):

  • the learning phase very rarely manages to reduce the loss
  • the network is very rarely replaced after the checkpoint evaluation (it seems to be replaced only when the loss was successfully reduced), as if it had converged to a stable state.

Playing against my agent, I find it has reached a fair level, yet I suspect it stopped improving after the memory buffer was filled. I can't be sure of this, so I will test further and maybe try pitting two versions of it against each other.

But I have the following theoretical questions:
(I understand that the samples generated during self-play are meant to induce both new visits to existing MCTS nodes and brand-new nodes.)

  • during a new iteration after the memory buffer has been filled, are some nodes/visits of the current MCTS tree forgotten, while new ones are created from the samples generated during self-play?
  • can we say that the affected nodes are probably improved by this "creative destruction" process, since their statistics are less polluted by silly moves inspired by the early NN?
  • since the MCTS learns at every iteration, is it correct to say that the agent improves during an iteration even when the NN isn't replaced after the checkpoint evaluation?

First of all, you should make sure that the buffer size is at least as big as the number of samples generated during a single iteration. Otherwise, you are just throwing away compute.
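For a concrete sanity check, here is a minimal back-of-the-envelope sketch in Julia; all the numbers (games per iteration, average game length, buffer size) are hypothetical and only illustrate the comparison:

```julia
# Back-of-the-envelope check: one iteration of self-play generates roughly
#   num_games_per_iteration * avg_game_length  samples.
# All numbers below are hypothetical, for illustration only.
num_games_per_iteration = 5_000
avg_game_length = 30            # average number of moves per game
samples_per_iteration = num_games_per_iteration * avg_game_length

mem_buffer_size = 40_000        # a drastically reduced buffer, as in this issue

if mem_buffer_size < samples_per_iteration
    @warn "Buffer holds less than one iteration of self-play data; " *
          "freshly generated samples are partially discarded right away."
end
```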

If the network is not replaced after an evaluation, it means that it is not significantly better than the one already in use to generate data. However, the optimization process can go through local minima, and an improvement can sometimes take several iterations to materialize.

As a rule of thumb though, if the network is never updated during a number of consecutive iterations large enough for the memory buffer to be fully renewed, then it is likely that learning has stalled or something is wrong with your hyperparameters. Indeed, in this case, the quality of data in your buffer is not improving anymore and self-play data is being wasted.
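The renewal horizon is easy to estimate; a small sketch with the same hypothetical numbers as above:

```julia
# With the same hypothetical numbers, how many iterations does a full
# renewal of the buffer take?
samples_per_iteration = 5_000 * 30       # games per iteration * avg game length
mem_buffer_size = 2_000_000              # e.g. a connect-four-sized buffer

iterations_to_renew = ceil(Int, mem_buffer_size / samples_per_iteration)
println("Buffer fully renewed after about $iterations_to_renew iterations")
# If the network has not been replaced once over that many consecutive
# iterations, the data in the buffer is no longer improving.
```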

Finally, the advantage of a bigger buffer is that you get more training data and possibly better generalization. You also get better sample efficiency by reusing samples in multiple batch updates. The ideal size of the memory buffer is hard to determine in general and is a critical hyperparameter to tune. However, 40K samples looks pretty low (2M is typical for connect four).
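One way to reconcile a small initial buffer with a large final one is to schedule the buffer size over training, as the connect-four example does. Below is a sketch using the PLSchedule (piecewise-linear schedule) utility shipped with AlphaZero.jl; the values are purely illustrative, not the ones from the example:

```julia
using AlphaZero: PLSchedule   # piecewise-linear schedule utility

# Illustrative values only; check the connect-four params.jl in the repo
# for the actual settings used there.
mem_buffer_size = PLSchedule(
    [      0,        15],     # iteration numbers
    [500_000, 2_000_000])     # buffer capacity at those iterations
```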

Thank you for all the insight.
I understand it probably makes no sense to say that the agent improves during an iteration even when the NN isn't replaced after the checkpoint evaluation.
As a matter of fact, it now appears (after 60+ iterations) that replacing the NN takes almost exactly the number of consecutive iterations needed for the memory buffer to be fully renewed.

Note that the agent does not technically improve when the NN is not replaced, since the same NN is still used to generate data. But that does not mean progress isn't being made: the quality of the data in the memory buffer may be improving, and the current network is still being optimized (a lower loss may not mean a better NN right now, but it may lead to one in the future).