Sometimes a random initial net doesn't progress.
Sopel97 opened this issue
This happens randomly with randomly initialized nets. A simple

```
setoption name SkipLoadingEval value true
setoption name Threads value 4
setoption name Use NNUE value pure
learn targetdir data_q loop 4 batchsize 100000 use_draw_in_training 1 use_draw_in_validation 1 lr 1 lambda 1 eval_limit 32000 nn_batch_size 1000 newbob_decay 0.99 eval_save_interval 10000000 loss_output_interval 100000 set_recommended_uci_options
```
with any data should replicate the issue more than 50% of the time. The issue can be seen by inspecting the startpos `eval` and `norm` values at each iteration. `norm` sometimes looks stuck at `2000*startpos_eval`, which means all evaluated positions had ± the same eval -> only biases were applied [?]. This can last for many [possibly an infinite number of?] iterations. This looks to be correlated with min/max activations quickly converging to 0 or 1. The problem seems to be rooted in the clipped ReLU layers, which, during backpropagation, zero the gradient whenever the output was outside the exclusive range 0..1 (this is based on the observation that when the gradient is made to never be zeroed, training doesn't get stuck, though that is obviously broken in other ways).
Solved by #242