Sometimes a random initial net doesn't progress.
Sopel97 opened this issue
This happens randomly with randomly initialized nets. A simple

```
setoption name SkipLoadingEval value true
setoption name Threads value 4
setoption name Use NNUE value pure
learn targetdir data_q loop 4 batchsize 100000 use_draw_in_training 1 use_draw_in_validation 1 lr 1 lambda 1 eval_limit 32000 nn_batch_size 1000 newbob_decay 0.99 eval_save_interval 10000000 loss_output_interval 100000 set_recommended_uci_options
```
with any data should replicate the issue more than 50% of the time. The issue can be seen by inspecting the startpos `eval` and `norm` values at each iteration. `norm` sometimes looks stuck at `2000*startpos_eval`, which means all evaluated positions had ± the same eval -> only biases were applied [?]. This can last for many [possibly an infinite number of?] iterations. This looks to be correlated with min/max activations quickly converging to 0 or 1. The problem seems to be rooted in the clipped ReLU layers, which, during backpropagation, zero the gradient whenever the output was outside the exclusive range 0..1 (this is based on the observation that when the gradient is made to never be zeroed, training doesn't get stuck, though that is obviously broken in other ways).
Solved by #242