cvg / pixloc

Back to the Feature: Learning Robust Camera Localization from Pixels to Pose (CVPR 2021)

NaN appears during training

loocy3 opened this issue

After 21,850 training iterations, I got NaN values in the UNet-extracted features.
Could you give any advice on which part of the source code I should look into?

  1. What dataset are you training with?
  2. Could you try to enable anomaly detection by uncommenting this line (a minimal usage sketch follows below)? Please then report the entire stack traceback.
    # torch.autograd.set_detect_anomaly(True)
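
For reference, here is a minimal sketch of what enabling anomaly detection looks like in a training script; the model, optimizer, and data below are placeholders, not the actual pixloc training setup:

    import torch

    # Flag the first backward op that returns NaN/inf and print the traceback
    # of the forward call that created the offending tensor.
    torch.autograd.set_detect_anomaly(True)

    model = torch.nn.Linear(4, 1)                    # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    x, y = torch.randn(8, 4), torch.randn(8, 1)      # placeholder batch

    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()   # a NaN/inf gradient raises a RuntimeError here, naming the op
    optimizer.step()
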
  1. NaNs could appear in the solver step if the optimization is too difficult, but this should already be handled by the code.
  2. Did you try to train with a different random seed? Is the NaN always appearing at the same training iteration?
  1. NaNs could appear in the solver step if the optimization is too difficult, but this should already be handled by the code.
    -> May I know how you handle that case? With a 'too few match points' check? (See the sketch after this list.)
  2. Did you try to train with a different random seed? Is the NaN always appearing at the same training iteration?
    -> I loaded the pretrained CMU model and fine-tuned it on KITTI data. I did not change the random seed. The NaN does not always appear at the same training iteration, but when it recurs it shows up around iterations 29000~34000.
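
For reference, the guarding pattern for solver-step NaNs typically looks like the sketch below. This is a generic illustration; the function name, damping, and zero-update fallback are assumptions for illustration, not the exact pixloc implementation:

    import torch

    def guarded_step(g, H, lambda_=1e-3, valid=None):
        """One damped Gauss-Newton / Levenberg-Marquardt step with NaN guards.
        g: (B, N) gradients, H: (B, N, N) approximate Hessians,
        valid: optional (B,) bool mask, False for samples with too few matches."""
        diag = H.diagonal(dim1=-2, dim2=-1) * lambda_        # LM damping term
        H_damped = H + diag.diag_embed()
        try:
            delta = torch.linalg.solve(H_damped, g.unsqueeze(-1)).squeeze(-1)
        except RuntimeError:
            # if the batched solve fails outright, fall back to a zero update
            delta = torch.zeros_like(g)
        bad = ~torch.isfinite(delta).all(-1)                 # per-sample NaN/inf updates
        if valid is not None:
            bad = bad | ~valid                               # e.g. too few match points
        return torch.where(bad.unsqueeze(-1), torch.zeros_like(delta), delta)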

I'm also having this kind of issue. I am training on the same MegaDepth dataset with different configurations of the U-Net (encoder pretrained on other data, frozen encoder, deleted decoder, etc.). All of them lead to NaN at some point during the optimization. I haven't yet concluded whether the NaNs come from the optimization or from the features directly.

Edit: I did not change the random seed either, and the error does not repeat at the same iteration. It seems to appear randomly in the middle of training.

  • Does the anomaly detection show that NaNs consistently appear at the same ops?
  • Any spike in the loss function in the preceding iterations? (A simple runtime guard for this is sketched below.)
  • What versions of NumPy & PyTorch are you using?
  • Does reducing the learning rate help?

This is concerning; let me dig into it (this will likely take me a few days).
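
In the meantime, a lightweight runtime guard can catch the first bad iteration. This is a generic sketch assuming a standard supervised loop; model, loader, optimizer, and loss_fn are placeholders, not the actual pixloc trainer:

    import torch

    def train_with_guards(model, loader, optimizer, loss_fn, spike_factor=5.0):
        """Warn on sudden loss spikes and stop at the first non-finite loss,
        so the offending iteration and batch can be inspected."""
        prev = None
        for it, (x, y) in enumerate(loader):
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            if not torch.isfinite(loss):
                raise RuntimeError(f"non-finite loss at iteration {it}")
            if prev is not None and loss.item() > spike_factor * prev:
                print(f"[warn] loss spike at iteration {it}: {prev:.3f} -> {loss.item():.3f}")
            prev = loss.item()
            loss.backward()
            optimizer.step()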

  • The output of the anomaly detection always points to the power operation in the loss estimation. But the NaN trace comes from the pose optimization; I'm not sure if it originates in the features or in the pose itself. I'm running another training run, so I hope to give some more information soon. (A toy example of how a pow op can surface an upstream NaN is sketched after this comment.)

[11/02/2021 07:16:05 pixloc INFO] [E 7 | it 2450] loss {total 3.257E+00, reprojection_error/0 9.695E+00, reprojection_error/1 8.376E+00, reprojection_error/2 8.366E+00, reprojection_error 8.366E+00, reprojection_error/init 3.127E+01}
[11/02/2021 07:16:06 pixloc.pixlib.models.two_view_refiner WARNING] NaN detected ['error', tensor([ nan, 5.0000e+01, 1.4714e-01, 2.6252e-03, 2.5921e-02, 3.2593e-02],
device='cuda:0', grad_fn=), 'loss', tensor([ nan, 0.0000, 0.0490, 0.0009, 0.0086, 0.0109], device='cuda:0',
grad_fn=)]
[W python_anomaly_mode.cpp:104] Warning: Error detected in PowBackward1. Traceback of forward call that caused the error:
File "/home/jmorlana/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jmorlana/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/jmorlana/pixloc/pixloc/pixlib/train.py", line 391, in
main_worker(0, conf, output_dir, args)
File "/home/jmorlana/pixloc/pixloc/pixlib/train.py", line 358, in main_worker
training(rank, conf, output_dir, args)
File "/home/jmorlana/pixloc/pixloc/pixlib/train.py", line 259, in training
losses = loss_fn(pred, data)
File "/home/jmorlana/pixloc/pixloc/pixlib/models/two_view_refiner.py", line 151, in loss
err = reprojection_error(T_opt).clamp(max=self.conf.clamp_error)
File "/home/jmorlana/pixloc/pixloc/pixlib/models/two_view_refiner.py", line 133, in reprojection_error
err = scaled_barron(1., 2.)(err)[0]/4
File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 81, in
return lambda x: scaled_loss(
File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 18, in scaled_loss
loss, loss_d1, loss_d2 = fn(x/a2)
File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 82, in
x, lambda y: barron_loss(y, y.new_tensor(a)), c)
File "/home/jmorlana/pixloc/pixloc/pixlib/geometry/losses.py", line 59, in barron_loss
torch.pow(x / beta_safe + 1., 0.5 * alpha) - 1.)
(function _print_stack)

  • Validation total loss jumps to 15 (previously it was 3) after the first NaN appears. All the training losses that come afterwards become NaN too.
  • My torch version is 1.7.1 and my numpy version is 1.19.5.
  • I haven't checked a different learning rate yet, I will give it a try.

Thank you!
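
As a side note, one reason the anomaly trace can point at the power operation in the loss: the backward of pow with a fractional exponent is unbounded as its base approaches zero, so a perfectly finite forward value can still yield an inf/NaN gradient, and anomaly mode then names the pow op even though the problematic value was produced earlier in the graph. A toy illustration (standalone example, unrelated to the actual pixloc tensors):

    import torch

    torch.autograd.set_detect_anomaly(True)

    x = torch.zeros(1, requires_grad=True)   # stands in for an upstream quantity hitting 0
    y = (x.pow(0.5) * 0.0).sum()             # forward pass is finite (0.0)
    y.backward()                             # backward: 0 * inf = NaN, so anomaly mode
                                             # raises a RuntimeError naming the pow backward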

Thank you for the analysis. I have reproduced the issue:


[W python_anomaly_mode.cpp:104] Warning: Error detected in MulBackward0. Traceback of forward call that caused the error:
File "pixloc/pixlib/train.py", line 417, in
main_worker(0, conf, output_dir, args)
File "pixloc/pixlib/train.py", line 383, in main_worker
training(rank, conf, output_dir, args)
File "pixloc/pixlib/train.py", line 281, in training
pred = model(data)
File ".local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "pixloc/pixloc/pixlib/models/base_model.py", line 106, in forward
return self._forward(data)
File "pixloc/pixloc/pixlib/models/two_view_refiner.py", line 117, in _forward
mask=mask, W_ref_q=W_ref_q))
File ".local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/pixloc/pixloc/pixlib/models/base_model.py", line 106, in forward
return self._forward(data)
File "pixloc/pixloc/pixlib/models/base_optimizer.py", line 97, in forward
data['cam_q'], data['mask'], data.get('W_ref_q'))
File "pixloc/pixloc/pixlib/models/learned_optimizer.py", line 78, in run
delta = optimizer_step(g, H, lambda_, mask=~failed)
File "pixloc/pixloc/pixlib/geometry/optimization.py", line 18, in optimizer_step
diag = H.diagonal(dim1=-2, dim2=-1) * lambda_

(function _print_stack)
Traceback (most recent call last):
File "pixloc/pixlib/train.py", line 417, in
main_worker(0, conf, output_dir, args)
File "pixloc/pixlib/train.py", line 383, in main_worker
training(rank, conf, output_dir, args)
File "pixloc/pixlib/train.py", line 292, in training
loss.backward()
File ".local/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File ".local/lib/python3.7/site-packages/torch/autograd/init.py", line 156, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: Function 'MulBackward0' returned nan values in its 1th output.

RuntimeError: Function 'PowBackward1' returned nan values in its 0th output #16

I believe that the issue has been addressed by 8937e29 and 0ab0e79. Can you please confirm that this helps? I will continue to investigate other sources of instability.

I tested the changed code, but I get the same error.

@angiend What dataset are you training with? At which iteration does it crash? With what version of PyTorch?

@skydes I retrained on the CMU dataset; it crashed at "E 65 | it 800" (3000 iterations per epoch), and my PyTorch version is 1.9.1.

The training has usually fully converged by epoch 20, so this should not prevent reproducing the results. Could you give PyTorch 1.7.1 a try? I have tried both 1.7.1 and 1.10.0 and both work fine.

Thanks, I have tested 3 epochs and I think this issue has been fixed.