DeNA / PyTorch_YOLOv3

Implementation of YOLOv3 in PyTorch



I have a question about resuming training

deeppower opened this issue · comments

Thanks for your great work.
I want to resume training from snapshot.ckpt, so I loaded the checkpoint and changed the starting iteration.
However, the following error occurred:
```
  line 190, in forward
    loss_xy = bceloss(output[..., :2], target[..., :2])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 504, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2027, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: reduce failed to synchronize: device-side assert triggered
```
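
Roughly, the resume logic looks like this (a minimal sketch; the checkpoint key names and the helper shown are hypothetical and may not match the snapshot format actually written by this repo's train.py):

```python
import torch

def load_resume_state(ckpt_path, model, optimizer):
    """Restore model/optimizer state and return the iteration to resume from.

    The checkpoint keys below ("model_state_dict", "optimizer_state_dict",
    "iter") are assumptions for illustration only.
    """
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["iter"] + 1  # continue counting from the saved iteration
```

The training loop then starts from the returned iteration instead of 0.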

Thank you for using our repo!

Did you change anything from the previous settings?
If yes, I recommend resuming with LR burn-in (see the sketch below).
If no, I think it should work... please try it a few more times and see whether the error persists.
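
For reference, the device-side assert in binary_cross_entropy is typically triggered when the predictions fall outside [0, 1] or become NaN, which can happen if the resumed model briefly diverges; warming the learning rate back up avoids that. A minimal sketch of such a burn-in schedule (the constants and the helper name are assumptions based on common YOLOv3 configs, not this repo's exact implementation):

```python
def burnin_learning_rate(base_lr, iter_i, burn_in=1000, power=4):
    """Darknet-style LR burn-in (warm-up).

    Ramps the learning rate from ~0 up to base_lr over the first `burn_in`
    iterations; afterwards the normal schedule applies. The burn_in length
    and the quartic ramp are assumed values and may differ from this repo's
    config file.
    """
    if iter_i < burn_in:
        return base_lr * (iter_i / burn_in) ** power
    return base_lr


# Hypothetical training-loop usage: set the warmed-up LR each step,
# counting the ramp relative to the resume point.
# for iter_i in range(start_iter, start_iter + max_iter):
#     for group in optimizer.param_groups:
#         group["lr"] = burnin_learning_rate(base_lr, iter_i - start_iter)
```

Note that when resuming, the ramp only takes effect if it is computed relative to the resume iteration (as in the commented usage); if the absolute iteration counter is already past burn_in, the warm-up is a no-op.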

@hirotomusiker
Thanks! Resuming with LR burn-in worked.