RuntimeError with CUDA assertion failure when resuming model training from checkpoint
fancling opened this issue
I encountered a RuntimeError with an internal assertion failure when trying to resume training of a custom model from a checkpoint:
RuntimeError: t == DeviceType::CUDAINTERNAL ASSERT FAILED at "../c10/cuda/impl/CUDAGuardImpl.h":24, please report a bug to PyTorch.
This error occurred during the execution of an estimate_loss() function, which runs before the actual model training resumes on CUDA. It seems to be triggered whenever the iteration number is a multiple of 2000, i.e. when the periodic evaluation step fires.
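For context, here is a minimal sketch of the pattern that can leave losses["val"] on the CPU even though the model runs on CUDA; the model, data, and eval_iters below are placeholders, not the actual script:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 1).to(device)
loss_fn = torch.nn.MSELoss()

@torch.no_grad()
def estimate_loss(eval_iters=4):
    # Called whenever iter_num is a multiple of the evaluation interval
    # (2000 in this case). Losses are accumulated via .item() into a tensor
    # created on the default device, so the result lives on the CPU.
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)               # CPU tensor
        for k in range(eval_iters):
            x = torch.randn(16, 8, device=device)
            y = torch.randn(16, 1, device=device)
            losses[k] = loss_fn(model(x), y).item()    # .item() detaches to CPU
        out[split] = losses.mean()                     # CPU tensor
    return out

losses = estimate_loss()
print(losses["val"].device)  # cpu
```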
I am willing to assist in resolving this issue if I can be of any help.
Update on the issue
After further investigation, I've identified the source of the problem that leads to the assertion failure:
The comparison operation if losses["val"] < best_val_loss or always_save_checkpoint: fails because losses["val"] lives on the CPU, while best_val_loss is loaded from the checkpoint directly onto the CUDA device by checkpoint = torch.load(ckpt_path, map_location=device).
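A minimal sketch of the mismatch and one possible workaround, assuming the checkpoint is loaded with map_location=device as described above; the concrete values are illustrative only:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# best_val_loss comes out of torch.load(ckpt_path, map_location=device),
# so after resuming it can be a CUDA tensor ...
best_val_loss = torch.tensor(2.5, device=device)

# ... while estimate_loss() returns CPU tensors.
losses = {"val": torch.tensor(2.3)}
always_save_checkpoint = False

# Converting both sides to plain Python floats (or moving one tensor with
# .to(device)) keeps the comparison device-agnostic and avoids mixing a
# CPU tensor with a CUDA tensor in the checkpoint logic.
if float(losses["val"]) < float(best_val_loss) or always_save_checkpoint:
    best_val_loss = float(losses["val"])
    print("new best validation loss:", best_val_loss)
```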