RuntimeError with CUDA assertion failure when resuming model training from checkpoint
fancling opened this issue
I encountered a RuntimeError with an internal assertion failure when trying to resume training of a custom model from a checkpoint:
RuntimeError: t == DeviceType::CUDAINTERNAL ASSERT FAILED at "../c10/cuda/impl/CUDAGuardImpl.h":24, please report a bug to PyTorch.
This error occurred during the execution of an estimate_loss() function, which runs before the actual model training resumes on CUDA. It seems to be triggered whenever the iteration number is a multiple of 2000, i.e. when the periodic evaluation step fires.
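For context, here is a minimal sketch of the pattern that can leave losses["val"] on the CPU even though the model runs on CUDA; the model, data, and eval_iters below are placeholders, not the actual script:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 1).to(device)
loss_fn = torch.nn.MSELoss()

@torch.no_grad()
def estimate_loss(eval_iters=4):
    # Called whenever iter_num is a multiple of the evaluation interval
    # (2000 in this case). Losses are accumulated via .item() into a tensor
    # created on the default device, so the result lives on the CPU.
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)               # CPU tensor
        for k in range(eval_iters):
            x = torch.randn(16, 8, device=device)
            y = torch.randn(16, 1, device=device)
            losses[k] = loss_fn(model(x), y).item()    # .item() detaches to CPU
        out[split] = losses.mean()                     # CPU tensor
    return out

losses = estimate_loss()
print(losses["val"].device)  # cpu
```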
I am willing to assist in resolving this issue if I can be of any help.
Update on the issue
After further investigation, I've identified the source of the problem that leads to the assertion failure:
The comparison operation if losses["val"] < best_val_loss or always_save_checkpoint: fails because losses["val"] lives on the CPU, while best_val_loss is loaded from the checkpoint directly onto the CUDA device by checkpoint = torch.load(ckpt_path, map_location=device).
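A minimal sketch of the mismatch and one possible workaround, assuming the checkpoint is loaded with map_location=device as described above; the concrete values are illustrative only:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# best_val_loss comes out of torch.load(ckpt_path, map_location=device),
# so after resuming it can be a CUDA tensor ...
best_val_loss = torch.tensor(2.5, device=device)

# ... while estimate_loss() returns CPU tensors.
losses = {"val": torch.tensor(2.3)}
always_save_checkpoint = False

# Converting both sides to plain Python floats (or moving one tensor with
# .to(device)) keeps the comparison device-agnostic and avoids mixing a
# CPU tensor with a CUDA tensor in the checkpoint logic.
if float(losses["val"]) < float(best_val_loss) or always_save_checkpoint:
    best_val_loss = float(losses["val"])
    print("new best validation loss:", best_val_loss)
```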