trainer.fit from a checkpoint without performance improvement breaks the 'last' link to the checkpoint on Windows 11
workhours opened this issue · comments
workhours commented
Bug description
As the title says: train a model on Windows 11, passing a checkpoint callback to the trainer and keeping ckpt_path as None, as in the code below. After fitting the model, Lightning creates the 'last' link to the checkpoint file correctly.
Then train the same model again, this time loading it from ckpt_path, and arrange for the monitored metric not to improve during fitting. Once training is done, the 'last' link is wrong.
What version are you seeing the problem on?
v2.2
How to reproduce the bug
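from lightning.pytorch.callbacks import ModelCheckpoint  # or pytorch_lightning.callbacks

checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',  # metric to monitor
    dirpath='training/checkpoints/',  # directory where checkpoints are saved
    filename=experiment_name+'-{epoch}-{val_loss:.3f}',  # checkpoint filename format
    save_top_k=1,  # keep only the single best model
    mode='min',  # it is a loss, so lower is better
    save_last='link',
    save_on_train_epoch_end=True,
    every_n_epochs=5
)
...
# First run: initial is True, so ckpt_path=None. Second run: resume from 'last'.
trainer.fit(model, ckpt_path=None if initial else 'last')
trainer.test(model)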
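To make "the 'last' link becomes wrong" easier to pin down, a small stdlib-only check like the one below could be run after each of the two training runs. It is only a sketch: the helper name `describe_last_link` is hypothetical, and it assumes the link is written as `last.ckpt` inside `dirpath`. It reports whether that entry is a symlink, a regular file, or missing, and whether the link target still exists.

```python
import os
from pathlib import Path


def describe_last_link(dirpath, name="last.ckpt"):
    """Report what `name` inside `dirpath` is and whether it resolves to a file.

    Hypothetical helper; `name` defaults to "last.ckpt", which is an assumption
    about how the 'last' link is named on disk.
    """
    last = Path(dirpath) / name
    info = {"path": str(last), "kind": "missing", "target": None, "target_exists": False}
    if last.is_symlink():
        info["kind"] = "symlink"
        # Raw link target as stored on disk (may be a relative path).
        info["target"] = os.readlink(last)
        # Path.exists() follows the link, so it is False for a dangling link.
        info["target_exists"] = last.exists()
    elif last.is_file():
        # A regular file, e.g. a copy rather than a link.
        info["kind"] = "file"
        info["target_exists"] = True
    return info


if __name__ == "__main__":
    print(describe_last_link("training/checkpoints/"))
```

Comparing the output after the first run and after the resumed run should show whether the link ends up dangling, pointing at a different file, or replaced by a regular file.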
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response