Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Home Page: https://lightning.ai

trainer.fit from a checkpoint without performance improvement breaks the 'last' checkpoint link on Windows 11

workhours opened this issue

Bug description

As the title says: when training a model on Windows 11, pass a checkpoint callback to the trainer and leave ckpt_path as None, as in the code below, then fit the model. Lightning creates the 'last' link to the checkpoint file correctly.
Then train the same model again, this time loading it from ckpt_path, and arrange for the monitored metric not to improve during fitting. After training finishes, the 'last' link is broken.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',  # metric to monitor
    dirpath='training/checkpoints/',  # directory where checkpoints are saved
    filename=experiment_name+'-{epoch}-{val_loss:.3f}',  # checkpoint filename format
    save_top_k=1,  # keep only the single best model
    mode='min',  # it is a loss, so lower is better
    save_last='link',
    save_on_train_epoch_end=True,
    every_n_epochs=5
)
...
trainer.fit(model, ckpt_path=None if initial else 'last')
trainer.test(model)
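
To make "the link is wrong" easier to pin down, a small check like the one below can be run after each training run. It is only a diagnostic sketch, not part of Lightning: the helper name describe_last_link is hypothetical, and it assumes the 'last' entry is written as last.ckpt inside the dirpath passed to ModelCheckpoint. It reports whether that entry is a symlink, a regular file, or missing, and whether a symlink's target still exists.

```python
import os
from pathlib import Path


def describe_last_link(dirpath, link_name="last.ckpt"):
    """Describe the 'last' checkpoint entry in dirpath (hypothetical helper).

    Assumption: the entry is named last.ckpt; pass link_name to override.
    """
    link = Path(dirpath) / link_name
    info = {"path": str(link), "kind": "missing", "target": None, "target_exists": False}
    if link.is_symlink():
        info["kind"] = "symlink"
        raw_target = os.readlink(link)
        info["target"] = raw_target
        # A relative target is resolved against the directory that holds the link.
        resolved = Path(raw_target) if os.path.isabs(raw_target) else link.parent / raw_target
        info["target_exists"] = resolved.exists()
    elif link.is_file():
        # A regular file (a copy rather than a link) is its own target.
        info["kind"] = "file"
        info["target_exists"] = True
    return info


if __name__ == "__main__":
    # Directory taken from the reproduction code in this issue.
    print(describe_last_link("training/checkpoints/"))
```

Comparing the output after the first run with the output after the resumed run shows whether the entry changed kind, points at a different file, or points at a file that no longer exists.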

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response