ModelCheckpoint: Using save_top_k, only the first k models are stored, not the best k models

gboeer opened this issue · comments

Bug description

From the documentation, I got the impression that the save_top_k argument of the ModelCheckpoint callback would cause the best k models to be stored, according to the monitored value. However, in my experiments only the first 3 models (from epochs 0, 1, 2) are stored and nothing afterward. I made sure that the monitored value is indeed higher for later epochs, which I can see clearly in the logged metrics.csv.

So either this is a bug or I have simply misunderstood the meaning of this parameter.

What version are you seeing the problem on?


How to reproduce the bug

import torch
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(save_top_k=3, monitor="val_accuracy")
trainer = L.Trainer(accelerator='gpu', devices=[0], log_every_n_steps=10, callbacks=[checkpoint_callback])

# val function of my LightningModule
def validation_step(self, batch, batch_idx):
    inputs, labels = batch
    outputs = self.model(inputs)
    loss = self.val_criterion(outputs, labels)
    _, predictions = torch.max(outputs, 1)
    # fraction of correctly classified samples in this batch
    val_accuracy = torch.sum(predictions == labels).item() / labels.size(0)
    self.log('val_loss', loss)
    self.log('val_accuracy', val_accuracy)
    return loss

Error messages and logs

Current environment
  • System:
    - OS: Linux
    - architecture: 64bit, ELF
    - processor: x86_64
    - python: 3.8.10
    - release: 4.15.0-213-generic
    - version: #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023

More info

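It looks like you need to set ModelCheckpoint(save_top_k=3, monitor="val_accuracy", mode="max") so that the checkpoints with the highest accuracy are kept. The default is mode="min", which is meant for monitoring a loss. With mode="min" and an accuracy that rises every epoch, the first three epochs have the lowest values, so they are treated as the "best" and no later checkpoint replaces them.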
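To make the effect of mode concrete, here is a small pure-Python sketch of top-k selection. It is a simplified simulation, not Lightning's actual implementation: the helper keep_top_k and the accuracy values are made up for illustration, and it only assumes that a new checkpoint replaces the worst kept one when it scores better under the chosen mode.

```python
def keep_top_k(scores, k, mode="min"):
    """Return the epochs whose checkpoints would be kept.

    Hypothetical helper for illustration only; it mimics the idea of
    save_top_k with a monitored value, not Lightning's real code.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")

    kept = {}  # epoch -> monitored score
    for epoch, score in enumerate(scores):
        if len(kept) < k:
            # Fewer than k checkpoints stored so far: always keep.
            kept[epoch] = score
            continue

        if mode == "min":
            # Lower is better, so the worst kept entry has the highest score.
            worst_epoch = max(kept, key=kept.get)
            is_better = score < kept[worst_epoch]
        else:
            # Higher is better, so the worst kept entry has the lowest score.
            worst_epoch = min(kept, key=kept.get)
            is_better = score > kept[worst_epoch]

        if is_better:
            del kept[worst_epoch]
            kept[epoch] = score

    return sorted(kept)


# Made-up accuracies that improve every epoch, as in the report above.
val_accuracy = [0.52, 0.61, 0.68, 0.73, 0.79, 0.84]

print(keep_top_k(val_accuracy, k=3, mode="min"))  # [0, 1, 2]
print(keep_top_k(val_accuracy, k=3, mode="max"))  # [3, 4, 5]
```

With mode="min" the first three epochs are never replaced, which matches the behaviour described in the report. With mode="max" the last three epochs are kept instead.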

Ohh, you're absolutely right. Guess the default is meant for using the loss. Stupid me.