Differentiate testing multiple sets/models when logging
leleogere opened this issue · comments
Description & Motivation
In my use case, I need to evaluate my trained model twice, on two different test sets, at the end of training:
trainer.test(model, dataloaders=test_dataloader1)
trainer.test(model, dataloaders=test_dataloader2)
However, both scores are logged with the same key (I'm using the wandb logger), so they are merged into a single metric. I can always retrieve the two values separately through the wandb API, but in the wandb UI it's not easy (if even possible) to see and compare them.
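For concreteness, here is a minimal sketch of the kind of test_step that triggers the collision (the accuracy attribute and the test/acc key are illustrative placeholders, not my actual code):

# Sketch of the current situation: the key is hard-coded, so every
# trainer.test() call logs to the same wandb metric and the runs get merged.
def test_step(self, batch, batch_idx):
    y_pred = self.forward(batch["x"])
    y_true = batch["y"]
    acc = self.accuracy(y_true, y_pred)
    self.log("test/acc", acc)  # identical key for both test sets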
This is also a problem when trying to evaluate two different checkpoints:
trainer.test(model, dataloaders=test_dataloader, ckpt_path="last")
trainer.test(model, dataloaders=test_dataloader, ckpt_path="best")
Pitch
Ideally, it would be handy to allow Trainer.test (and maybe also fit, validate, and predict) to accept keyword arguments that would be passed directly to LightningModule.test_step and LightningModule.on_test_epoch_end.
This would let the user manage the logging process based on their own arguments:
# Training script
trainer.test(model, dataloaders=test_dataloader1, name="test1")
trainer.test(model, dataloaders=test_dataloader2, name="test2")

# LightningModule
def test_step(self, batch, batch_idx, name="test"):
    y_pred = self.forward(batch["x"])
    y_true = batch["y"]
    acc = self.accuracy(y_true, y_pred)
    self.log(f"{name}/acc", acc)
This would result in the scores being logged to test1/acc and test2/acc, making it easy to differentiate them in the wandb UI and in the logs.
Alternatives
For the case of multiple test sets, one could first merge them and pass them as a single dataloader (see the sketch below). However, this prevents comparing the performance on each individual dataset.
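A minimal sketch of that merge, assuming both test sets are plain map-style datasets (test_dataset1, test_dataset2 and the batch size are placeholders):

from torch.utils.data import ConcatDataset, DataLoader

# Concatenate the two test sets and evaluate them as a single dataloader.
# The logged metric is then an aggregate over both sets, so per-dataset scores are lost.
merged_dataset = ConcatDataset([test_dataset1, test_dataset2])
merged_dataloader = DataLoader(merged_dataset, batch_size=32)
trainer.test(model, dataloaders=merged_dataloader)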
Additional context
No response
cc @Borda