Model loaded from checkpoint has bad accuracy
Inspirateur opened this issue · comments
What is your question?
I have a model that I train with EarlyStopping and ModelCheckpoint on a custom metric (MAP).
The training works fine: after 2 epochs the model reaches 96% MAP. However, when I load it and test it with the exact same function, the MAP is 16% (the same as an untrained model).
I must be doing something wrong, but what?
Code
```python
def default_model(dataset: str):
    if torch.cuda.is_available():
        print("Using the GPU")
        device = torch.device("cuda")
    else:
        print("Using the CPU")
        device = torch.device("cpu")
    kwargs = {
        "dataset": dataset, "embed_size": 50, "depth": 3,
        "vmap": Graph3D.from_dataset(dataset).vocabulary,
        "neg_per_pos": 5, "max_paths": 255, "device": device
    }
    try:
        model = TAPKG.load_from_checkpoint("Checkpoints/epoch=2-step=612260.ckpt", **kwargs).to(device)
        return model
    except OSError as e:
        print(f"Couldn't load the save for the model, training instead. ({e.__class__.__name__})")
        model = TAPKG(**kwargs).to(device)
        cpt = pl.callbacks.ModelCheckpoint(monitor="MAP", mode="max", dirpath="Checkpoints", save_top_k=1)
        trainer = pl.Trainer(
            gpus=1,
            check_val_every_n_epoch=1,
            callbacks=[
                cpt,
                pl.callbacks.EarlyStopping(monitor="MAP", mode="max", min_delta=.002, patience=2)
            ],
            auto_lr_find=True
        )
        # noinspection PyTypeChecker
        trainer.fit(model)
        print(cpt.best_model_path, cpt.best_model_score)
        return model


def eval_link_completion(dataset):
    model = default_model(dataset)
    ranks = model.link_completion_rank()
    MAP(ranks, plot=True)
```
def eval_link_completion(dataset):
model = default_model(dataset)
ranks = model.link_completion_rank()
MAP(ranks, plot=True)
Right after training, eval_link_completion shows a MAP of 96%; when I load the model, however, it's back to 16%.
- OS: Kubuntu 20.04
- Packaging: pip
- Version: 1.2.0
I don't think the model you are returning is the trained model; it's the original model from when you first created it. Try doing

```python
model = model.load_from_checkpoint(cpt.best_model_path)
```

after trainer.fit() and then return that model.
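The underlying point: in PyTorch Lightning, ModelCheckpoint writes the best weights to disk, but trainer.fit() does not restore them into the in-memory model afterwards, so the object you hold contains the last-epoch weights, not the best ones. Here is a minimal, framework-free sketch of that pattern; TinyModel, the toy fit loop, and the hard-coded (weight, score) pairs are hypothetical stand-ins for a LightningModule and real training, not part of the original code:

```python
import copy


class TinyModel:
    """Stand-in for a LightningModule: its 'weights' are just one number."""
    def __init__(self):
        self.weight = 0.0  # untrained state


def fit(model, checkpoint):
    """Toy training loop. Like ModelCheckpoint, it persists a copy of the
    best-scoring weights, but the in-memory model keeps moving afterwards."""
    for weight, score in [(0.5, 0.96), (0.9, 0.16)]:
        model.weight = weight
        if score > checkpoint.get("score", -1.0):
            checkpoint["score"] = score
            checkpoint["weight"] = copy.deepcopy(model.weight)


checkpoint = {}
model = TinyModel()
fit(model, checkpoint)

# The in-memory model holds the *last* weights, not the best ones...
print(model.weight)              # -> 0.9 (last epoch)
# ...so the best state must be restored from the checkpoint explicitly.
model.weight = checkpoint["weight"]
print(model.weight)              # -> 0.5 (best epoch)
```

This is why reloading from `cpt.best_model_path` after fitting can change the evaluation result: without the explicit reload you are scoring the last state, not the checkpointed best one.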
Thank you for answering, but I'm not sure I understand.
Why would loading like I do yield the original untrained model?
How would you suggest that I load my model instead?
The training takes 6h so it would be nice if I could avoid calling trainer.fit() again, but if it's mandatory for debugging I will.
Wait, after digging on my end it seems I might have a non-deterministic process throwing the evaluation off between runs. Thank you anyway; I'll come back after investigating this.
Do this in the default_model function:

```python
# noinspection PyTypeChecker
trainer.fit(model)
print(cpt.best_model_path, cpt.best_model_score)
model = model.load_from_checkpoint(cpt.best_model_path)
return model
```
Yep, I'm sorry: my loading/saving code was fine, I just had another issue somewhere. Thanks for your time.
Hi, I'm facing the same issue. Could you tell me what other potential issues could cause this?
I'm afraid I can't help you; it's been more than a year and I'd be completely unable to remember what the problem was.
same problem