Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Home Page: https://lightning.ai


Model loaded from checkpoint has bad accuracy

Inspirateur opened this issue

What is your question?

I have a model that I train with EarlyStopping and ModelCheckpoint on a custom metric (MAP).
The training works fine: after 2 epochs the model reaches 96% MAP. However, when I load it from the checkpoint and test it with the exact same function, the MAP is 16% (the same as an untrained model).
I must be doing something wrong, but what?

Code

import torch
import pytorch_lightning as pl
# TAPKG (the LightningModule) and Graph3D come from the project's own code.

def default_model(dataset: str):
	if torch.cuda.is_available():
		print("Using the GPU")
		device = torch.device("cuda")
	else:
		print("Using the CPU")
		device = torch.device("cpu")
	kwargs = {
		"dataset": dataset, "embed_size": 50, "depth": 3,
		"vmap": Graph3D.from_dataset(dataset).vocabulary,
		"neg_per_pos": 5, "max_paths": 255, "device": device
	}
	try:
		model = TAPKG.load_from_checkpoint("Checkpoints/epoch=2-step=612260.ckpt", **kwargs).to(device)
		return model
	except OSError as e:
		print(f"Couldn't load the save for the model, training instead. ({e.__class__.__name__})")
		model = TAPKG(**kwargs).to(device)
	cpt = pl.callbacks.ModelCheckpoint(monitor="MAP", mode="max", dirpath="Checkpoints", save_top_k=1)
	trainer = pl.Trainer(
		gpus=1,
		check_val_every_n_epoch=1,
		callbacks=[
			cpt,
			pl.callbacks.EarlyStopping(monitor="MAP", mode="max", min_delta=.002, patience=2)
		],
		auto_lr_find=True
	)
	# noinspection PyTypeChecker
	trainer.fit(model)
	print(cpt.best_model_path, cpt.best_model_score)
	return model

def eval_link_completion(dataset):
	model = default_model(dataset)
	ranks = model.link_completion_rank()
	MAP(ranks, plot=True)
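For completeness, the snippet above doesn't show how the MAP metric reaches the callbacks. For ModelCheckpoint(monitor="MAP") and EarlyStopping(monitor="MAP") to see it, the TAPKG module presumably logs it during validation, roughly as in the sketch below (the compute_map helper is hypothetical; only the self.log call is the relevant part):

import pytorch_lightning as pl

class TAPKG(pl.LightningModule):
	# ... model definition elided ...

	def validation_epoch_end(self, outputs):
		# Hypothetical helper: aggregate per-batch results into a mean average precision.
		map_value = self.compute_map(outputs)
		# Logging under the key "MAP" is what lets ModelCheckpoint(monitor="MAP")
		# and EarlyStopping(monitor="MAP") track the metric.
		self.log("MAP", map_value, prog_bar=True)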

Right after training, eval_link_completion shows a MAP of 96%; when I load the model from the checkpoint, however, it's back to 16% (a quick way to compare the in-memory and reloaded weights is sketched after the environment details below).

  • OS: Kubuntu 20.04
  • Packaging: pip
  • Version: 1.2.0
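One quick way to rule out a loading problem is to compare the weights of the freshly trained model with the weights stored in the best checkpoint. A minimal sketch, assuming it is called right after trainer.fit(model) with the cpt callback from above:

import torch

def checkpoints_match(trained_model, checkpoint_path):
	# Load the checkpoint file directly and compare its weights with the
	# weights of the model instance that was just trained in memory.
	ckpt = torch.load(checkpoint_path, map_location="cpu")
	trained_state = trained_model.state_dict()
	for name, saved_tensor in ckpt["state_dict"].items():
		if not torch.allclose(saved_tensor, trained_state[name].cpu()):
			print(f"Mismatch in parameter: {name}")
			return False
	return True

# Usage (hypothetical): print(checkpoints_match(model, cpt.best_model_path))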

I don't think the model you are returning is the trained model; it's the original model from when you first created it. Try doing
model = model.load_from_checkpoint(cpt.best_model_path) after trainer.fit() and then return that model.

Thank you for answering, but I'm not sure I understand.
Why would loading the way I do yield the original untrained model?
How would you suggest I load my model instead?
The training takes 6h, so it would be nice if I could avoid calling trainer.fit() again, but if it's mandatory for debugging I will.

Wait, after digging on my end it seems I might have a non-deterministic process throwing the evaluation off between runs. Thank you anyway, I'll come back after investigating this.
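If a non-deterministic step is the suspect, Lightning's seeding utilities can help make two runs comparable. A minimal sketch using the standard Lightning/PyTorch flags (not taken from the code above):

import pytorch_lightning as pl

# Seed Python, NumPy and PyTorch RNGs so consecutive runs start from the same state.
pl.seed_everything(42)

trainer = pl.Trainer(
	gpus=1,
	deterministic=True,  # ask PyTorch to prefer deterministic CUDA kernels
)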

Do this in the default_model function:

# noinspection PyTypeChecker
trainer.fit(model)
print(cpt.best_model_path, cpt.best_model_score)
model = model.load_from_checkpoint(cpt.best_model_path)
return model
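One nuance worth noting with that suggestion: load_from_checkpoint is a classmethod that builds and returns a new instance; it does not modify model in place, so the returned object must be kept (as the snippet above does). If the module's constructor arguments were not saved with save_hyperparameters(), they presumably need to be passed again, e.g.:

# Assuming the same kwargs dict used to construct TAPKG earlier in default_model.
model = TAPKG.load_from_checkpoint(cpt.best_model_path, **kwargs).to(device)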

Yep, I'm sorry, my loading/saving code was fine; I just had another issue somewhere. Thanks for your time.


Hi, I'm facing the same issue. Could you tell me what other potential issues could cause this?

I'm afraid I can't help you; it's been more than a year and I'd be completely unable to remember what the problem was.
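For later readers wondering about other potential causes: one frequent source of a large train/reload accuracy gap (not confirmed as the cause in this thread) is evaluating while the model is still in training mode, so dropout and batch norm behave differently than during validation. A minimal precaution before custom evaluation code, reusing the names from the snippets above:

import torch

model = TAPKG.load_from_checkpoint("Checkpoints/epoch=2-step=612260.ckpt", **kwargs)
model.eval()           # switch dropout/batch-norm layers to inference behaviour
with torch.no_grad():  # no gradients needed for evaluation
	ranks = model.link_completion_rank()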

same problem