Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Home Page: https://lightning.ai


Model loaded from checkpoint has bad accuracy

Inspirateur opened this issue

What is your question?

I have a model that I train with EarlyStopping and ModelCheckpoint on a custom metric (MAP).
The training works fine: after 2 epochs the model reaches 96% MAP. However, when I load it from the checkpoint and test it with the exact same function, the MAP is 16% (the same as an untrained model).
I must be doing something wrong, but what?

Code

import torch
import pytorch_lightning as pl
# TAPKG (the LightningModule) and Graph3D come from the project's own code.

def default_model(dataset: str):
	if torch.cuda.is_available():
		print("Using the GPU")
		device = torch.device("cuda")
	else:
		print("Using the CPU")
		device = torch.device("cpu")
	kwargs = {
		"dataset": dataset, "embed_size": 50, "depth": 3,
		"vmap": Graph3D.from_dataset(dataset).vocabulary,
		"neg_per_pos": 5, "max_paths": 255, "device": device
	}
	try:
		model = TAPKG.load_from_checkpoint("Checkpoints/epoch=2-step=612260.ckpt", **kwargs).to(device)
		return model
	except OSError as e:
		print(f"Couldn't load the save for the model, training instead. ({e.__class__.__name__})")
		model = TAPKG(**kwargs).to(device)
	cpt = pl.callbacks.ModelCheckpoint(monitor="MAP", mode="max", dirpath="Checkpoints", save_top_k=1)
	trainer = pl.Trainer(
		gpus=1,
		check_val_every_n_epoch=1,
		callbacks=[
			cpt,
			pl.callbacks.EarlyStopping(monitor="MAP", mode="max", min_delta=.002, patience=2)
		],
		auto_lr_find=True
	)
	# noinspection PyTypeChecker
	trainer.fit(model)
	print(cpt.best_model_path, cpt.best_model_score)
	return model

def eval_link_completion(dataset):
	model = default_model(dataset)
	ranks = model.link_completion_rank()
	MAP(ranks, plot=True)
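For completeness, the snippet above doesn't show how the MAP metric reaches the callbacks. For ModelCheckpoint(monitor="MAP") and EarlyStopping(monitor="MAP") to see it, the TAPKG module presumably logs it during validation, roughly as in the sketch below (the compute_map helper is hypothetical; only the self.log call is the relevant part):

import pytorch_lightning as pl

class TAPKG(pl.LightningModule):
	# ... model definition elided ...

	def validation_epoch_end(self, outputs):
		# Hypothetical helper: aggregate per-batch results into a mean average precision.
		map_value = self.compute_map(outputs)
		# Logging under the key "MAP" is what lets ModelCheckpoint(monitor="MAP")
		# and EarlyStopping(monitor="MAP") track the metric.
		self.log("MAP", map_value, prog_bar=True)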

Right after training, eval_link_completion shows a MAP of 96%; when I load the model from the checkpoint, however, it's back to 16% (a quick way to compare the in-memory and reloaded weights is sketched after the environment details below).

  • OS: Kubuntu 20.04
  • Packaging: pip
  • Version: 1.2.0
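One quick way to rule out a loading problem is to compare the weights of the freshly trained model with the weights stored in the best checkpoint. A minimal sketch, assuming it is called right after trainer.fit(model) with the cpt callback from above:

import torch

def checkpoints_match(trained_model, checkpoint_path):
	# Load the checkpoint file directly and compare its weights with the
	# weights of the model instance that was just trained in memory.
	ckpt = torch.load(checkpoint_path, map_location="cpu")
	trained_state = trained_model.state_dict()
	for name, saved_tensor in ckpt["state_dict"].items():
		if not torch.allclose(saved_tensor, trained_state[name].cpu()):
			print(f"Mismatch in parameter: {name}")
			return False
	return True

# Usage (hypothetical): print(checkpoints_match(model, cpt.best_model_path))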

I don't think the model you are returning is the trained model; it's the original model from when you first created it. Try doing
model = model.load_from_checkpoint(cpt.best_model_path) after trainer.fit() and then return that model.

Thank you for answering, but I'm not sure I understand.
Why would loading the way I do yield the original untrained model?
How would you suggest I load my model instead?
The training takes 6h, so it would be nice if I could avoid calling trainer.fit() again, but if it's mandatory for debugging I will.

Wait, after digging on my end it seems I might have a non-deterministic process throwing the evaluation off between runs. Thank you anyway, I'll come back after investigating this.
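If a non-deterministic step is the suspect, Lightning's seeding utilities can help make two runs comparable. A minimal sketch using the standard Lightning/PyTorch flags (not taken from the code above):

import pytorch_lightning as pl

# Seed Python, NumPy and PyTorch RNGs so consecutive runs start from the same state.
pl.seed_everything(42)

trainer = pl.Trainer(
	gpus=1,
	deterministic=True,  # ask PyTorch to prefer deterministic CUDA kernels
)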

Do this in the default_model function:

# noinspection PyTypeChecker
trainer.fit(model)
print(cpt.best_model_path, cpt.best_model_score)
model = model.load_from_checkpoint(cpt.best_model_path)
return model
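One nuance worth noting with that suggestion: load_from_checkpoint is a classmethod that builds and returns a new instance; it does not modify model in place, so the returned object must be kept (as the snippet above does). If the module's constructor arguments were not saved with save_hyperparameters(), they presumably need to be passed again, e.g.:

# Assuming the same kwargs dict used to construct TAPKG earlier in default_model.
model = TAPKG.load_from_checkpoint(cpt.best_model_path, **kwargs).to(device)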

Yep, I'm sorry, my loading/saving code was fine; I just had another issue somewhere. Thanks for your time.


Hi, I'm facing the same issue. Could you tell me what other potential issues could cause this?

I'm afraid I can't help you; it's been more than a year and I'd be completely unable to remember what the problem was.
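For later readers wondering about other potential causes: one frequent source of a large train/reload accuracy gap (not confirmed as the cause in this thread) is evaluating while the model is still in training mode, so dropout and batch norm behave differently than during validation. A minimal precaution before custom evaluation code, reusing the names from the snippets above:

import torch

model = TAPKG.load_from_checkpoint("Checkpoints/epoch=2-step=612260.ckpt", **kwargs)
model.eval()           # switch dropout/batch-norm layers to inference behaviour
with torch.no_grad():  # no gradients needed for evaluation
	ranks = model.link_completion_rank()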

same problem