IanTaehoonYoo / semantic-segmentation-pytorch

PyTorch implementation of FCN, UNet, PSPNet, and various encoder models.


How to make a checkpoint and save the model?

leocd91 opened this issue

Great work. Clear documentation and easy to set up.
How can I make checkpoints and continue training later?
I'm currently in a place where the electricity is unreliable.

Hi, @leocd91

Thanks for your interest in this project.
I've added instructions for saving and loading checkpoints to the README.

Saving and loading checkpoints

The Trainer class saves checkpoints automatically, controlled by the 'check_point_epoch_stride' argument: a checkpoint is written every epoch stride to the runs folder, ./segmentation/runs/models.

You can also load a checkpoint using the Logger class. Here is some example code; please refer to the snippets below.

"""
Save check point.
Please check the runs folder, ./segmentation/runs/models
"""
check_point_stride = 30 # check points are saved for every 30 epochs.

trainer = Trainer(model, optimizer, logger, num_epochs,
                      train_loader, test_loader, check_point_epoch_stride=check_point_stride)

"""
Load check point.
"""
model_name = "pspnet_mobilenet_v2"
n_classes = 33
logger = Logger(model_name="pspnet_mobilenet_v2", data_name='example')

model = all_models.model_from_name[model_name](n_classes)
logger.load_models(model, 'epoch_253')
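
Putting the two snippets together, resuming an interrupted run looks roughly like the sketch below. It assumes the data loaders, optimizer, and hyperparameters are built exactly as for the original run; whether the epoch counter continues from the checkpoint or restarts at zero depends on the Trainer, so treat this as a sketch rather than a guarantee.

"""
Resume training from a saved checkpoint (a sketch; 'epoch_253' is just the
example checkpoint name from above).
"""
model = all_models.model_from_name[model_name](n_classes)
logger = Logger(model_name=model_name, data_name='example')
logger.load_models(model, 'epoch_253')  # later renamed to load_model, see below

trainer = Trainer(model, optimizer, logger, num_epochs,
                  train_loader, test_loader,
                  check_point_epoch_stride=check_point_stride)
trainer.train()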

Hi, I tried your example and fixed a few lines.

Like:
logger.load_models(model, 'epoch_253')

to

logger = Logger(model_name=model_name, data_name='test1')
Logger.load_models(logger, model, 'epoch_240')

That fixed the missing-argument error, but the loaded model gives me this error when accessing the size, as in your predict.py example:

AttributeError: 'PSPnet' object has no attribute 'img_height'

Here's my code ...

# Model
model_name = "pspnet_resnet50"
device = 'cuda'
batch_size = 1
n_classes = 6
num_epochs = 300
image_axis_minimum_size = 200
pretrained = True
fixed_feature = False
batch_norm = False if batch_size == 1 else True
model = all_models.model_from_name[model_name](n_classes,
                                               batch_norm=batch_norm,
                                               pretrained=pretrained,
                                               fixed_feature=fixed_feature)
logger = Logger(model_name=model_name, data_name='test1')
Logger.load_models(logger, model, 'epoch_240')

# Resize the input to the size the model was trained on.
# This is the line that raises the AttributeError above.
model_width = model.img_width
model_height = model.img_height

if model_width != ori_width or model_height != ori_height:
    img = cv2.resize(img, (model_width, model_height), interpolation=cv2.INTER_NEAREST)

# HWC image -> NCHW float tensor
data = img.transpose((2, 0, 1))
data = data[None, :, :, :]
data = torch.from_numpy(data).float()

if next(model.parameters()).is_cuda:
    if not torch.cuda.is_available():
        raise ValueError("A model was trained via .cuda(), but this system can not support cuda.")
    data = data.cuda()

# Argmax over the class dimension gives the predicted label map
score = model(data)
lbl_pred = score.data.max(1)[1].cpu().numpy()[:, :, :]
lbl_pred = lbl_pred.transpose((1, 2, 0))
n_classes = np.max(lbl_pred)

Am I doing it wrong? I'm going to validate the model on another set of images/labels.

No, it was my fault. I have updated the project.
Please update to the new version and save a new checkpoint. Use this command: pip install --upgrade seg-torch
I tested the loading and saving module with all models, but if you have any problems, please let me know.

In addition, the function name has changed:

Logger.load_models(logger, model, 'epoch_240')

to

Logger.load_model(logger, model, 'epoch_240')

Thanks
Best regards,

Thank you for the response.

Still got an error when loading the checkpoint after successfully running the training.

RuntimeError: Error(s) in loading state_dict for PSPnet:
	Missing key(s) in state_dict: "PSP.spatial_blocks.0.2.weight", "PSP.spatial_blocks.0.2.bias", "PSP.spatial_blocks.0.2.running_mean", "PSP.spatial_blocks.0.2.running_var", "PSP.spatial_blocks.1.2.weight", "PSP.spatial_blocks.1.2.bias", "PSP.spatial_blocks.1.2.running_mean", "PSP.spatial_blocks.1.2.running_var", "PSP.spatial_blocks.2.2.weight", "PSP.spatial_blocks.2.2.bias", "PSP.spatial_blocks.2.2.running_mean", "PSP.spatial_blocks.2.2.running_var", "PSP.spatial_blocks.3.2.weight", "PSP.spatial_blocks.3.2.bias", "PSP.spatial_blocks.3.2.running_mean", "PSP.spatial_blocks.3.2.running_var", "PSP.bottleneck.1.weight", "PSP.bottleneck.1.bias", "PSP.bottleneck.1.running_mean", "PSP.bottleneck.1.running_var", "upsampling1.layer.1.weight", "upsampling1.layer.1.bias", "upsampling1.layer.1.running_mean", "upsampling1.layer.1.running_var", "upsampling2.layer.1.weight", "upsampling2.layer.1.bias", "upsampling2.layer.1.running_mean", "upsampling2.layer.1.running_var", "upsampling3.layer.1.weight", "upsampling3.layer.1.bias", "upsampling3.layer.1.running_mean", "upsampling3.layer.1.running_var". 

Here's my code:

model_name = "pspnet_resnet50"
device = 'cuda'
batch_size = 1
n_classes = 6
num_epochs = 22
check_point_stride = 21
image_axis_minimum_size = 200
pretrained = True
fixed_feature = True
logger = Logger(model_name=model_name, data_name='test2')

model = all_models.model_from_name[model_name](n_classes)
logger.load_model(model, 'epoch_21')

model.to(device)

# Loader
compose = transforms.Compose([
    Rescale(image_axis_minimum_size),
    ToTensor()
])

train_datasets = SegmentationDataset(train_images, train_labeled, n_classes, compose)
train_loader = torch.utils.data.DataLoader(train_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

test_datasets = SegmentationDataset(test_images, test_labeled, n_classes, compose)
test_loader = torch.utils.data.DataLoader(test_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

trainer = Trainer(model, optimizer, logger, num_epochs, train_loader, test_loader, check_point_epoch_stride=check_point_stride)
trainer.train()

Could you check the Logger's arguments?
When you load the checkpoint, 'model_name' and 'data_name' should be the same as when you trained the model.
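
If you want to double-check, you can look for the run folder on disk before loading. This is only a quick sanity check; the directory layout below is an assumption based on the ./segmentation/runs/models path mentioned earlier, not a documented part of the Logger API.

import os

# Hypothetical sanity check: look for run folders matching the names that
# were passed to Logger. The layout under ./segmentation/runs is assumed
# from the ./segmentation/runs/models path mentioned above.
model_name, data_name = "pspnet_resnet50", "test2"
runs_root = os.path.join("segmentation", "runs")
if os.path.isdir(runs_root):
    matches = [d for d in os.listdir(runs_root)
               if model_name in d and data_name in d]
    print(matches or "no matching run folder found")
else:
    print("runs folder not found:", runs_root)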

Yeah, it's the same.

Strangely, after I used this code instead, it's working 🤣

    model_name = "pspnet_resnet50"
    device = 'cuda'
    batch_size = 1
    n_classes = 6
    num_epochs = 22
    check_point_stride = 21
    image_axis_minimum_size = 200
    pretrained = True
    fixed_feature = True

    logger = Logger(model_name=model_name, data_name='test3')

    # Loader
    compose = transforms.Compose([
        Rescale(image_axis_minimum_size),
        ToTensor()
    ])

    train_datasets = SegmentationDataset(train_images, train_labeled, n_classes, compose)
    train_loader = torch.utils.data.DataLoader(train_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

    test_datasets = SegmentationDataset(test_images, test_labeled, n_classes, compose)
    test_loader = torch.utils.data.DataLoader(test_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

    # Model
    batch_norm = False if batch_size == 1 else True
    model = all_models.model_from_name[model_name](n_classes,
                                                   batch_norm=batch_norm,
                                                   pretrained=pretrained,
                                                   fixed_feature=fixed_feature)
    logger.load_model(model, 'epoch_21')
    model.to(device)

Thank you.

I guess the batch_norm setting you used when rebuilding the model for loading didn't match the one used at training time.
The network layers change depending on batch_norm, so the state_dict keys no longer line up. Sorry for the error...
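
If it happens again, a quick way to see which side is out of sync is to diff the checkpoint keys against the model's keys before loading. A minimal sketch, assuming the saved file holds a plain state_dict that torch.load can read; the checkpoint path here is hypothetical, so point it at whatever the Logger wrote under ./segmentation/runs/models.

import torch

# Hypothetical path; use the actual file the Logger saved for your run,
# and note this assumes the file stores a plain state_dict.
state = torch.load('segmentation/runs/models/epoch_21.pth', map_location='cpu')

model_keys = set(model.state_dict().keys())
ckpt_keys = set(state.keys())

# Keys the model expects but the checkpoint lacks (e.g. batch-norm buffers
# when the checkpoint was trained with batch_norm=False):
print(sorted(model_keys - ckpt_keys))
# Keys the checkpoint has but the model does not expect:
print(sorted(ckpt_keys - model_keys))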

Thanks,