IanTaehoonYoo / semantic-segmentation-pytorch

PyTorch implementation of FCN, UNet, PSPNet, and various encoder models.


How to make a checkpoint and save the model?

leocd91 opened this issue

Great work. Clear documentation and easy to set up.
How can I make checkpoints and continue training later?
I'm currently in a place where the electricity is unreliable.

Hi, @leocd91

Thanks for your interest in this project.
I've added instructions for saving and loading checkpoints to the README.

Saving and loading checkpoints

The Trainer class saves checkpoints automatically, controlled by the 'check_point_epoch_stride' argument: a checkpoint is written every epoch stride to the runs folder, ./segmentation/runs/models.

You can also load a checkpoint using the Logger class. Here is some example code; please refer to the snippets below.

"""
Save check point.
Please check the runs folder, ./segmentation/runs/models
"""
check_point_stride = 30 # check points are saved for every 30 epochs.

trainer = Trainer(model, optimizer, logger, num_epochs,
                      train_loader, test_loader, check_point_epoch_stride=check_point_stride)

"""
Load check point.
"""
model_name = "pspnet_mobilenet_v2"
n_classes = 33
logger = Logger(model_name="pspnet_mobilenet_v2", data_name='example')

model = all_models.model_from_name[model_name](n_classes)
logger.load_models(model, 'epoch_253')
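
Putting the two snippets together, resuming an interrupted run looks roughly like the sketch below. It assumes the data loaders, optimizer, and hyperparameters are built exactly as for the original run; whether the epoch counter continues from the checkpoint or restarts at zero depends on the Trainer, so treat this as a sketch rather than a guarantee.

"""
Resume training from a saved checkpoint (a sketch; 'epoch_253' is just the
example checkpoint name from above).
"""
model = all_models.model_from_name[model_name](n_classes)
logger = Logger(model_name=model_name, data_name='example')
logger.load_models(model, 'epoch_253')  # later renamed to load_model, see below

trainer = Trainer(model, optimizer, logger, num_epochs,
                  train_loader, test_loader,
                  check_point_epoch_stride=check_point_stride)
trainer.train()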

Hi, I tried your example and fixed a few lines.

Like:
logger.load_models(model, 'epoch_253')

to

logger = Logger(model_name=model_name, data_name='test1')
Logger.load_models(logger, model, 'epoch_240')

That fixed the missing-argument error, but the loaded model gives me this error when accessing the size, as in your predict.py example:

AttributeError: 'PSPnet' object has no attribute 'img_height'

Here's my code ...

# Model
model_name = "pspnet_resnet50"
device = 'cuda'
batch_size = 1
n_classes = 6
num_epochs = 300
image_axis_minimum_size = 200
pretrained = True
fixed_feature = False
batch_norm = False if batch_size == 1 else True
model = all_models.model_from_name[model_name](n_classes,
                                               batch_norm=batch_norm,
                                               pretrained=pretrained,
                                               fixed_feature=fixed_feature)
logger = Logger(model_name=model_name, data_name='test1')
Logger.load_models(logger, model, 'epoch_240')

# Resize the input to the size the model was trained on.
# This is the line that raises the AttributeError above.
model_width = model.img_width
model_height = model.img_height

if model_width != ori_width or model_height != ori_height:
    img = cv2.resize(img, (model_width, model_height), interpolation=cv2.INTER_NEAREST)

# HWC image -> NCHW float tensor
data = img.transpose((2, 0, 1))
data = data[None, :, :, :]
data = torch.from_numpy(data).float()

if next(model.parameters()).is_cuda:
    if not torch.cuda.is_available():
        raise ValueError("A model was trained via .cuda(), but this system can not support cuda.")
    data = data.cuda()

# Argmax over the class dimension gives the predicted label map
score = model(data)
lbl_pred = score.data.max(1)[1].cpu().numpy()[:, :, :]
lbl_pred = lbl_pred.transpose((1, 2, 0))
n_classes = np.max(lbl_pred)

Am I doing it wrong? I'm going to validate the model on another set of images/labels.

No, it was my fault. I have updated the project.
Please update to the new version and save a new checkpoint. Use this command: pip install --upgrade seg-torch
I tested the loading and saving module with all models, but if you have any problems, please let me know.

In addition, the function name has changed:

Logger.load_models(logger, model, 'epoch_240')

to

Logger.load_model(logger, model, 'epoch_240')

Thanks
Best regards,

Thank you for the response.

Still got an error when loading the checkpoint after successfully running the training.

RuntimeError: Error(s) in loading state_dict for PSPnet:
	Missing key(s) in state_dict: "PSP.spatial_blocks.0.2.weight", "PSP.spatial_blocks.0.2.bias", "PSP.spatial_blocks.0.2.running_mean", "PSP.spatial_blocks.0.2.running_var", "PSP.spatial_blocks.1.2.weight", "PSP.spatial_blocks.1.2.bias", "PSP.spatial_blocks.1.2.running_mean", "PSP.spatial_blocks.1.2.running_var", "PSP.spatial_blocks.2.2.weight", "PSP.spatial_blocks.2.2.bias", "PSP.spatial_blocks.2.2.running_mean", "PSP.spatial_blocks.2.2.running_var", "PSP.spatial_blocks.3.2.weight", "PSP.spatial_blocks.3.2.bias", "PSP.spatial_blocks.3.2.running_mean", "PSP.spatial_blocks.3.2.running_var", "PSP.bottleneck.1.weight", "PSP.bottleneck.1.bias", "PSP.bottleneck.1.running_mean", "PSP.bottleneck.1.running_var", "upsampling1.layer.1.weight", "upsampling1.layer.1.bias", "upsampling1.layer.1.running_mean", "upsampling1.layer.1.running_var", "upsampling2.layer.1.weight", "upsampling2.layer.1.bias", "upsampling2.layer.1.running_mean", "upsampling2.layer.1.running_var", "upsampling3.layer.1.weight", "upsampling3.layer.1.bias", "upsampling3.layer.1.running_mean", "upsampling3.layer.1.running_var". 

Here's my code:

model_name = "pspnet_resnet50"
device = 'cuda'
batch_size = 1
n_classes = 6
num_epochs = 22
check_point_stride = 21
image_axis_minimum_size = 200
pretrained = True
fixed_feature = True
logger = Logger(model_name=model_name, data_name='test2')

model = all_models.model_from_name[model_name](n_classes)
logger.load_model(model, 'epoch_21')

model.to(device)

# Loader
compose = transforms.Compose([
    Rescale(image_axis_minimum_size),
    ToTensor()
])

train_datasets = SegmentationDataset(train_images, train_labeled, n_classes, compose)
train_loader = torch.utils.data.DataLoader(train_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

test_datasets = SegmentationDataset(test_images, test_labeled, n_classes, compose)
test_loader = torch.utils.data.DataLoader(test_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

trainer = Trainer(model, optimizer, logger, num_epochs, train_loader, test_loader, check_point_epoch_stride=check_point_stride)
trainer.train()

Could you check the Logger's arguments?
When you load the checkpoint, 'model_name' and 'data_name' should be the same as when you trained the model.
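
If you want to double-check, you can look for the run folder on disk before loading. This is only a quick sanity check; the directory layout below is an assumption based on the ./segmentation/runs/models path mentioned earlier, not a documented part of the Logger API.

import os

# Hypothetical sanity check: look for run folders matching the names that
# were passed to Logger. The layout under ./segmentation/runs is assumed
# from the ./segmentation/runs/models path mentioned above.
model_name, data_name = "pspnet_resnet50", "test2"
runs_root = os.path.join("segmentation", "runs")
if os.path.isdir(runs_root):
    matches = [d for d in os.listdir(runs_root)
               if model_name in d and data_name in d]
    print(matches or "no matching run folder found")
else:
    print("runs folder not found:", runs_root)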

Yeah, it's the same.

Strangely, after I used this code instead, it's working 🤣

    model_name = "pspnet_resnet50"
    device = 'cuda'
    batch_size = 1
    n_classes = 6
    num_epochs = 22
    check_point_stride = 21
    image_axis_minimum_size = 200
    pretrained = True
    fixed_feature = True

    logger = Logger(model_name=model_name, data_name='test3')

    # Loader
    compose = transforms.Compose([
        Rescale(image_axis_minimum_size),
        ToTensor()
    ])

    train_datasets = SegmentationDataset(train_images, train_labeled, n_classes, compose)
    train_loader = torch.utils.data.DataLoader(train_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

    test_datasets = SegmentationDataset(test_images, test_labeled, n_classes, compose)
    test_loader = torch.utils.data.DataLoader(test_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

    # Model
    batch_norm = False if batch_size == 1 else True
    model = all_models.model_from_name[model_name](n_classes,
                                                   batch_norm=batch_norm,
                                                   pretrained=pretrained,
                                                   fixed_feature=fixed_feature)
    logger.load_model(model, 'epoch_21')
    model.to(device)

Thank you.

I guess the batch_norm setting you used when rebuilding the model for loading didn't match the one used at training time.
The network layers change depending on batch_norm, so the state_dict keys no longer line up. Sorry for the error...
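
If it happens again, a quick way to see which side is out of sync is to diff the checkpoint keys against the model's keys before loading. A minimal sketch, assuming the saved file holds a plain state_dict that torch.load can read; the checkpoint path here is hypothetical, so point it at whatever the Logger wrote under ./segmentation/runs/models.

import torch

# Hypothetical path; use the actual file the Logger saved for your run,
# and note this assumes the file stores a plain state_dict.
state = torch.load('segmentation/runs/models/epoch_21.pth', map_location='cpu')

model_keys = set(model.state_dict().keys())
ckpt_keys = set(state.keys())

# Keys the model expects but the checkpoint lacks (e.g. batch-norm buffers
# when the checkpoint was trained with batch_norm=False):
print(sorted(model_keys - ckpt_keys))
# Keys the checkpoint has but the model does not expect:
print(sorted(ckpt_keys - model_keys))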

Thanks,