MLI-lab / DeepDeWedge

Self-supervised deep learning for denoising and missing wedge reconstruction of cryo-ET tomograms

Resuming from checkpoint is ignored?

rdrighetto opened this issue

When trying to resume ddw fit-model from a previous checkpoint, it seems to ignore it and start fitting from scratch again. I am adding the resume-from-checkpoint line to my config.yaml file:

fit_model:
    unet_params_dict:
      chans: 32
      num_downsample_layers: 3
      drop_prob: 0.1
    adam_params_dict: 
      lr: 0.0004
    num_epochs: 600
    batch_size: 10
    update_subtomo_missing_wedges_every_n_epochs: 10
    check_val_every_n_epochs: 10
    save_n_models_with_lowest_val_loss: 5
    save_n_models_with_lowest_fitting_loss: 5
    save_model_every_n_epochs: 50
    logger: "csv"
    resume-from-checkpoint: "ddw_imod/logs/version_0/checkpoints/epoch/epoch=399.ckpt"

I know it's starting from scratch because in the previous run (which crashed), the fitting loss at that epoch was 0.14867860078811646, but upon starting this new run it reports a fitting loss of 0.1634301096200943 in the first epoch. Does that make sense?

Thank you!

Hi Ricardo,

Thanks for opening this issue! I just tried to resume fitting from a checkpoint and also noticed that the loss in the early epochs of the resumed fitting is a bit higher than the loss corresponding to the checkpoint I resumed from. However, it is not as high as in the very first epoch.

In your case, was 0.16 the loss in the first epoch of the crashed run? If the loss in the resumed fitting is lower than the one in the first epoch, I would recommend sticking with the resumed fitting as long as the loss is decreasing nicely.

Unfortunately, it seems to me that this issue may be related to PyTorch Lightning itself (the deep learning framework DDW is based on) rather than to my own code. I will do some more research on this (maybe updating to a newer version or something like that might help) and will get back to you when I know more.

Best,
Simon

Hi Simon, thanks for the quick reply!

Unfortunately, I have already deleted the logs from the crashed run, but I'm almost sure that the loss in the first epoch was higher than 0.16, so it should be fine according to your explanation. Do you know of any other way to confirm that it's resuming from the supplied checkpoint correctly?

Thank you!

I think as long as PyTorch Lightning's epoch counter (the one next to the progress bar) in the resumed run starts at the epoch of the checkpoint you're resuming from, you can assume that PyTorch Lightning has "correctly" resumed the fitting.

Another thing that occurred to me is that fitting loss typically fluctuates quite a bit between epochs, so maybe the jump from 0.148 to 0.163 you observed was within the loss fluctuation range of the fitting at the time of the crash. We'd need the logs of the crashed run to verify this, but it might be good to keep this in mind for the future.

Hi Simon,

I just did a test and the epoch counter always starts from 0, even when I include the resume-from-checkpoint: entry in the config file. That's why I'm still not completely sure that it's resuming the fitting from the specified checkpoint. Maybe I'm specifying some other option that makes it ignore the provided checkpoint and start from scratch? Not sure what that could be, though.

Hi Ricardo,

I think I found the solution to your problem. It seems that in the config file, all arguments have to be given using underscores, i.e. resume_from_checkpoint, instead of hyphens, i.e. resume-from-checkpoint. Apparently, the hyphen notation only works when arguments are given directly on the command line.
This seems to be a property of typer, the framework I used to build the command line interface. I am sorry for the trouble this caused; I did not know about this. I will try to develop an "argument sanity checker" to prevent such problems in the future.
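
For reference, the relevant entry from your config above should then look like this (same checkpoint path, just with the underscored key):

fit_model:
    resume_from_checkpoint: "ddw_imod/logs/version_0/checkpoints/epoch/epoch=399.ckpt"

On the command line, the hyphenated spelling should still work (e.g. something like ddw fit-model --resume-from-checkpoint ...), since typer only uses the hyphenated form for command line options.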

Thank you so much for helping me to improve DeepDeWedge one GitHub issue at a time :)

Best,
Simon

Ah, that makes sense! My bad 😬
In any case, a sanity checker would be nice! Thanks a lot for helping me get the best out of DDW! It's really an amazing program!