MLI-lab / DeepDeWedge

Self-supervised deep learning for denoising and missing wedge reconstruction of cryo-ET tomograms

Resuming from checkpoint is ignored?

rdrighetto opened this issue

When trying to resume ddw fit-model from a previous checkpoint, it seems to ignore it and start fitting from scratch again. I am adding the resume-from-checkpoint line to my config.yaml file:

fit_model:
    unet_params_dict:
      chans: 32
      num_downsample_layers: 3
      drop_prob: 0.1
    adam_params_dict: 
      lr: 0.0004
    num_epochs: 600
    batch_size: 10
    update_subtomo_missing_wedges_every_n_epochs: 10
    check_val_every_n_epochs: 10
    save_n_models_with_lowest_val_loss: 5
    save_n_models_with_lowest_fitting_loss: 5
    save_model_every_n_epochs: 50
    logger: "csv"
    resume-from-checkpoint: "ddw_imod/logs/version_0/checkpoints/epoch/epoch=399.ckpt"

I know it's starting from scratch because in the previous run (which crashed), the fitting loss at that epoch was 0.14867860078811646, but upon starting this new run it reports a fitting loss of 0.1634301096200943 in the first epoch. Does that make sense?

Thank you!

Hi Ricardo,

Thanks for opening this issue! I just tried to resume fitting from a checkpoint and also noticed that the loss in the early epochs of the resumed fitting is a bit higher than the loss corresponding to the checkpoint I resumed from. However, it is not as high as in the very first epoch.

In your case, was 0.16 the loss in the first epoch of the crashed run? If the loss in the resumed fitting is lower than the one in the first epoch, I would recommend sticking with the resumed fitting as long as the loss is decreasing nicely.

Unfortunately, it seems to me that this issue may be related to PyTorch Lightning itself (the deep learning framework DDW is based on) rather than to my own code. I will do some more research on this (maybe updating to a newer version or something like that might help) and will get back to you when I know more.

Best,
Simon

Hi Simon, thanks for the quick reply!

Unfortunately, I have already deleted the logs from the crashed run, but I'm almost sure that the loss in the first epoch was higher than 0.16, so it should be fine according to your explanation. Do you know of any other way to confirm that it's resuming from the supplied checkpoint correctly?

Thank you!

I think as long as PyTorch Lightning's epoch counter (the one next to the progress bar) in the resumed run starts at the epoch of the checkpoint you're resuming from, you can assume that PyTorch Lightning has "correctly" resumed the fitting.

Another thing that occurred to me is that fitting loss typically fluctuates quite a bit between epochs, so maybe the jump from 0.148 to 0.163 you observed was within the loss fluctuation range of the fitting at the time of the crash. We'd need the logs of the crashed run to verify this, but it might be good to keep this in mind for the future.

Hi Simon,

I just did a test and the epoch counter always starts from 0, even when I include the resume-from-checkpoint: entry in the config file. That's why I'm still not completely sure that it's resuming the fitting from the specified checkpoint. Maybe I'm specifying some other option that makes it ignore the provided checkpoint and start from scratch? Not sure what that could be, though.

Hi Ricardo,

I think I found the solution to your problem. It seems that in the config file, all arguments have to be given using underscores, i.e. resume_from_checkpoint, instead of hyphens, i.e. resume-from-checkpoint. Apparently, the hyphen notation only works when arguments are given directly on the command line.
This seems to be a property of typer, the framework I used to build the command line interface. I am sorry for the trouble this caused; I did not know about this. I will try to develop an "argument sanity checker" to prevent such problems in the future.
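
For reference, the relevant entry from your config above should then look like this (same checkpoint path, just with the underscored key):

fit_model:
    resume_from_checkpoint: "ddw_imod/logs/version_0/checkpoints/epoch/epoch=399.ckpt"

On the command line, the hyphenated spelling should still work (e.g. something like ddw fit-model --resume-from-checkpoint ...), since typer only uses the hyphenated form for command line options.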

Thank you so much for helping me to improve DeepDeWedge one GitHub issue at a time :)

Best,
Simon

Ah, that makes sense! My bad 😬
In any case, a sanity checker would be nice! Thanks a lot for helping me get the best out of DDW! It's really an amazing program!