Restarting interrupted training / checkpoints
deccolquitt opened this issue
Is there any way to restart interrupted training? I can't see a checkpoint-related command in train_main.py.
Hi, I haven't implemented such a feature, but it should be quite easy: since training is done for each scale independently, and the networks are saved after each finished scale, you could resume training from the last finished scale. If you decide to implement this, I would be happy to add the feature to the repository.
Gal
Unfortunately I don't know enough about coding to do this; whenever I have tried using the same dataset, it just creates a new directory and starts from scratch. Thanks anyway.
Would this be the right sort of thing to look at? https://stackoverflow.com/questions/42703500/best-way-to-save-a-trained-model-in-pytorch
Yes, but this is already done during training. If you want to implement continuation of an existing model, you have to load its saved networks and then continue training the following scales.
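The resume logic described above could be sketched roughly as follows. This is only a hypothetical illustration, not code from the repository: it assumes each finished scale N is saved under a per-scale subdirectory (e.g. `out_dir/N/netG.pth`), so the script can detect the last completed scale and skip straight to the next one instead of starting a fresh run.

```python
import os

def last_finished_scale(out_dir):
    """Return the highest scale index with a saved generator, or -1 if none.

    Assumes (hypothetically) that a finished scale N leaves a file at
    out_dir/N/netG.pth; the actual layout in the repository may differ.
    """
    last = -1
    if not os.path.isdir(out_dir):
        return last
    for name in os.listdir(out_dir):
        if name.isdigit() and os.path.isfile(os.path.join(out_dir, name, "netG.pth")):
            last = max(last, int(name))
    return last

def resume_start_scale(out_dir):
    """Scale index at which training should (re)start."""
    return last_finished_scale(out_dir) + 1
```

In a training script, one would then load the saved networks for scales 0 through `last_finished_scale(out_dir)` (e.g. with `torch.load`) to rebuild the pyramid, and run the training loop only for scales from `resume_start_scale(out_dir)` onward.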