How to resume training from a certain epoch, or update an already trained model, in Torch?

Question

maryam089 opened this issue 7 years ago

I have an AlexNet model that was trained on 100K images, and now I want to update it by training on a few thousand more images. But when I try to load the model and start training, I get the following error. Any help?

/home/maryam/torch/install/bin/lua: /home/maryam/torch/install/share/lua/5.2/nn/Module.lua:327: check that you are sharing parameters and gradParameters
stack traceback:
[C]: in function 'assert'
/home/maryam/torch/install/share/lua/5.2/nn/Module.lua:327: in function 'getParameters'
train.lua:270: in main chunk
[C]: in function 'dofile'
...ryam/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?

K Y Sunami · Answer 1 · Fri Nov 13 2020 17:14:39 GMT+0800 (China Standard Time)

Check this out: https://debuggercafe.com/effective-model-saving-and-resuming-training-in-pytorch/
You can save a checkpoint first and reload it when you want to resume training. Hope this helps.
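
To make the save-then-reload idea concrete, here is a minimal, framework-agnostic sketch in plain Python. It is not the code from the linked article and not the original train.lua: the toy one-parameter model, the helper names (`save_checkpoint`, `load_checkpoint`, `train`) and the file name `checkpoint.pkl` are all made up for illustration. The point is only that the checkpoint stores the epoch counter together with the parameters, so a later run can pick up where the earlier one stopped, even with extra data added. In PyTorch the same pattern is usually written with `torch.save` / `torch.load` and the `state_dict()` / `load_state_dict()` methods of the model and optimizer.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the training state to disk, replacing any older checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # swap in the new file only once it is fully written


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "w": 0.0}


def train(data, total_epochs, path, lr=0.05):
    """Fit y = w * x with SGD, resuming from the epoch stored in the checkpoint."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        for x, y in data:
            grad = 2.0 * (state["w"] * x - y) * x  # d/dw of (w*x - y)^2
            state["w"] -= lr * grad
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)  # checkpoint after every epoch
    return state


# Hypothetical data: first run on the original set, then resume with more samples.
workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "checkpoint.pkl")

original = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
first = train(original, total_epochs=5, path=ckpt)

extra = [(0.5, 1.0), (1.5, 3.0)]
resumed = train(original + extra, total_epochs=10, path=ckpt)

print(first["epoch"], resumed["epoch"], round(resumed["w"], 3))
```

The second call to `train` does not start from epoch 0: it reads the saved epoch and weight and runs only the remaining epochs. If you also use an optimizer with internal state (momentum, learning-rate schedule), that state should go into the checkpoint as well.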