weidler / RLaSpa

Reinforcement Learning in Latent Space

Being able to save and continue training a model

HansBambel opened this issue

Training takes a long time. If it crashes partway through, we should be able to continue from where it crashed without restarting the whole training.

@adrigrillo, when restoring training with DDQN, are the following variables enough?

  • current_model
  • target_model
  • optimizer
  • memory

Can I restore target_model to the same weights as current_model (from the latest checkpoint)?

The target model should be enough. However, you have to copy the target model into the current model when loading a pretrained model. As for the optimizer and the memory, saving them is not necessary; better not to save them.
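
A minimal PyTorch sketch of that save/restore flow, assuming current_model and target_model are torch.nn.Module instances; the file path and function names here are illustrative, not the repo's actual API:

```python
import torch

CHECKPOINT_PATH = "ddqn_checkpoint.pt"  # hypothetical path for illustration

def save_checkpoint(target_model, episode):
    # As suggested above, only the target network is written out.
    torch.save({"episode": episode,
                "model_state": target_model.state_dict()},
               CHECKPOINT_PATH)

def load_checkpoint(current_model, target_model):
    # Restore the target network, then copy its weights into the current
    # network so both start from the same point when training resumes.
    checkpoint = torch.load(CHECKPOINT_PATH)
    target_model.load_state_dict(checkpoint["model_state"])
    current_model.load_state_dict(target_model.state_dict())
    return checkpoint["episode"]
```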

Alright, thx!
I thought Adam adjusts the learning rate depending on the current epoch/episode. Is that the case here?
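
For what it's worth: Adam does not schedule the learning rate by epoch (that would be a separate LR scheduler). It adapts per-parameter step sizes via running first- and second-moment estimates plus a step counter, and all of that lives in the optimizer's state_dict. So skipping the optimizer, as suggested above, just means those moments re-warm after a restart; saving it would give an exact resume. A toy sketch of the latter, with illustrative names:

```python
import torch
import torch.nn as nn

# Toy setup for illustration; the real models live in the repo.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Adam's moment estimates and step counters live in its state_dict,
# so persisting it alongside the model allows a bit-exact resume.
torch.save({"model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict()},
           "full_checkpoint.pt")

# On restart: rebuild the objects, then restore both states.
checkpoint = torch.load("full_checkpoint.pt")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])
```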

TODOs

  • also save representation learner (write out separately)
  • overwrite checkpoint
  • change checkpoint-loading from doc to dir
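
One possible shape for these TODOs, assuming the representation learner is also a torch.nn.Module and reading "doc to dir" as loading from a checkpoint directory rather than a single file; the directory layout and names are hypothetical:

```python
import os
import torch

def save_all(current_model, repr_learner, ckpt_dir="checkpoints"):
    # Write the agent and the representation learner to separate files,
    # overwriting the previous checkpoint in place each time.
    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save(current_model.state_dict(), os.path.join(ckpt_dir, "agent.pt"))
    torch.save(repr_learner.state_dict(), os.path.join(ckpt_dir, "repr_learner.pt"))

def load_all(current_model, repr_learner, ckpt_dir="checkpoints"):
    # Load from a checkpoint directory rather than a single file.
    current_model.load_state_dict(torch.load(os.path.join(ckpt_dir, "agent.pt")))
    repr_learner.load_state_dict(torch.load(os.path.join(ckpt_dir, "repr_learner.pt")))
```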