Training failed after 60000 iterations
qiuqiangkong opened this issue · comments
Hi, really brilliant code! When I run the PyTorch implementation, I found the training and validation loss increase dramatically after 60000 iterations. The loss curve looks like this: https://drive.google.com/open?id=1_64-jD3hOtXmrOoVMq5hD8pWfmEUc7yE
Do you have any idea of this increased loss? Thank you very much!
Qiuqiang
Many thanks for the reply! I figured out the problem. In the Adam optimizer, I need to set this flag to true: amsgrad=True. That fixes the problem.
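For anyone hitting the same thing, here is a minimal sketch of where the flag goes in PyTorch. The model below is just a placeholder `nn.Linear` (the actual project trains SampleRNN); only the `amsgrad=True` argument is the point.

```python
import torch
import torch.nn as nn

# Placeholder model -- any nn.Module works to show where the flag goes.
model = nn.Linear(10, 1)

# amsgrad=True switches Adam to the AMSGrad variant, which keeps the
# running maximum of the second-moment estimate instead of letting it
# decay, so the effective step size cannot blow up late in training.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

# One dummy training step to show usage.
x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```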
@qiuqiangkong Hi qiuqiang, I am using TensorFlow to implement SampleRNN and I've encountered the same problem; however, I don't think tf.train.AdamOptimizer has this amsgrad flag. Any idea how to troubleshoot this? Thanks in advance!
@zguo008 Interesting! I think Keras has the amsgrad flag. Or try the RMSprop optimizer instead. I guess this phenomenon is caused by the optimizer. Let me know if you still can't solve this problem.
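To see why plain Adam can blow up here, this framework-free sketch (made-up gradient values, not from the actual training run) compares Adam's second-moment estimate with AMSGrad's. After one large gradient followed by many tiny ones, Adam's estimate decays toward zero, so the 1/sqrt(v) step scale grows; AMSGrad's running max keeps the denominator from shrinking.

```python
def second_moments(grads, beta2=0.999):
    """Return Adam's v_t and AMSGrad's running-max estimate for each step."""
    v, v_max = 0.0, 0.0
    vs, v_maxes = [], []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g  # Adam's exponential average
        v_max = max(v_max, v)                # the only change AMSGrad makes
        vs.append(v)
        v_maxes.append(v_max)
    return vs, v_maxes

# One gradient spike, then thousands of tiny gradients.
grads = [10.0] + [0.01] * 5000
vs, v_maxes = second_moments(grads)

# Adam's estimate has decayed far below the spike; AMSGrad remembers it,
# so AMSGrad's effective step size stays bounded while Adam's grows.
print(vs[-1], v_maxes[-1])
```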
Hi qiuqiang, thanks for your reply :) I'll try RMSprop later. I tuned down the value of epsilon in the Adam optimizer and it seems that there's no sudden increase in loss for now. Thank you for your answer, it's a great help!
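For reference on why epsilon matters at all: it enters Adam's per-parameter update as lr * m_hat / (sqrt(v_hat) + eps), so when the second-moment estimate v_hat has decayed to near zero, epsilon is the only thing bounding the step size. The sketch below uses made-up values just to show that sensitivity; it does not reproduce the actual training run.

```python
import math

def adam_step_size(m_hat, v_hat, lr=1e-3, eps=1e-8):
    """Magnitude of a single Adam parameter update."""
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# With v_hat decayed to near zero, the step is roughly lr * m_hat / eps,
# so epsilon directly sets how large the spike can get.
spike = adam_step_size(m_hat=0.1, v_hat=1e-12, eps=1e-8)
damped = adam_step_size(m_hat=0.1, v_hat=1e-12, eps=1e-3)
print(spike, damped)
```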
@zguo008 That is great! What epsilon are you using now?