SGD x Adam

Question

pvcastro opened this issue 6 years ago

Pedro Vitor Quinta de Castro commented 6 years ago

Do you have any theories as to why, in your implementation, SGD performs better than the Adam optimizer (or any other optimizer, for that matter)? Do you think it's related to batch processing not being implemented?

Thanks!

Guillaume Lample · Answer 1 · Mon Jul 16 2018 22:02:27 GMT+0800 (China Standard Time)

Hi,

In my experience (and I know many people have made similar observations), SGD is what works best with batch size 1. Batch size 1 is also what works best in general, but people use larger batch sizes (such as 32 or 128) to speed up training. With larger batch sizes, Adam usually gives better results than SGD, although this depends somewhat on the task. For NER, though, I have always observed that SGD was significantly better than the alternatives.
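
For readers who want to see the two update rules side by side, here is a minimal sketch in plain Python. It is not code from this repository: the toy problem (fitting `y = 3x` with a single weight), the helper names `sgd_step`, `adam_step`, `batch_grad`, and `train`, and all hyperparameter values are assumptions chosen for illustration. It only shows how batch size and optimizer can be varied together, and says nothing about which one wins on NER.

```python
import math

def sgd_step(w, grad, lr=0.1):
    """Plain SGD: move against the gradient with a fixed step size."""
    return w - lr * grad

def adam_step(w, grad, state, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and both moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero-initialised moments.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def batch_grad(w, batch):
    """Gradient w.r.t. w of the mean of 0.5 * (w*x - y)**2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, optimizer, epochs=50):
    """Hypothetical toy loop: fit y = w*x, visiting the data in fixed order."""
    w = 0.0
    state = {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(epochs):
        for i in range(0, len(data), batch_size):
            g = batch_grad(w, data[i:i + batch_size])
            if optimizer == "sgd":
                w = sgd_step(w, g)
            else:
                w = adam_step(w, g, state)
    return w

# Toy data (an assumption for illustration): y = 3x, no noise.
data = [(i / 10, 3 * i / 10) for i in range(1, 11)]
w_sgd = train(data, batch_size=1, optimizer="sgd")    # one example per update
w_adam = train(data, batch_size=5, optimizer="adam")  # mini-batches of 5
```

With batch size 1, every example triggers its own update, so SGD takes many small, noisy steps per epoch. Adam rescales each step by running estimates of the gradient's first and second moments, which is one reason it is often paired with larger batches.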

Pedro Vitor Quinta de Castro · Answer 2 · Mon Jul 16 2018 22:15:36 GMT+0800 (China Standard Time)

Ok, thanks!
I'm presenting a paper based on your LSTM-CRF architecture at a conference on Portuguese NLP in September ("Portuguese Named Entity Recognition using LSTM-CRF" - http://www.inf.ufrgs.br/propor-2018/accepted-papers/), so I'm getting ready for it. If you have any tips, they would be most welcome! Thanks!