uclaml / Padam

Implementation of the Partially Adaptive Momentum Estimation (Padam) method from the paper "Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks" (accepted by IJCAI 2020)

Home Page: https://arxiv.org/abs/1806.06763

Decaying Partial Parameter

akaniklaus opened this issue

Given that there is previous research supporting a switch from Adam to SGD in the later stages of training for better generalization (https://arxiv.org/abs/1712.07628), do you think it makes sense to start from the maximum value (0.5) of the partial parameter and then decay it to a tuned value over the epochs? I already have a basic implementation of this, but I would be glad to hear your opinion on it.
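For reference, a minimal sketch of what such a schedule might look like, assuming the optimizer stores the partial parameter under a `partial` key in its param groups (the key name and decay endpoints here are assumptions for illustration, not the issue author's actual implementation):

```python
# Hypothetical sketch: linearly decay Padam's partial parameter p from 0.5
# (fully Adam-like) down to a tuned target value over the first
# `decay_epochs` epochs, then keep it constant.
# Assumes the optimizer exposes the exponent as group['partial'];
# the real Padam code may use a different attribute name.

def decayed_partial(epoch, decay_epochs, p_start=0.5, p_end=0.125):
    """Return the partial parameter for the given epoch (linear decay)."""
    if epoch >= decay_epochs:
        return p_end
    frac = epoch / decay_epochs
    return p_start + frac * (p_end - p_start)

def set_partial(optimizer, p):
    """Write the current partial value into every param group."""
    for group in optimizer.param_groups:
        group['partial'] = p

# Usage inside a standard epoch-level training loop:
# for epoch in range(num_epochs):
#     set_partial(optimizer, decayed_partial(epoch, decay_epochs=30))
#     train_one_epoch(model, optimizer, loader)
```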

Sorry for the late reply... Yes, we believe it could further improve performance. We didn't do so because the downside is that it introduces extra hyperparameters and additional tuning effort.