Decaying Partial Parameter
akaniklaus opened this issue · comments
Given that there is prior research supporting a switch from Adam to SGD at later stages of training for better generalization (https://arxiv.org/abs/1712.07628), do you think it makes sense to start from the maximum value (0.5) of the partial parameter and then decay it to a hyper-tuned value over epochs? I have done a basic implementation of this already, but I would be glad to have your opinion on it.
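For concreteness, a minimal sketch of what such a schedule could look like. This assumes a linear decay of the partial parameter `p` from 0.5 (fully adaptive, Adam-like) toward a tuned target over the course of training; the function name `partial_decay`, the target value `p_end`, and the linear shape are all illustrative choices, not the implementation referenced above:

```python
def partial_decay(epoch: int, total_epochs: int,
                  p_start: float = 0.5, p_end: float = 0.125) -> float:
    """Linearly decay the partial parameter from p_start toward p_end.

    p_start = 0.5 corresponds to fully adaptive (Adam-like) behavior;
    smaller p moves the update toward SGD-like behavior. The default
    p_end here is an arbitrary illustrative target, not a recommendation.
    """
    if epoch >= total_epochs:
        return p_end
    frac = epoch / total_epochs  # fraction of training completed
    return p_start + (p_end - p_start) * frac
```

In a training loop one would then recompute `p` at the start of each epoch and pass it to the optimizer (e.g. set the exponent used on the second-moment term), leaving `p_end` as the single value to tune. A cosine or step schedule would work the same way; only the interpolation formula changes.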
Sorry for the late reply... yes, we believe it could further improve performance. We didn't do so because the downside is that it introduces extra hyperparameters and additional tuning effort.