Decaying Partial Parameter
akaniklaus opened this issue · comments
Given that there is prior research supporting a switch from Adam to SGD at later stages of training for better generalization (https://arxiv.org/abs/1712.07628), do you think it makes sense to start from the maximum value (0.5) of the partial parameter and then decay it to a hyper-tuned value over epochs? I have done a basic implementation of this already, but I would be glad to have your opinion on it.
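For concreteness, a minimal sketch of what such a schedule could look like. This assumes a linear decay of the partial parameter `p` from 0.5 (fully adaptive, Adam-like) toward a tuned target over the course of training; the function name `partial_decay`, the target value `p_end`, and the linear shape are all illustrative choices, not the implementation referenced above:

```python
def partial_decay(epoch: int, total_epochs: int,
                  p_start: float = 0.5, p_end: float = 0.125) -> float:
    """Linearly decay the partial parameter from p_start toward p_end.

    p_start = 0.5 corresponds to fully adaptive (Adam-like) behavior;
    smaller p moves the update toward SGD-like behavior. The default
    p_end here is an arbitrary illustrative target, not a recommendation.
    """
    if epoch >= total_epochs:
        return p_end
    frac = epoch / total_epochs  # fraction of training completed
    return p_start + (p_end - p_start) * frac
```

In a training loop one would then recompute `p` at the start of each epoch and pass it to the optimizer (e.g. set the exponent used on the second-moment term), leaving `p_end` as the single value to tune. A cosine or step schedule would work the same way; only the interpolation formula changes.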
Sorry for the late reply... yes, we believe it could further improve performance. We didn't do so because the downside is that it introduces extra hyperparameters and additional tuning effort.