Initialization Routines

cicdw opened this issue 7 years ago · comments

An all-too-often ignored side of optimization is initialization. There is a lot of research suggesting that for convex and, even more so, non-convex optimization problems, a large amount of work can be saved by initializing algorithms at clever starting values.

Currently we are initializing all algorithms with the 0 vector. Once the API (#11) is sorted out, we should have multiple options for how to initialize, including (but not limited to):

random Gaussian initializations
running some other, faster algorithm to only a loose (low-accuracy) tolerance (e.g., initializing Newton with the output of gradient descent stopped at a very loose tolerance setting)
outputs of previous runs (will be built into a refit method, to be raised in a future issue)
more interesting but academically well-grounded ideas
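
As a rough illustration of the first two options, here is a minimal NumPy sketch for logistic regression. The names `initialize`, `gradient_descent`, and `newton` are hypothetical and are not part of any existing API in this project; the step size, tolerances, and the plain (unsafeguarded) Newton step are assumptions chosen to keep the example short.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient(beta, X, y):
    """Gradient of the mean logistic negative log-likelihood."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y)


def gradient_descent(X, y, beta0, step=0.5, tol=1e-2, max_iter=10_000):
    """Cheap first-order solver, stopped early at a loose tolerance."""
    beta = beta0.copy()
    for _ in range(max_iter):
        g = gradient(beta, X, y)
        if np.linalg.norm(g) < tol:
            break
        beta -= step * g
    return beta


def newton(X, y, beta0, tol=1e-8, max_iter=50):
    """Plain Newton iteration; returns the estimate and iterations used."""
    beta = beta0.copy()
    for it in range(max_iter):
        g = gradient(beta, X, y)
        if np.linalg.norm(g) < tol:
            return beta, it
        p = sigmoid(X @ beta)
        # Hessian of the mean negative log-likelihood: X^T diag(p(1-p)) X / n
        H = (X.T * (p * (1.0 - p))) @ X / len(y)
        beta = beta - np.linalg.solve(H, g)
    return beta, max_iter


def initialize(X, y, method="zeros", rng=None):
    """Hypothetical dispatcher over a few starting-value strategies."""
    n_features = X.shape[1]
    if method == "zeros":        # current behaviour: the 0 vector
        return np.zeros(n_features)
    if method == "gaussian":     # random Gaussian initialization
        rng = np.random.default_rng() if rng is None else rng
        return rng.standard_normal(n_features)
    if method == "warm":         # output of a faster, low-accuracy solver
        return gradient_descent(X, y, np.zeros(n_features))
    raise ValueError(f"unknown initialization method: {method!r}")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 3))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = (rng.random(500) < sigmoid(X @ beta_true)).astype(float)
    for method in ("zeros", "warm"):
        beta, iters = newton(X, y, initialize(X, y, method))
        print(method, iters, beta)
```

Whether the warm start pays off depends on how cheap the first-order iterations are relative to a Newton step, so the iteration counts printed here are only one side of the trade-off.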

cc: @mpancia

Matthew Pancia · Answer 1 · Sat Mar 18 2017 05:11:00 GMT+0800 (China Standard Time)

As we discussed earlier, I think this is a really cool idea, and I'm glad to be part of the discussion.

As a novice to this (and for the purposes of furthering a discussion), do you know any good surveys of what the academically well-grounded things look like and/or some higher-level discussions of the benefits of Smart Initialization™?

Chris White · Answer 2 · Tue Mar 21 2017 01:08:31 GMT+0800 (China Standard Time)

No surveys that I know of unfortunately, but here's a list off the top of my head:

I've heard people say you can use the close connection between LDA and Logistic Regression to initialize one with the other (I haven't thought too much about the speed / efficiency trade-offs here)
In the non-convex case, there's the "famous" k-means++
the starting guess for the approximation to the Hessian in BFGS can have significant consequences for the convergence of the algorithm
you can exploit the close connection between W-OLS and Logistic Regression to infer things about variable addition / dropping, which is related to multiple refits (see the Logistic Regression chapter in Elements of Statistical Learning)
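
For reference, the k-means++ seeding mentioned in the list can be sketched in a few lines of NumPy. This is only an illustration of the standard seeding rule (each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far); the function name `kmeans_pp_init` is hypothetical, and the sketch assumes the data contain at least `k` distinct points.

```python
import numpy as np


def kmeans_pp_init(X, k, rng=None):
    """Pick k initial centers from the rows of X using k-means++ seeding."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]

    # First center: chosen uniformly at random from the data.
    centers = [X[rng.integers(n)]]
    # Squared distance from every point to its nearest chosen center.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)

    for _ in range(1, k):
        # Sample the next center with probability proportional to d2.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))

    return np.array(centers)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 0.1, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])
    print(kmeans_pp_init(X, k=3, rng=rng))
```

The returned centers would then be handed to an ordinary Lloyd-style k-means loop in place of uniformly random starting centers.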

Ultimately, I think the biggest bang will come from smart initializations when refitting a model, but I'd like to include at least a little thought on initializations from scratch as well.