scikit-learn-contrib / skglm

Fast and modular sklearn replacement for generalized linear models

Home Page: http://contrib.scikit-learn.org/skglm


Warm start has no effect on GramCD solver

carlosg-m opened this issue

Setting warm_start to True with the GramCD solver seems to have no effect on the time it takes to fit an estimator, even when fitting the same dataset repeatedly. The goal is to speed up fitting by sharing model parameters between folds in sliding-window cross-validation.

Code to reproduce:

from skglm.solvers import GramCD
from skglm.datafits import Quadratic
from skglm.penalties import L1_plus_L2
from skglm.estimators import GeneralizedLinearEstimator

model = GeneralizedLinearEstimator(
    datafit=Quadratic(),
    penalty=L1_plus_L2(alpha=alpha, l1_ratio=l1_ratio, positive=False),
    solver=GramCD(max_iter=1000, fit_intercept=False, greedy_cd=True,
                  use_acc=False, warm_start=True))

Also, is it advised to convert the x and y inputs to Fortran-contiguous numpy arrays, as in scikit-learn, to save memory?

x = np.asfortranarray(x)
y = np.asfortranarray(y)

Thanks @carlosg-m for reporting that!

You are right, warm_start is useful to leverage an already computed solution, typically in cross validation or along a regularization path.

In your example, do you make sure the solution w is passed to the solver at each cross-validation iteration?
To help you further, could you provide a full example to reproduce the behavior?
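
For concreteness, here is a minimal sketch of what reusing the previous solution across sliding-window folds could look like (the data and window sizes below are made up for illustration); with warm_start=True on the solver, each call to fit starts from the coefficients produced by the previous call:

import numpy as np

from skglm.solvers import GramCD
from skglm.datafits import Quadratic
from skglm.penalties import L1_plus_L2
from skglm.estimators import GeneralizedLinearEstimator

np.random.seed(0)
X = np.random.randn(1_000, 50)   # made-up data standing in for the real time series
y = np.random.randn(1_000)

window, step = 400, 100          # made-up sliding-window sizes

model = GeneralizedLinearEstimator(
    datafit=Quadratic(),
    penalty=L1_plus_L2(alpha=0.1, l1_ratio=0.9, positive=False),
    solver=GramCD(max_iter=1000, fit_intercept=False, greedy_cd=True,
                  use_acc=False, warm_start=True))

for start in range(0, len(y) - window + 1, step):
    # with warm_start=True, this fit starts from the coefficients left by the previous fold
    model.fit(X[start:start + window], y[start:start + window])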

Also, is it advised to convert the x and y inputs to Fortran-contiguous numpy arrays, as in scikit-learn, to save memory?

It is more for computational speed-up, since the GramCD solver slices the matrix X column-wise.
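
As a minimal illustration of the difference (array sizes here are arbitrary), a column slice is a contiguous view only when the array is stored in Fortran (column-major) order; for a 1D array like y, the conversion changes nothing:

import numpy as np

X = np.random.randn(10_000, 100)

X_c = np.ascontiguousarray(X)  # C order: rows are contiguous
X_f = np.asfortranarray(X)     # Fortran order: columns are contiguous

print(X_c[:, 3].flags['C_CONTIGUOUS'])  # False: a strided view over the rows
print(X_f[:, 3].flags['C_CONTIGUOUS'])  # True: one contiguous block of memory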

Here's a modified version of your script with two calls to fit; since the value of alpha does not change, you can see that the solver does nothing in the second fit:

import numpy as np

from skglm.solvers import GramCD
from skglm.datafits import Quadratic
from skglm.penalties import L1_plus_L2
from skglm.estimators import GeneralizedLinearEstimator

np.random.seed(0)
X = np.random.randn(100, 200)
y = np.random.randn(100)


l1_ratio = 0.9
alpha_max = np.linalg.norm(X.T @ y) / (l1_ratio * len(y))

alpha = 0.05 * alpha_max

solver = GramCD(max_iter=1000, fit_intercept=False,
                greedy_cd=True, use_acc=False, warm_start=True, verbose=3)
model = GeneralizedLinearEstimator(
    datafit=Quadratic(),
    penalty=L1_plus_L2(alpha=alpha, l1_ratio=l1_ratio, positive=False),
    solver=solver)

print("First fit:")
model.fit(X, y)
print("Second fit:")
model.fit(X, y)

Output:

First fit:
Iteration 1: 0.4647719764, stopping crit: 2.30e-01
Iteration 2: 0.3285817698, stopping crit: 1.11e-03
Iteration 3: 0.3285482999, stopping crit: 1.80e-04
Iteration 4: 0.3285474862, stopping crit: 3.04e-05
Stopping criterion max violation: 3.04e-05
Second fit:
Iteration 1: 0.3285474862, stopping crit: 3.04e-05
Stopping criterion max violation: 3.04e-05

As @Badr-MOUFAD said, we'd need a more complete script to investigate the behavior you're mentioning.

Also, from a statistical point of view, beware that using warm start between folds leaks data from one fold to the other, which you do not want.

@Badr-MOUFAD and @mathurinm, you are both right. I'm going to close the issue. I suspected there was a bug because the fit time did not change, but after trying an example myself with increased verbosity, it is clear that warm start is working: the second and later fits take only one iteration. However, the fit time stays more or less the same, presumably because of the high efficiency of the implementation?
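
One simple way to check where the time goes (a minimal sketch, reusing model, X and y from the script above) is to time the two fits separately:

import time

# model, X, y as defined in the script above
t0 = time.perf_counter()
model.fit(X, y)
print(f"first fit:  {time.perf_counter() - t0:.4f} s")

t0 = time.perf_counter()
model.fit(X, y)
print(f"second fit: {time.perf_counter() - t0:.4f} s")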

@mathurinm, can you please explain this heuristic to determine max alpha?
alpha_max = np.linalg.norm(X.T @ y) / (l1_ratio * len(y))

skglm is currently being deployed to forecast all medium-voltage load diagrams from distribution transformer stations and client transformer stations in Portugal (almost 200K time series, distributed using Spark). The model is a simple, custom-made AR; the most important features are the past lags. I have more details of the implementation in other posts. We were using scikit-learn's ElasticNet, but this implementation proved to be much faster.

Our dataset always has more instances/rows than features, so using the precomputed Gram matrix allows for a considerable speedup. I don't know if there are any disadvantages compared with Anderson coordinate descent, but in practice it's working great.
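
For a rough idea of why that helps (a minimal sketch with made-up sizes): a Gram-based coordinate descent can work with X.T @ X and X.T @ y, whose sizes depend only on the number of features, instead of touching the full n x p design matrix at every pass.

import numpy as np

n_samples, n_features = 100_000, 50  # many more rows than features
X = np.random.randn(n_samples, n_features)
y = np.random.randn(n_samples)

Q = X.T @ X  # Gram matrix, shape (50, 50): independent of n_samples once computed
q = X.T @ y  # shape (50,)
print(Q.shape, q.shape)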

Also, from a statistical point of view, beware that using warm start between folds leaks data from one fold to the other, which you do not want.

@mathurinm, since this is a time-series forecasting problem, the cross-validation process is sequential and without shuffling (very similar to walk-forward cross-validation and to this Prophet implementation), so the only leakage comes from past data: for example, the weights of a model fitted in January with one year of history are reused to speed up the February fit, again with one year of history (sliding window). However, we had to drop this approach because the new version of the AR model includes feature selection using ACF/CCF with prewhitening, which means that the selected features/lags can change between folds.

  • @mathurinm, can you please explain this heuristic to determine max alpha?
    alpha_max = np.linalg.norm(X.T @ y) / (l1_ratio * len(y))

    Interestingly, this is not a heuristic but a theorem: alpha_max is the smallest value of alpha for which the solution of the Enet is guaranteed to be 0. The proof follows from the first-order optimality condition (Fermat's rule); it is, for example, Proposition 4 in https://jmlr.org/papers/volume18/16-577/16-577.pdf. A short sketch of the argument is given below this list.

  • We are extremely happy that skglm is used in a large-scale application; can we advertise it?
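
Sketch of the argument, assuming the Quadratic datafit 1/(2n) * ||y - Xw||^2 and the usual elastic-net parametrization of L1_plus_L2; note that the exact threshold uses the maximum absolute entry of X.T @ y (the infinity norm), which the plain Euclidean np.linalg.norm call above upper-bounds:

% objective, with rho = l1_ratio and n = len(y)
F(w) = \frac{1}{2n}\lVert y - Xw\rVert_2^2
       + \alpha\Big(\rho\,\lVert w\rVert_1 + \frac{1-\rho}{2}\lVert w\rVert_2^2\Big)

% Fermat's rule: w = 0 is optimal iff 0 belongs to the subdifferential of F at 0.
% The l2 part has zero gradient at w = 0, so the condition reduces to
0 \in -\frac{1}{n}X^\top y + \alpha\rho\,\partial\lVert\cdot\rVert_1(0)
\iff \Big\lVert \frac{X^\top y}{n} \Big\rVert_\infty \le \alpha\rho
\iff \alpha \ge \alpha_{\max} := \frac{\lVert X^\top y\rVert_\infty}{\rho\, n}.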

Interestingly, this is not a heuristic but a theorem: alpha_max is the smallest value of alpha for which the solution of the Enet is guaranteed to be 0. The proof follows from the first-order optimality condition (Fermat's rule); it is, for example, Proposition 4 in https://jmlr.org/papers/volume18/16-577/16-577.pdf.

Thank you once again, @mathurinm. I'm training a new batch of models with the alpha_max modification. It is extremely interesting because it yields different values of alpha for different folds (almost as if alpha were being estimated through time) and provides solutions with a higher l1_ratio. Before, the models would move closer to Ridge; now they are closer to Lasso. As for results, I'll have to wait.

The only caveat was setting an "alpha ratio" hyperparameter that has to take extremely small values; for this particular dataset I used a maximum value of 0.01. We are using Optuna for hyperparameter optimization.
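
For illustration, a minimal sketch of what such a search could look like with Optuna; the cv_score helper and the search ranges are hypothetical placeholders, only create_study, optimize and suggest_float are Optuna's actual API:

import optuna

def cv_score(alpha_ratio, l1_ratio):
    # hypothetical placeholder for the real sliding-window evaluation:
    # rescale alpha_ratio by alpha_max on each training window, fit the
    # estimator, and return the average validation error
    return 0.0

def objective(trial):
    # search the ratio alpha / alpha_max on a log scale, capped at 0.01
    alpha_ratio = trial.suggest_float("alpha_ratio", 1e-5, 1e-2, log=True)
    l1_ratio = trial.suggest_float("l1_ratio", 0.1, 1.0)
    return cv_score(alpha_ratio, l1_ratio)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)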

If you have other tricks to speed up cross-validation, save memory, improve results, or estimate hyperparameters, please share. Our time series have a 15-minute frequency; we forecast 5 days ahead using exactly 365 days of history. We use autoregression to extend the ElasticNet predictions to future timesteps.

  • We are extremely happy that skglm is used in a large-scale application; can we advertise it?

    I've talked with my superiors, and we are also very interested in having this mention. My company is E-REDES, a distribution system operator (DSO) that is part of the EDP group. Please contact me directly if you need further information.

Hi @carlosg-m, where can I contact you?

Hi @carlosg-m, where can I contact you?

You can try this email:

We've actually made a small mention of skglm at a Databricks event in Madrid (Data Intelligence Day):
(image attached)

@mathurinm, can you please explain this heuristic to determine max alpha?
alpha_max = np.linalg.norm(X.T @ y) / (l1_ratio * len(y))

@carlosg-m, we have added a detailed tutorial about this computation of the critical regularization parameter value for L1 models here: https://contrib.scikit-learn.org/skglm/tutorials/alpha_max.html#alpha-max