Before running the benchmark, create the data directory:

```
mkdir -p data
```
This benchmark tests a simple case where the data matrix `X` has rows sampled from `N(0, I_p)`, `y = X * beta + eps` with `eps ~ N(0, I_n)`, and `beta` has some true sparsity (i.e. some proportion of its entries are 0). See `make_data.py` for details. Note that we standardize the columns of `X`.
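The generation scheme above can be sketched as follows. This is a hypothetical illustration, not the actual contents of `make_data.py`; the sizes `n`, `p`, the sparsity proportion, and the seed are all assumptions.

```python
import numpy as np

# Hypothetical sketch of the data generation described above; see
# make_data.py for the real values of n, p, and the sparsity level.
rng = np.random.default_rng(0)
n, p, sparsity = 100, 20, 0.5  # assumed sizes and sparsity proportion

X = rng.standard_normal((n, p))          # rows ~ N(0, I_p)
beta = rng.standard_normal(p)
beta[rng.random(p) < sparsity] = 0.0     # zero out a proportion of entries
y = X @ beta + rng.standard_normal(n)    # eps ~ N(0, I_n)

# Standardize the columns of X (mean 0, unit variance).
X = (X - X.mean(axis=0)) / X.std(axis=0)
```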
The files `bench_glum.py` and `bench_glmnet.R` each provide two functions, `get_data` and `timer`. The `get_data` function simply reads the generated data, which is stored by default in `data/`, and returns an `(X, y)` pair.
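A minimal sketch of what these two helpers might look like on the Python side. The file names under `data/` and the exact signatures are assumptions, not the actual code in `bench_glum.py`:

```python
import time
import numpy as np

def get_data(path="data"):
    """Read the generated data and return an (X, y) pair.

    The CSV file names are assumptions; adapt to how make_data.py
    actually stores its output under data/.
    """
    X = np.loadtxt(f"{path}/X.csv", delimiter=",")
    y = np.loadtxt(f"{path}/y.csv", delimiter=",")
    return X, y

def timer(f, *args, **kwargs):
    """Time a single call to f and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    out = f(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return out, elapsed
```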
Please read the settings of the lasso solvers carefully. A few notes:

- The current setting benchmarks the pathwise solution.
- We supply the character string `"gaussian"` to `glmnet` to invoke the C++ routine.
- Neither method standardizes `X`.
- We only tested the lasso (`l1_ratio=1` in `glum` and `alpha=1`, the default, in `glmnet`). `min_alpha_ratio` in `glum` (or `lambda.min.ratio` in `glmnet`) was fixed to match the default behavior in `glmnet`. This way, the regularization path is exactly the same.
- We fix the solver to `irls-cd` for `glum`, just to make it explicit; according to the documentation, it uses `irls-cd` anyway.
- Because of `glmnet`'s early-stopping rules, the model may not be fit at every point on the regularization path. To make the benchmark absolutely fair, we also set the `n_alphas` parameter in `glum` to match the number of points actually realized after fitting with `glmnet`.
- Both methods use warm starts.
- Convergence criterion: `glum`'s `_cd_fast.py` in the source code appears to take the absolute difference in the `beta_i` values after each coordinate-descent sweep and check whether it is below the threshold. `glmnet` takes the maximum of the squared differences in `beta_i` after each sweep and checks whether that is below the threshold. This is why the tolerance for `glmnet` is that of `glum` squared.
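The relationship between the two stopping rules can be checked directly. This is a simplified sketch of the criteria as described above (ignoring any internal weighting the solvers may apply), not the actual source code of either library:

```python
import numpy as np

# glum-style rule: largest absolute coordinate update below tol.
def glum_converged(delta_beta, tol):
    return np.max(np.abs(delta_beta)) < tol

# glmnet-style rule: largest squared coordinate update below thresh.
def glmnet_converged(delta_beta, thresh):
    return np.max(delta_beta ** 2) < thresh

# With thresh = tol**2 the two rules accept exactly the same updates,
# since max(d**2) < tol**2  iff  max(|d|) < tol for tol > 0.
tol = 1e-4
delta = np.array([5e-5, -2e-5, 9e-5])
assert glum_converged(delta, tol) == glmnet_converged(delta, tol ** 2)
```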
The workflow is as follows:

- In each of `bench_glum.py`, `bench_glmnet.R`, and `make_data.py`, change `n, p`.
- If you haven't created data for the current values of `n, p`, run:

  ```
  python make_data.py
  ```

- Run the benchmarks separately:

  ```
  python bench_glum.py
  Rscript bench_glmnet.R
  ```

- Double-check that the outputs match (more or less) and that the number of iterations has not reached the max, then compare the elapsed times.
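For the last step, a small helper along these lines can check agreement between the two coefficient paths. How each benchmark saves its coefficients is not specified above, so the function takes arrays directly; the tolerances are assumptions:

```python
import numpy as np

def outputs_match(glum_coefs, glmnet_coefs, rtol=1e-3, atol=1e-6):
    """Return True if the two coefficient arrays agree within tolerance.

    The solvers use different (but matched) stopping rules, so exact
    equality is not expected; compare up to a small tolerance instead.
    """
    glum_coefs = np.asarray(glum_coefs)
    glmnet_coefs = np.asarray(glmnet_coefs)
    return glum_coefs.shape == glmnet_coefs.shape and np.allclose(
        glum_coefs, glmnet_coefs, rtol=rtol, atol=atol
    )
```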