JuliaStats / GLM.jl

Generalized linear models in Julia

Prediction on a linear model changed between versions 1.3.11 and 1.4.0

piever opened this issue

In particular the reported confidence interval is different. On GLM v1.3.11 I get

julia> using GLM

julia> x = [1.0 1.0
            1.0 2.0
            1.0 -1.0];

julia> y = [1.0, 3.0, -2.0];

julia> m = lm(x, y);

julia> predict(m, x, interval=:confidence)
(prediction = [1.2142857142857142, 2.857142857142857, -2.0714285714285716], lower = [-0.8151383947225999; -0.012896241623634452; -5.343776620915804], upper = [3.2437098232940285; 5.727181955909349; 1.200919478058661])

whereas v1.4.0 gives

julia> using GLM

julia> x = [1.0 1.0
            1.0 2.0
            1.0 -1.0];

julia> y = [1.0, 3.0, -2.0];

julia> m = lm(x, y);

julia> predict(m, x, interval=:confidence)
(prediction = [1.2142857142857146, 2.8571428571428585, -2.0714285714285725], lower = [-0.8151383947225981; -1.0989330286373797; -5.343776620915804], upper = [3.2437098232940276; 6.813218742923096; 1.2009194780586587])

I don't think this is just numerical instability; larger examples, e.g.

julia> x = hcat(fill(1, 100), 1:100);

julia> y = [i + 0.1 * isodd(i) for i in 1:100];

yield even larger differences.
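For completeness, here is a sketch of how one might quantify that gap on the 100-point example, assuming GLM ≥ 1.4 (where `dropcollinear` defaults to `true`); variable names are mine:

```julia
using GLM

x = hcat(fill(1.0, 100), 1.0:100.0)
y = [i + 0.1 * isodd(i) for i in 1:100]

m_drop = lm(x, y)                        # dropcollinear defaults to true on v1.4
m_keep = lm(x, y, dropcollinear=false)   # reproduces the v1.3.11 behaviour

p_drop = predict(m_drop, x, interval=:confidence)
p_keep = predict(m_keep, x, interval=:confidence)

# Point predictions agree to machine precision; on v1.4.0 it is only
# the interval bounds that diverge between the two fits.
println(maximum(abs.(p_drop.prediction .- p_keep.prediction)))
println(maximum(abs.(vec(p_drop.lower) .- vec(p_keep.lower))))
```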

Actually, that's because allowrankdeficient now defaults to true (and has been renamed to dropcollinear) since #386. With dropcollinear=false I get the same results on 1.4.0 as on 1.3.11. Not sure exactly what happens under the hood.

@pdeffebach @mkborregaard @dmbates @andreasnoack @Nosferican @DilumAluthge Any ideas why prediction intervals are incorrect when allowrankdeficient=true/dropcollinear=true? Coefficients are not affected. I guess this line is the problem:

R = cholesky!(mm.pp).U #get the R matrix from the QR factorization
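To see why that line is suspect: for a pivoted Cholesky factorization, `U'U` reconstructs the column-permuted matrix, not the original one, so using `U` directly pairs the wrong entries with the new-data columns. A minimal sketch with a made-up positive-definite matrix (only LinearAlgebra is assumed):

```julia
using LinearAlgebra

# A small positive-definite matrix whose largest diagonal entry is not first,
# so the pivoted factorization actually permutes columns.
A = [1.0 0.5 0.2
     0.5 4.0 0.3
     0.2 0.3 2.0]

C = cholesky(A, Val(true))   # pivoted Cholesky (RowMaximum() on newer Julia)

# The factorization satisfies U'U ≈ A[piv, piv], i.e. the *permuted* matrix:
@assert C.U' * C.U ≈ A[C.piv, C.piv]

# Without undoing the pivot, U'U is not A itself:
@assert !(C.U' * C.U ≈ A)
```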

Nice detective work! I still think it's problematic that dropcollinear leads to radically different confidence intervals.

For a more realistic example (without collinear columns):

julia> using RDatasets, GLM

julia> mpg = dataset("ggplot2", "mpg");

julia> m = lm(hcat(fill(1, size(mpg, 1)), mpg.Displ), mpg.Hwy, dropcollinear=true);

julia> x_new = hcat(fill(1, 100), range(extrema(mpg.Displ)..., length=100));

julia> predict(m, x_new, interval = :confidence)
(prediction = [30.048708961974327, 29.856131390728752, 29.663553819483173, 29.470976248237594, 29.27839867699202, 29.08582110574644, 28.89324353450086, 28.700665963255283, 28.508088392009704, 28.315510820764125  …  12.716727549872292, 12.524149978626713, 12.331572407381135, 12.138994836135556, 11.946417264889977, 11.753839693644398, 11.561262122398823, 11.368684551153244, 11.176106979907665, 10.983529408662086], lower = [28.132450862400983; 27.86263756497239; … ; 1.6767660258642447; 1.4067797464653111], upper = [31.96496706154767; 31.849625216485116; … ; 20.675447933951084; 20.56027907085886])

julia> m1 = lm(hcat(fill(1, size(mpg, 1)), mpg.Displ), mpg.Hwy, dropcollinear=false);

julia> predict(m1, x_new, interval = :confidence)
(prediction = [30.0487089619743, 29.85613139072872, 29.663553819483145, 29.470976248237566, 29.27839867699199, 29.08582110574641, 28.893243534500833, 28.700665963255258, 28.508088392009682, 28.315510820764104  …  12.716727549872338, 12.524149978626763, 12.331572407381184, 12.138994836135605, 11.94641726489003, 11.753839693644451, 11.561262122398876, 11.368684551153297, 11.176106979907722, 10.983529408662143], lower = [29.177681481115847; 29.002237270865304; … ; 9.75613850053743; 9.543944786597844], upper = [30.91973644283275; 30.710025510592136; … ; 12.596075459278014; 12.423114030726442])

Visually:

[figure "confidence": the fitted line with confidence bands for dropcollinear=true vs dropcollinear=false]

I think the fix is trivial: we just need to take into account that rows/columns have been pivoted:

    R = chol isa CholeskyPivoted ? chol.U[chol.piv, chol.piv] : chol.U
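A quick sanity check of that line on a design where pivoting actually occurs (variable names are mine; note that with a two-column design the pivot is at most a swap, which is its own inverse, so permuting both rows and columns of `U` restores a valid square root of `X'X`):

```julia
using LinearAlgebra

X = [1.0 3.0
     1.0 2.0
     1.0 -1.0]
A = X' * X                    # largest diagonal entry is second, so the
                              # pivoted factorization picks column 2 first
chol = cholesky(A, Val(true))
@assert chol.piv == [2, 1]    # the factorization did permute

R = chol isa CholeskyPivoted ? chol.U[chol.piv, chol.piv] : chol.U
@assert R' * R ≈ A            # R is again a valid factor of X'X
```

For designs with more than two columns the permutation need not be its own inverse, so it would be worth checking that `R' * R` still reproduces `X' * X` in that case as well.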