Prediction on a linear model changed between versions 1.3.11 and 1.4.0
piever opened this issue
In particular, the reported confidence intervals differ. On GLM v1.3.11 I get:
julia> using GLM
julia> x = [1.0 1.0
1.0 2.0
1.0 -1.0];
julia> y = [1.0, 3.0, -2.0];
julia> m = lm(x, y);
julia> predict(m, x, interval=:confidence)
(prediction = [1.2142857142857142, 2.857142857142857, -2.0714285714285716], lower = [-0.8151383947225999; -0.012896241623634452; -5.343776620915804], upper = [3.2437098232940285; 5.727181955909349; 1.200919478058661])
whereas v1.4.0 gives
julia> using GLM
julia> x = [1.0 1.0
1.0 2.0
1.0 -1.0];
julia> y = [1.0, 3.0, -2.0];
julia> m = lm(x, y);
julia> predict(m, x, interval=:confidence)
(prediction = [1.2142857142857146, 2.8571428571428585, -2.0714285714285725], lower = [-0.8151383947225981; -1.0989330286373797; -5.343776620915804], upper = [3.2437098232940276; 6.813218742923096; 1.2009194780586587])
I don't think it is just numerical instability; larger examples, e.g.
julia> x = hcat(fill(1, 100), 1:100);
julia> y = [i + 0.1 * isodd(i) for i in 1:100];
yield even larger differences.
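A sketch completing that larger example — the exact interval values depend on the GLM version, so none are shown here; the dropcollinear keyword usage follows the examples in this thread:

```julia
using GLM

# Larger example from above; dropcollinear controls whether the
# pivoted Cholesky path (the 1.4.0 default) is used.
x = hcat(fill(1, 100), 1:100)
y = [i + 0.1 * isodd(i) for i in 1:100]

m_pivoted   = lm(x, y; dropcollinear=true)   # 1.4.0 default
m_unpivoted = lm(x, y; dropcollinear=false)  # matches the 1.3.11 result

ci_pivoted   = predict(m_pivoted,   x, interval=:confidence)
ci_unpivoted = predict(m_unpivoted, x, interval=:confidence)

# Point predictions agree to rounding; the interval bounds do not.
maximum(abs.(ci_pivoted.lower .- ci_unpivoted.lower))
```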
Actually that's because allowrankdeficient now defaults to true (and has been renamed to dropcollinear) since #386. With dropcollinear=false I get the same on 1.4.0 as on 1.3.11. Not sure exactly what happens under the hood.
@pdeffebach @mkborregaard @dmbates @andreasnoack @Nosferican @DilumAluthge Any ideas why prediction intervals are incorrect when allowrankdeficient=true/dropcollinear=true? Coefficients are not affected. I guess this line is the problem:
Line 227 in a587711
Nice detective work! I still think it's problematic that dropcollinear leads to radically different confidence intervals.
For a more realistic example (without collinear columns):
julia> using RDatasets, GLM
julia> mpg = dataset("ggplot2", "mpg");
julia> m = lm(hcat(fill(1, size(mpg, 1)), mpg.Displ), mpg.Hwy, dropcollinear=true);
julia> x_new = hcat(fill(1, 100), range(extrema(mpg.Displ)..., length=100));
julia> predict(m, x_new, interval = :confidence)
(prediction = [30.048708961974327, 29.856131390728752, 29.663553819483173, 29.470976248237594, 29.27839867699202, 29.08582110574644, 28.89324353450086, 28.700665963255283, 28.508088392009704, 28.315510820764125 … 12.716727549872292, 12.524149978626713, 12.331572407381135, 12.138994836135556, 11.946417264889977, 11.753839693644398, 11.561262122398823, 11.368684551153244, 11.176106979907665, 10.983529408662086], lower = [28.132450862400983; 27.86263756497239; … ; 1.6767660258642447; 1.4067797464653111], upper = [31.96496706154767; 31.849625216485116; … ; 20.675447933951084; 20.56027907085886])
julia> m1 = lm(hcat(fill(1, size(mpg, 1)), mpg.Displ), mpg.Hwy, dropcollinear=false);
julia> predict(m1, x_new, interval = :confidence)
(prediction = [30.0487089619743, 29.85613139072872, 29.663553819483145, 29.470976248237566, 29.27839867699199, 29.08582110574641, 28.893243534500833, 28.700665963255258, 28.508088392009682, 28.315510820764104 … 12.716727549872338, 12.524149978626763, 12.331572407381184, 12.138994836135605, 11.94641726489003, 11.753839693644451, 11.561262122398876, 11.368684551153297, 11.176106979907722, 10.983529408662143], lower = [29.177681481115847; 29.002237270865304; … ; 9.75613850053743; 9.543944786597844], upper = [30.91973644283275; 30.710025510592136; … ; 12.596075459278014; 12.423114030726442])
Visually:
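The attached figure is not reproduced here; a sketch along these lines would recreate the two confidence bands (Plots.jl and the ribbon styling are assumptions on my part, not from the issue):

```julia
using GLM, RDatasets, Plots

# Recreate the two confidence bands from the mpg example above.
mpg = dataset("ggplot2", "mpg")
X = hcat(fill(1, size(mpg, 1)), mpg.Displ)
x_new = hcat(fill(1, 100), range(extrema(mpg.Displ)..., length=100))

plt = scatter(mpg.Displ, mpg.Hwy, color=:gray, label="data")
for flag in (true, false)
    m = lm(X, mpg.Hwy; dropcollinear=flag)
    ci = predict(m, x_new, interval=:confidence)
    plot!(plt, x_new[:, 2], vec(ci.prediction);
          ribbon=(vec(ci.prediction) .- vec(ci.lower),
                  vec(ci.upper) .- vec(ci.prediction)),
          label="dropcollinear=$flag")
end
plt
```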
I think the fix is trivial: we just need to take into account that rows/columns have been pivoted:
R = chol isa CholeskyPivoted ? chol.U[:, invperm(chol.piv)] : chol.U
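The pivot bookkeeping can be checked in isolation. In the sketch below the matrix values are illustrative only, and cholesky(A, RowMaximum()) is the Julia ≥ 1.8 spelling of a pivoted factorization (older versions use cholesky(A, Val(true))):

```julia
using LinearAlgebra

# Illustrative Gram matrix standing in for X'X; values are made up.
A = [4.0 2.0 0.5
     2.0 5.0 1.0
     0.5 1.0 3.0]

F = cholesky(A, RowMaximum())  # F isa CholeskyPivoted

# F.U alone only reconstructs the *permuted* matrix ...
@assert F.U' * F.U ≈ A[F.piv, F.piv]

# ... so using F.U directly in predict mixes up coordinates.
# Undoing the pivot recovers the original Gram matrix:
R = F.U[:, invperm(F.piv)]
@assert R' * R ≈ A
```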