predict gives incorrect CIs when collinear columns have been dropped
nalimilan opened this issue
As noted at #410, `predict` appears to give incorrect results when some predictors are dropped because of collinearity. We should really do something about this.
```julia
julia> x = [1.0  1.0  2.0
            1.0  2.0  3.0
            1.0 -1.0  0.0];

julia> y = [1.0, 3.0, -2.0];

julia> m1 = lm(x, y, dropcollinear=true)
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────
        Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────
x1   0.0         NaN       NaN       NaN     NaN        NaN
x2   2.07143       0.257539   8.04    0.0787   -1.20092    5.34378
x3  -0.428571      0.174964  -2.45    0.2468   -2.65169    1.79455
─────────────────────────────────────────────────────────────────

julia> m2 = lm(x[:, 1:2], y, dropcollinear=false)
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
────────────────────────────────────────────────────────────────
        Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────
x1  -0.428571    0.174964  -2.45    0.2468  -2.65169    1.79455
x2   1.64286     0.123718  13.28    0.0479   0.070872   3.21484
────────────────────────────────────────────────────────────────

julia> p1 = predict(m1, x, interval=:confidence);

julia> p2 = predict(m2, x[:, 1:2], interval=:confidence);

julia> @test p1.prediction ≈ p2.prediction
Test Passed

julia> @test p1.upper ≈ p2.upper
Test Failed at REPL[254]:1
  Expression: p1.upper ≈ p2.upper
   Evaluated: [3.354342064529162; 5.728437323463455; 1.6152034019232366] ≈ [3.2437098232940285; 5.727181955909349; 1.200919478058661]
ERROR: There was an error during testing

julia> @test p1.lower ≈ p2.lower
Test Failed at REPL[255]:1
  Expression: p1.lower ≈ p2.lower
   Evaluated: [-0.925770635957734; -0.014151609177741165; -5.758060544780381] ≈ [-0.8151383947225999; -0.012896241623634452; -5.343776620915804]
ERROR: There was an error during testing
```
In the worst case we could throw an error when calling `predict` in such cases until we have a fix.
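A minimal sketch of such a guard, in Python/numpy purely for illustration (this is not the GLM.jl API; `predict_with_interval` and its signature are hypothetical): refuse to compute anything interval-related when the fitted design matrix is rank-deficient.

```python
import numpy as np

def predict_with_interval(X_fit, X_new, beta):
    """Hypothetical predict: error out on a rank-deficient fit
    instead of silently returning invalid confidence intervals."""
    rank = np.linalg.matrix_rank(X_fit)
    if rank < X_fit.shape[1]:
        raise ValueError(
            "confidence intervals are not supported for a rank-deficient fit"
        )
    return X_new @ beta

# The design matrix from the example above is rank-deficient
# (the third column equals the sum of the first two), so the guard fires:
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, -1.0, 0.0]])
try:
    predict_with_interval(X, X, np.zeros(3))
except ValueError as e:
    print(e)
```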
(BTW I'm not sure why the intercept is dropped in this example rather than the last column.)
Cc: @piever @pdeffebach @mkborregaard @dmbates @andreasnoack @Nosferican @DilumAluthge
Thanks for highlighting this.
I can confirm that R does not have this behavior.
The error stems from this line, where we work with the pivoted Cholesky factor instead of the design matrix from the original regression.
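For reference, the numbers R (and `m2`) report can be reproduced by hand: the confidence interval for a fitted value is fit ± t · s · sqrt(xᵢ' (X'X)⁻¹ xᵢ), computed from the full-rank reduced design. A numpy sketch (not GLM.jl code; the hard-coded t-quantile `qt(0.975, df = 1) ≈ 12.7062` is the only outside number) matches R's output above to rounding:

```python
import numpy as np

X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, -1.0, 0.0]])
y = np.array([1.0, 3.0, -2.0])

Z = X[:, :2]                     # drop the collinear third column (col3 = col1 + col2)
XtX_inv = np.linalg.inv(Z.T @ Z)
beta = XtX_inv @ Z.T @ y         # ≈ [-0.428571, 1.64286], matching m2
fit = Z @ beta
resid = y - fit
s2 = resid @ resid / (len(y) - Z.shape[1])            # residual variance, df = 3 - 2 = 1
se_fit = np.sqrt(s2 * np.einsum('ij,jk,ik->i', Z, XtX_inv, Z))
t_crit = 12.7062                 # qt(0.975, df = 1)
lower = fit - t_crit * se_fit    # ≈ [-0.8151, -0.0129, -5.3438]
upper = fit + t_crit * se_fit    # ≈ [ 3.2437,  5.7272,  1.2009]
```

These are the `p2` values above; `p1` differs because it plugs the pivoted factor into the same formula.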
I haven't tracked down exactly how R handles it, but there is an interesting warning here:
```r
p <- object$rank
p1 <- seq_len(p)
piv <- if(p) qr.lm(object)$pivot[p1]
if(p < ncol(X) && !(missing(newdata) || is.null(newdata)))
    warning("prediction from a rank-deficient fit may be misleading")
```
I can trigger the warning:

```r
r$> predict(m1, as.data.frame(x), interval = "confidence")
        fit         lwr      upr
1  1.214286 -0.81513839 3.243710
2  2.857143 -0.01289624 5.727182
3 -2.071429 -5.34377662 1.200919
Warning message:
In predict.lm(m1, as.data.frame(x), interval = "confidence") :
  prediction from a rank-deficient fit may be misleading
```
Thanks. So apparently R doesn't really know how to handle it either, so we should probably throw an error (I don't like warnings that tell you we've returned invalid results).