GLM fails to match certified values for the Filip dataset
mousum-github opened this issue
In response to industrial concerns about the numerical accuracy of computations from statistical software, the Statistical Engineering and Mathematical and Computational Sciences Divisions of NIST's Information Technology Laboratory provide datasets with certified values for various statistical methods.
One of the linear regression datasets is Filip: https://www.itl.nist.gov/div898/strd/lls/data/Filip.shtml
For this dataset, the `lm` function with the QR decomposition method fails to estimate all parameters:
```
LinearModel

y ~ 1 + x + :(x ^ 2) + :(x ^ 3) + :(x ^ 4) + :(x ^ 5) + :(x ^ 6) + :(x ^ 7) + :(x ^ 8) + :(x ^ 9) + :(x ^ 10)

Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error      t  Pr(>|t|)     Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   8.13378       9.54815      0.85    0.3971  -10.9001       27.1677
x             0.0         NaN          NaN     NaN       NaN           NaN
x ^ 2        -7.14418      14.9533     -0.48    0.6343   -36.953       22.6647
x ^ 3        -4.53337      14.5308     -0.31    0.7560   -33.5         24.4333
x ^ 4        -0.881151      6.84421    -0.13    0.8979   -14.5248      12.7625
x ^ 5         0.135741      1.93585     0.07    0.9443    -3.7233       3.99479
x ^ 6         0.0989815     0.351359    0.28    0.7790    -0.601441     0.799403
x ^ 7         0.0207993     0.0413984   0.50    0.6169    -0.0617269    0.103326
x ^ 8         0.0022362     0.00307053  0.73    0.4688    -0.0038848    0.0083572
x ^ 9         0.000124681   0.000130509 0.96    0.3426    -0.000135484  0.000384846
x ^ 10.0      2.86391e-6    2.42701e-6  1.18    0.2419    -1.97424e-6   7.70207e-6
────────────────────────────────────────────────────────────────────────────────────────
```
whereas the `lm` function with the Cholesky decomposition method fails to estimate 6 parameters:
```
LinearModel

y ~ 1 + x + :(x ^ 2) + :(x ^ 3) + :(x ^ 4) + :(x ^ 5) + :(x ^ 6) + :(x ^ 7) + :(x ^ 8) + :(x ^ 9) + :(x ^ 10)

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                   Coef.    Std. Error      t  Pr(>|t|)    Lower 95%    Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)  0.0           NaN           NaN    NaN       NaN          NaN
x            0.0           NaN           NaN    NaN       NaN          NaN
x ^ 2        0.0           NaN           NaN    NaN       NaN          NaN
x ^ 3        0.0           NaN           NaN    NaN       NaN          NaN
x ^ 4        0.0           NaN           NaN    NaN       NaN          NaN
x ^ 5        0.0           NaN           NaN    NaN       NaN          NaN
x ^ 6        0.00368644    0.000219028  16.83    <1e-26    0.0032503    0.00412258
x ^ 7        0.00191731    0.000127438  15.05    <1e-23    0.00166355   0.00217107
x ^ 8        0.000375885   2.75293e-5   13.65    <1e-21    0.000321067  0.000430703
x ^ 9        3.28191e-5    2.61667e-6   12.54    <1e-19    2.76087e-5   3.80296e-5
x ^ 10       1.07467e-6    9.23624e-8   11.64    <1e-17    8.90753e-7   1.25859e-6
───────────────────────────────────────────────────────────────────────────────────────
```
The following table shows that the estimates from the `lm` function in R are close to the certified values.
| | Certified Values | R Values | Julia (QR) | Julia (Cholesky) |
|---|---|---|---|---|
| Beta_0 | -1467.4896142298000000 | -1467.4895242210000000 | 8.1337808176521600 | 0.0000000000000000 |
| Beta_1 | -2772.1795919334200000 | -2772.1794222850000000 | 0.0000000000000000 | 0.0000000000000000 |
| Beta_2 | -2316.3710816089300000 | -2316.3709402080000000 | -7.1441779010897300 | 0.0000000000000000 |
| Beta_3 | -1127.9739409837200000 | -1127.9738723330000000 | -4.5333687533072200 | 0.0000000000000000 |
| Beta_4 | -354.4782337033490000 | -354.4782121956000000 | -0.8811512968692300 | 0.0000000000000000 |
| Beta_5 | -75.1242017393757000 | -75.1241971938900000 | 0.1357405515617050 | 0.0000000000000000 |
| Beta_6 | -10.8753180355343000 | -10.8753173788800000 | 0.0989814591132414 | 0.0036864417075102 |
| Beta_7 | -1.0622149858894700 | -1.0622149218240000 | 0.0207993361386816 | 0.0019173118731864 |
| Beta_8 | -0.0670191154593408 | -0.0670191114167300 | 0.0022361990171227 | 0.0003758850326059 |
| Beta_9 | -0.0024678107827548 | -0.0024678106336760 | 0.0001246810033160 | 0.0000328191326413 |
| Beta_10 | -0.0000402962525080 | -0.0000402962500667 | 0.0000028639148280 | 0.0000010746699638 |
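As a side note, NIST's accuracy criterion can be summarized as the "log relative error": roughly the number of leading decimal digits on which an estimate and the certified value agree. A minimal sketch, using the Beta_10 values from the table above:

```julia
# Log relative error (LRE): roughly the number of leading decimal digits
# on which an estimate agrees with the certified value.
lre(est, cert) = -log10(abs(est - cert) / abs(cert))

cert   = -0.0000402962525080   # certified Beta_10
r_est  = -0.0000402962500667   # R's estimate
qr_est =  0.0000028639148280   # Julia (QR) estimate

lre(r_est, cert)   # ≈ 7.2 digits of agreement
lre(qr_est, cert)  # ≈ -0.03, i.e. no digits of agreement
```

So R recovers about 7 digits here, while the Julia QR estimate agrees on none.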
Moreover,

```julia
using CSV, DataFrames, GLM, LinearAlgebra

df_filip = CSV.read("filip.csv", DataFrame);
model_filip_qr = lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr);
X = modelmatrix(model_filip_qr);
y = response(model_filip_qr);
rank(X)
# 10
rank(X'X)
# 5
```
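The gap between the two numerical ranks is expected in floating point: forming `X'X` squares the singular values, so values that sit just above `rank`'s tolerance for `X` fall below it for `X'X`. A synthetic sketch (the singular-value profile here is made up, not Filip's):

```julia
using LinearAlgebra

# Synthetic singular-value profile decaying over 20 orders of magnitude
s = 10.0 .^ (0:-2:-20)
A = Matrix(Diagonal(s))

# rank() counts singular values above ~ min(size(A)...) * eps() * σmax.
# Squaring the matrix makes the singular values decay twice as fast,
# so fewer of them survive the cutoff.
rank(A), rank(A' * A)   # the second rank is smaller
```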
It might be because of a limitation of LAPACK.
Now let's try the same with `BigFloat`:

```julia
using GenericLinearAlgebra

X = big.(modelmatrix(model_filip_qr))
rank(X)
# 11
rank(X'X)
# 11
```
and finally,

```julia
julia> inv(X'X) * X'y
11-element Vector{BigFloat}:
 -1467.4896101225341941
 -2772.1795838225322676
 -2316.3710745083318604
 -1127.9739373552426084
  -354.47823250481932410
   -75.1242014719792427
   -10.8753179947227183
    -1.0622149816811447
    -0.0670191151786945
    -0.0024678107718217
    -4.0296252319e-05
```
I don't think it's surprising that a coefficient is dropped, since the default is `dropcollinear = true` and

```julia
julia> cond(X)
1.7679681910558162e15
```
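For context on that condition number, here is a sketch with a synthetic degree-10 polynomial design matrix over roughly the Filip x-range (about -8.8 to -3.1; the real x values come from filip.csv, so the number below is only indicative). A condition number near 1/eps(Float64) ≈ 4.5e15 means Float64 has almost no accurate digits to spare, and forming the normal equations X'X squares it, which is why Cholesky breaks down:

```julia
using LinearAlgebra

x = range(-8.8, -3.1, length=82)    # synthetic stand-in for Filip's x values
V = [xi^p for xi in x, p in 0:10]   # degree-10 polynomial design matrix

cond(V)   # on the order of 1e15, close to 1/eps(Float64) ≈ 4.5e15
```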
It's surprising, though, that we error out when setting `dropcollinear = false`:

```julia
julia> lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr, dropcollinear=false)
ERROR: RankDeficientException(10)
Stacktrace:
 [1] delbeta!(p::GLM.DensePredQR{Float64, LinearAlgebra.QRCompactWY}, r::Vector{Float64})
   @ GLM ~/.julia/dev/GLM/src/linpred.jl:66
 [2] fit!(obj::LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredQR{Float64, LinearAlgebra.QRCompactWY}})
   @ GLM ~/.julia/dev/GLM/src/lm.jl:97
 [3] fit(::Type{…}, f::FormulaTerm{…}, data::DataFrame, allowrankdeficient_dep::Nothing; wts::Nothing, dropcollinear::Bool, method::Symbol, contrasts::Dict{…})
   @ GLM ~/.julia/dev/GLM/src/lm.jl:0
 [4] fit
   @ ~/.julia/dev/GLM/src/lm.jl:149 [inlined]
 [5] lm
   @ ~/.julia/dev/GLM/src/lm.jl:179 [inlined]
 [6] top-level scope
   @ REPL[42]:1
Some type information was truncated. Use `show(err)` to see complete types.
```
The problem is the check in Line 66 in e2f6c98. If the user explicitly passes `dropcollinear = false`, then we shouldn't try to outsmart her by computing a numerical rank. We should just check that there isn't a zero element on the diagonal of `R`.
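A minimal sketch of what that check could look like (hypothetical helper, not GLM's actual code): with `dropcollinear = false`, only raise on an exact zero pivot, i.e. a zero on the diagonal of `R`.

```julia
using LinearAlgebra

# Hypothetical helper (not GLM's actual implementation): flag only an exact
# structural rank deficiency - a zero on the diagonal of R - instead of
# comparing against a numerical rank estimate.
function assert_exact_full_rank(R::AbstractMatrix)
    i = findfirst(iszero, diag(R))
    i === nothing || throw(LinearAlgebra.RankDeficientException(i))
    return nothing
end
```

For the Filip fit this check would pass, since none of R's diagonal entries is exactly zero.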
However, to my surprise, fixing this still isn't sufficient to match the reference estimates: LAPACK's pivoted QR appears to lose some precision relative to the unpivoted QR.
```julia
julia> qr(X, ColumnNorm()) \ df_filip.y
11-element Vector{Float64}:
  9.01342654713381
  1.6525459423443027
 -5.767606630529379
 -3.8636659613827122
 -0.6703658653069542
  0.1806043161686472
  0.10552342922775357
  0.0214449396217032
  0.0022774832947497475
  0.00012622643173822303
  2.8896433329375396e-6

julia> qr(X) \ df_filip.y
11-element Vector{Float64}:
 -1467.4896018652016
 -2772.179569812206
 -2316.3710641177045
 -1127.9739329347135
  -354.4782313151358
   -75.12420126161706
   -10.875317970213466
    -1.0622149798557083
    -0.06701911509851412
    -0.002467810770121917
    -4.029625231108671e-5
```
```julia
julia> Q, R = qr(X);

julia> diag(R)
11-element Vector{Float64}:
   -9.055385138137417
  -13.532654650686624
  -21.825229067114073
   30.328064942426202
   44.4823844079297
   61.77383426163982
   90.2629854477562
 -127.0559597329256
  186.65576017620813
  253.04777394199888
 -373.39818100600036

julia> rank(R)
10
```
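The resolution is that `rank` is not computed from the diagonal of R at all: it counts singular values above a relative tolerance, by default roughly `min(size(A)...) * eps() * σmax`. A tiny but nonzero pivot can therefore coexist with a deficient numerical rank. A minimal sketch:

```julia
using LinearAlgebra

A = triu(ones(3, 3))
A[3, 3] = 1e-20   # tiny but nonzero pivot

diag(A)   # no exact zeros on the diagonal
rank(A)   # 2: the smallest singular value falls below the tolerance
```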
I was also a bit surprised when I checked the diagonal elements of R: none of them is zero, yet `rank(R)` returns 10.
In fact, the `inv` function happily computes the inverse of `R`:

```julia
julia> Rinv = inv(R);

julia> Rinv * Rinv' * X'y
11-element Vector{Float64}:
 -1467.489589088171
 -2772.1795409834594
 -2316.371035859374
 -1127.973917018297
  -354.4782255940765
   -75.12419988749059
   -10.875317746475703
    -1.0622149554340137
    -0.06701911338591168
    -0.0024678107003603396
    -4.029625105615225e-5
```
which is again a little bit different from what is generated by `qr(X) \ y`.
The QR-based least squares solution is

```julia
julia> F = qr(X);

julia> F.R \ (F.Q'*df_filip.y)[1:size(F, 2)]
11-element Vector{Float64}:
 -1467.4896018652016
 -2772.179569812206
 -2316.3710641177045
 -1127.9739329347135
  -354.4782313151358
   -75.12420126161706
   -10.875317970213466
    -1.0622149798557083
    -0.06701911509851412
    -0.002467810770121917
    -4.029625231108671e-5
```
Oh. Now I realize what is going on in the pivoted case. There is a rank determination by default, so the relevant comparison is

```julia
julia> ldiv!(qr(X, ColumnNorm()), df_filip.y[:,:], 0.0)[1][1:size(X, 2)]
11-element Vector{Float64}:
 -1467.4896118403967
 -2772.1795879957995
 -2316.371078886713
 -1127.9739399822226
  -354.4782335067247
   -75.12420172645771
   -10.875318038405625
    -1.062214986693476
    -0.06701911554715413
    -0.002467810787511543
    -4.029625261329208e-5
```
where the third argument to `ldiv!` is the tolerance used when determining the numerical rank. To conclude, I think we just need to get rid of/adjust the check in Line 66 in e2f6c98 when `dropcollinear = false`.