GLM fails to match certified values for the Filip dataset
mousum-github opened this issue
In response to industrial concerns about the numerical accuracy of computations from statistical software, the Statistical Engineering and Mathematical and Computational Sciences Divisions of NIST's Information Technology Laboratory provide datasets with certified values for various statistical methods.
One of the linear regression datasets is Filip: https://www.itl.nist.gov/div898/strd/lls/data/Filip.shtml
For this dataset, the `lm` function with the QR decomposition method fails to estimate all parameters:
```
LinearModel

y ~ 1 + x + :(x ^ 2) + :(x ^ 3) + :(x ^ 4) + :(x ^ 5) + :(x ^ 6) + :(x ^ 7) + :(x ^ 8) + :(x ^ 9) + :(x ^ 10)

Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error      t  Pr(>|t|)     Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   8.13378       9.54815      0.85    0.3971  -10.9001       27.1677
x             0.0         NaN          NaN     NaN       NaN           NaN
x ^ 2        -7.14418      14.9533     -0.48    0.6343   -36.953       22.6647
x ^ 3        -4.53337      14.5308     -0.31    0.7560   -33.5         24.4333
x ^ 4        -0.881151      6.84421    -0.13    0.8979   -14.5248      12.7625
x ^ 5         0.135741      1.93585     0.07    0.9443    -3.7233       3.99479
x ^ 6         0.0989815     0.351359    0.28    0.7790    -0.601441     0.799403
x ^ 7         0.0207993     0.0413984   0.50    0.6169    -0.0617269    0.103326
x ^ 8         0.0022362     0.00307053  0.73    0.4688    -0.0038848    0.0083572
x ^ 9         0.000124681   0.000130509 0.96    0.3426    -0.000135484  0.000384846
x ^ 10.0      2.86391e-6    2.42701e-6  1.18    0.2419    -1.97424e-6   7.70207e-6
────────────────────────────────────────────────────────────────────────────────────────
```
whereas the `lm` function with the Cholesky decomposition method fails to estimate 6 parameters:
```
LinearModel

y ~ 1 + x + :(x ^ 2) + :(x ^ 3) + :(x ^ 4) + :(x ^ 5) + :(x ^ 6) + :(x ^ 7) + :(x ^ 8) + :(x ^ 9) + :(x ^ 10)

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                   Coef.    Std. Error      t  Pr(>|t|)    Lower 95%    Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)  0.0           NaN           NaN    NaN       NaN          NaN
x            0.0           NaN           NaN    NaN       NaN          NaN
x ^ 2        0.0           NaN           NaN    NaN       NaN          NaN
x ^ 3        0.0           NaN           NaN    NaN       NaN          NaN
x ^ 4        0.0           NaN           NaN    NaN       NaN          NaN
x ^ 5        0.0           NaN           NaN    NaN       NaN          NaN
x ^ 6        0.00368644    0.000219028  16.83    <1e-26    0.0032503    0.00412258
x ^ 7        0.00191731    0.000127438  15.05    <1e-23    0.00166355   0.00217107
x ^ 8        0.000375885   2.75293e-5   13.65    <1e-21    0.000321067  0.000430703
x ^ 9        3.28191e-5    2.61667e-6   12.54    <1e-19    2.76087e-5   3.80296e-5
x ^ 10       1.07467e-6    9.23624e-8   11.64    <1e-17    8.90753e-7   1.25859e-6
───────────────────────────────────────────────────────────────────────────────────────
```
The following table shows that the estimates from the `lm` function in R are close to the certified values.
| | Certified Values | R Values | Julia (QR) | Julia (Cholesky) |
|---|---|---|---|---|
| Beta_0 | -1467.4896142298000000 | -1467.4895242210000000 | 8.1337808176521600 | 0.0000000000000000 |
| Beta_1 | -2772.1795919334200000 | -2772.1794222850000000 | 0.0000000000000000 | 0.0000000000000000 |
| Beta_2 | -2316.3710816089300000 | -2316.3709402080000000 | -7.1441779010897300 | 0.0000000000000000 |
| Beta_3 | -1127.9739409837200000 | -1127.9738723330000000 | -4.5333687533072200 | 0.0000000000000000 |
| Beta_4 | -354.4782337033490000 | -354.4782121956000000 | -0.8811512968692300 | 0.0000000000000000 |
| Beta_5 | -75.1242017393757000 | -75.1241971938900000 | 0.1357405515617050 | 0.0000000000000000 |
| Beta_6 | -10.8753180355343000 | -10.8753173788800000 | 0.0989814591132414 | 0.0036864417075102 |
| Beta_7 | -1.0622149858894700 | -1.0622149218240000 | 0.0207993361386816 | 0.0019173118731864 |
| Beta_8 | -0.0670191154593408 | -0.0670191114167300 | 0.0022361990171227 | 0.0003758850326059 |
| Beta_9 | -0.0024678107827548 | -0.0024678106336760 | 0.0001246810033160 | 0.0000328191326413 |
| Beta_10 | -0.0000402962525080 | -0.0000402962500667 | 0.0000028639148280 | 0.0000010746699638 |
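As a side note, NIST's accuracy criterion can be summarized as the "log relative error": roughly the number of leading decimal digits on which an estimate and the certified value agree. A minimal sketch, using the Beta_10 values from the table above:

```julia
# Log relative error (LRE): roughly the number of leading decimal digits
# on which an estimate agrees with the certified value.
lre(est, cert) = -log10(abs(est - cert) / abs(cert))

cert   = -0.0000402962525080   # certified Beta_10
r_est  = -0.0000402962500667   # R's estimate
qr_est =  0.0000028639148280   # Julia (QR) estimate

lre(r_est, cert)   # ≈ 7.2 digits of agreement
lre(qr_est, cert)  # ≈ -0.03, i.e. no digits of agreement
```

So R recovers about 7 digits here, while the Julia QR estimate agrees on none.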
Moreover,

```julia
using CSV, DataFrames, GLM, LinearAlgebra

df_filip = CSV.read("filip.csv", DataFrame);
model_filip_qr = lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr);
X = modelmatrix(model_filip_qr);
y = response(model_filip_qr);
rank(X)
# 10
rank(X'X)
# 5
```
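The gap between the two numerical ranks is expected in floating point: forming `X'X` squares the singular values, so values that sit just above `rank`'s tolerance for `X` fall below it for `X'X`. A synthetic sketch (the singular-value profile here is made up, not Filip's):

```julia
using LinearAlgebra

# Synthetic singular-value profile decaying over 20 orders of magnitude
s = 10.0 .^ (0:-2:-20)
A = Matrix(Diagonal(s))

# rank() counts singular values above ~ min(size(A)...) * eps() * σmax.
# Squaring the matrix makes the singular values decay twice as fast,
# so fewer of them survive the cutoff.
rank(A), rank(A' * A)   # the second rank is smaller
```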
It might be because of a limitation of LAPACK.
Now let's try the same with `BigFloat`:

```julia
using GenericLinearAlgebra

X = big.(modelmatrix(model_filip_qr))
rank(X)
# 11
rank(X'X)
# 11
```
and finally,

```julia
julia> inv(X'X) * X'y
11-element Vector{BigFloat}:
 -1467.4896101225341941
 -2772.1795838225322676
 -2316.3710745083318604
 -1127.9739373552426084
  -354.47823250481932410
   -75.1242014719792427
   -10.8753179947227183
    -1.0622149816811447
    -0.0670191151786945
    -0.0024678107718217
    -4.0296252319e-05
```
I don't think it's surprising that a coefficient is dropped, since the default is `dropcollinear = true` and

```julia
julia> cond(X)
1.7679681910558162e15
```
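For context on that condition number, here is a sketch with a synthetic degree-10 polynomial design matrix over roughly the Filip x-range (about -8.8 to -3.1; the real x values come from filip.csv, so the number below is only indicative). A condition number near 1/eps(Float64) ≈ 4.5e15 means Float64 has almost no accurate digits to spare, and forming the normal equations X'X squares it, which is why Cholesky breaks down:

```julia
using LinearAlgebra

x = range(-8.8, -3.1, length=82)    # synthetic stand-in for Filip's x values
V = [xi^p for xi in x, p in 0:10]   # degree-10 polynomial design matrix

cond(V)   # on the order of 1e15, close to 1/eps(Float64) ≈ 4.5e15
```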
It's surprising, though, that we error out when setting `dropcollinear = false`:

```julia
julia> lm(@formula(y~1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^10.0), df_filip; method=:qr, dropcollinear=false)
ERROR: RankDeficientException(10)
Stacktrace:
 [1] delbeta!(p::GLM.DensePredQR{Float64, LinearAlgebra.QRCompactWY}, r::Vector{Float64})
   @ GLM ~/.julia/dev/GLM/src/linpred.jl:66
 [2] fit!(obj::LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredQR{Float64, LinearAlgebra.QRCompactWY}})
   @ GLM ~/.julia/dev/GLM/src/lm.jl:97
 [3] fit(::Type{…}, f::FormulaTerm{…}, data::DataFrame, allowrankdeficient_dep::Nothing; wts::Nothing, dropcollinear::Bool, method::Symbol, contrasts::Dict{…})
   @ GLM ~/.julia/dev/GLM/src/lm.jl:0
 [4] fit
   @ ~/.julia/dev/GLM/src/lm.jl:149 [inlined]
 [5] lm
   @ ~/.julia/dev/GLM/src/lm.jl:179 [inlined]
 [6] top-level scope
   @ REPL[42]:1
Some type information was truncated. Use `show(err)` to see complete types.
```
The problem is the check in Line 66 in e2f6c98. If the user explicitly passes `dropcollinear = false`, then we shouldn't try to outsmart her by computing a numerical rank. We should just check that there isn't a zero element on the diagonal of `R`.
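A minimal sketch of what that check could look like (hypothetical helper, not GLM's actual code): with `dropcollinear = false`, only raise on an exact zero pivot, i.e. a zero on the diagonal of `R`.

```julia
using LinearAlgebra

# Hypothetical helper (not GLM's actual implementation): flag only an exact
# structural rank deficiency - a zero on the diagonal of R - instead of
# comparing against a numerical rank estimate.
function assert_exact_full_rank(R::AbstractMatrix)
    i = findfirst(iszero, diag(R))
    i === nothing || throw(LinearAlgebra.RankDeficientException(i))
    return nothing
end
```

For the Filip fit this check would pass, since none of R's diagonal entries is exactly zero.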
However, to my surprise, fixing this still isn't sufficient to match the reference estimates: LAPACK's pivoted QR appears to lose some precision relative to the unpivoted QR.
```julia
julia> qr(X, ColumnNorm()) \ df_filip.y
11-element Vector{Float64}:
  9.01342654713381
  1.6525459423443027
 -5.767606630529379
 -3.8636659613827122
 -0.6703658653069542
  0.1806043161686472
  0.10552342922775357
  0.0214449396217032
  0.0022774832947497475
  0.00012622643173822303
  2.8896433329375396e-6

julia> qr(X) \ df_filip.y
11-element Vector{Float64}:
 -1467.4896018652016
 -2772.179569812206
 -2316.3710641177045
 -1127.9739329347135
  -354.4782313151358
   -75.12420126161706
   -10.875317970213466
    -1.0622149798557083
    -0.06701911509851412
    -0.002467810770121917
    -4.029625231108671e-5
```
```julia
julia> Q, R = qr(X);

julia> diag(R)
11-element Vector{Float64}:
   -9.055385138137417
  -13.532654650686624
  -21.825229067114073
   30.328064942426202
   44.4823844079297
   61.77383426163982
   90.2629854477562
 -127.0559597329256
  186.65576017620813
  253.04777394199888
 -373.39818100600036

julia> rank(R)
10
```
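The resolution is that `rank` is not computed from the diagonal of R at all: it counts singular values above a relative tolerance, by default roughly `min(size(A)...) * eps() * σmax`. A tiny but nonzero pivot can therefore coexist with a deficient numerical rank. A minimal sketch:

```julia
using LinearAlgebra

A = triu(ones(3, 3))
A[3, 3] = 1e-20   # tiny but nonzero pivot

diag(A)   # no exact zeros on the diagonal
rank(A)   # 2: the smallest singular value falls below the tolerance
```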
I was also a bit surprised when I checked the diagonal elements of R: none of them is zero, yet `rank(R)` returns 10.
In fact, the `inv` function happily computes the inverse of `R`:

```julia
julia> Rinv = inv(R);

julia> Rinv * Rinv' * X'y
11-element Vector{Float64}:
 -1467.489589088171
 -2772.1795409834594
 -2316.371035859374
 -1127.973917018297
  -354.4782255940765
   -75.12419988749059
   -10.875317746475703
    -1.0622149554340137
    -0.06701911338591168
    -0.0024678107003603396
    -4.029625105615225e-5
```
which is again a little bit different from what is generated by `qr(X) \ y`.
The QR-based least squares solution is

```julia
julia> F = qr(X);

julia> F.R \ (F.Q'*df_filip.y)[1:size(F, 2)]
11-element Vector{Float64}:
 -1467.4896018652016
 -2772.179569812206
 -2316.3710641177045
 -1127.9739329347135
  -354.4782313151358
   -75.12420126161706
   -10.875317970213466
    -1.0622149798557083
    -0.06701911509851412
    -0.002467810770121917
    -4.029625231108671e-5
```
Oh. Now I realize what is going on in the pivoted case. There is a rank determination by default, so the relevant comparison is

```julia
julia> ldiv!(qr(X, ColumnNorm()), df_filip.y[:,:], 0.0)[1][1:size(X, 2)]
11-element Vector{Float64}:
 -1467.4896118403967
 -2772.1795879957995
 -2316.371078886713
 -1127.9739399822226
  -354.4782335067247
   -75.12420172645771
   -10.875318038405625
    -1.062214986693476
    -0.06701911554715413
    -0.002467810787511543
    -4.029625261329208e-5
```
where the third argument to `ldiv!` is the tolerance used when determining the numerical rank. To conclude, I think we just need to get rid of/adjust the check in Line 66 in e2f6c98 when `dropcollinear = false`.