JuliaStats / GLM.jl

Generalized linear models in Julia

Issue with collinearity detection on very simple linear model and dataset

mestinso opened this issue · comments

I ran into a surprising issue today when using GLM to fit a basic cubic polynomial; see my script and plot below. With my dataset, a quadratic fit works fine, but as soon as I move up to a cubic polynomial the intercept term is dropped (apparently it is detected as collinear). The good news is that setting dropcollinear=false works around the problem. Still, given such a basic dataset, I was very surprised to hit this in the first place. As another check, I passed the same dataset to R's lm function, and it handled it without any issue.

using Plots
using DataFrames
using GLM

x=[0.0,16.54252507,36.85132953,58.06647333,85.62460607,123.8759051,174.8138864,238.2577034,312.5741294,385.2595299,451.7838571,523.0189254,575.7680507,621.6300705,677.641035]
y=[2.802571697,2.607979564,2.403339032,2.202060006,2.010878422,1.813030653,1.60853479,1.400739363,1.209780073,1.012102908,0.800961457,0.603279074,0.405503624,0.204338282,0.0]
data = DataFrame(x=x, y=y)

plt = scatter(x,y; label="data")

ols1 = lm(@formula(y ~ x + x^2), data)
plot!(plt, x, predict(ols1); label="quadratic fit")

ols2 = lm(@formula(y ~ x + x^2 + x^3), data)
plot!(plt, x, predict(ols2); label="cubic fit")

ols2fixed = lm(@formula(y ~ x + x^2 + x^3), data; dropcollinear=false)
plot!(plt, x, predict(ols2fixed); label="fixed cubic fit")

[image: scatter plot of the data with the quadratic, cubic, and fixed cubic fits]

Results for ols2:
[image]

Results for ols2fixed:
[image]

Potentially related issues: #426 and #420
...while this may be related to those issues, I feel it warrants a new issue given the severity of the problem for such a simple and standard use case.

Ok, after reviewing #426 further, I think this is indeed a duplicate, sorry about that!

Also, I see that on the dev version QR is available as an alternative to Cholesky. I tried the following, and it does seem to resolve the issue without having to disable the collinearity check:

ols2fixedtest = lm(@formula(y ~ x + x^2 + x^3), data; method=:qr)
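A plausible reason QR behaves better here, sketched with the standard library only (this is my reading, not something confirmed in this thread): a Cholesky-based fit works from X'X, whose 2-norm condition number is the square of that of X, while a QR-based fit works from X directly. The design matrix below is built by hand for illustration; the exact tolerance GLM applies in its collinearity check is not shown, so treat the comparison with 1/eps only as a rough indicator.

```julia
using LinearAlgebra

# Predictor values copied from the script above.
x = [0.0, 16.54252507, 36.85132953, 58.06647333, 85.62460607, 123.8759051,
     174.8138864, 238.2577034, 312.5741294, 385.2595299, 451.7838571,
     523.0189254, 575.7680507, 621.6300705, 677.641035]

# Design matrix for y ~ 1 + x + x^2 + x^3, built by hand (no GLM needed).
X = hcat(ones(length(x)), x, x .^ 2, x .^ 3)

# The columns differ in scale by many orders of magnitude.
colnorms = [norm(c) for c in eachcol(X)]

condX   = cond(X)        # conditioning seen by a QR-based solver
condXtX = cond(X' * X)   # conditioning seen by a Cholesky-based solver

println("column norms: ", colnorms)
println("cond(X)   = ", condX)
println("cond(X'X) = ", condXtX)
println("1/eps     = ", 1 / eps(Float64))
```

If cond(X)^2 exceeds 1/eps(Float64), as it should for this x, then X'X is numerically indistinguishable from a singular matrix in double precision even though X itself has full column rank.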

Given the severity of the issue, I think it may make sense to make QR the default, to turn off dropcollinear by default, or to do something else entirely to improve out-of-the-box agreement with other software (R, Python, Excel, etc.). My line of thinking is that the current defaults are too conservative and that it is apparently too easy to get false positives in the collinearity check...
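Another possible workaround, not discussed above and offered only as a sketch, is to rescale x before building the polynomial terms. The rescaled variable z and the helper polydesign are hypothetical names introduced for this illustration; the fits use Julia's backslash operator on a hand-built design matrix rather than GLM.

```julia
using LinearAlgebra

# Data copied from the script above.
x = [0.0, 16.54252507, 36.85132953, 58.06647333, 85.62460607, 123.8759051,
     174.8138864, 238.2577034, 312.5741294, 385.2595299, 451.7838571,
     523.0189254, 575.7680507, 621.6300705, 677.641035]
y = [2.802571697, 2.607979564, 2.403339032, 2.202060006, 2.010878422,
     1.813030653, 1.60853479, 1.400739363, 1.209780073, 1.012102908,
     0.800961457, 0.603279074, 0.405503624, 0.204338282, 0.0]

# Hypothetical rescaled predictor: z lies in [0, 1].
z = x ./ maximum(x)

# Hypothetical helper: cubic polynomial design matrix with an intercept column.
polydesign(v) = hcat(ones(length(v)), v, v .^ 2, v .^ 3)

X  = polydesign(x)
Xz = polydesign(z)

println("cond(X'X)   = ", cond(X' * X))
println("cond(Xz'Xz) = ", cond(Xz' * Xz))

# Least-squares fits via backslash, which does not form X'X.
beta  = X \ y
betaz = Xz \ y

# Both parameterizations describe the same cubic, so the fitted values should agree.
println("max difference in fitted values: ", maximum(abs.(X * beta .- Xz * betaz)))
```

In GLM terms this would presumably correspond to adding a rescaled column to the DataFrame and fitting the same cubic formula in that column. I have not checked that against the collinearity check here, and the coefficients would then refer to z rather than x.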