JuliaStats / GLM.jl

Generalized linear models in Julia


PosDefException with simple linear regression when x/y ranges are small

mortenpi opened this issue

I'm getting a PosDefException: matrix is not positive definite; Cholesky factorization failed when doing a simple lm(@formula(y ~ x), df), when the range of the x and y values is tiny compared to their absolute values. It should be possible to replicate with this dataset:

using DataFrames, GLM

# δ returns a periodic value in [-1, 0], used to jitter the points
# off an exact line
δ(n; q=3) = mod(n, q)/(q-1) - 1

# Large offsets (x0, y0) with tiny spreads (Δx, Δy) around them
function dataset(N)
    x0, Δx, δx = 13.5, 1e-8, 1e-8
    y0, Δy, δy = 200, 1e-7, 2e-7
    DataFrame(
        x = x0 .+ Δx .* (-N:N) .+ δx .* δ.(1:2N+1; q=3),
        y = y0 .+ Δy .* (-N:N) .+ δy .* δ.(3:2N+3; q=5),
    )
end
df = dataset(10)
lm(@formula(y ~ x), df)
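On an affected GLM.jl version, the final call above fails with an error along these lines (quoted from later in this thread; the exact wording may vary):

ERROR: PosDefException: matrix is not positive definite; Cholesky factorization failed.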

I have a full MWE here as a gist.

I can see that this is numerically an edge case, but I feel it should ideally just work. For my purposes, I can work around the issue fairly easily by shifting and scaling the x and y values, fitting, and then transforming the coefficients back:

function linreg(df::DataFrame, x::Symbol, y::Symbol)
    xs, ys = df[:, x], df[:, y]
    xmin, xmax = extrema(xs)
    ymin, ymax = extrema(ys)
    # Rescale both variables to [0, 1] before fitting
    _df = DataFrame(x=(xs .- xmin) ./ (xmax - xmin), y=(ys .- ymin) ./ (ymax - ymin))
    m = lm(@formula(y ~ x), _df)
    _b, _k = coef(m)
    # Transform the intercept and slope back to the original scale
    k = _k * (ymax - ymin) / (xmax - xmin)
    b = _b * (ymax - ymin) + ymin - xmin * k
    return b, k
end
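For reference, a hypothetical call on the dataset above (names as in this thread):

b, k = linreg(df, :x, :y)   # intercept and slope on the original scale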

Seems to give a reasonable result:

[figure: fit]

I think this is one of the relatively rare cases where you need the extra precision that the QR factorization provides relative to Cholesky. The functionality for using QR is there, but we don't expose it in a nice way, so let's keep this issue open to track progress on that.
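To illustrate the difference with a minimal sketch in plain LinearAlgebra (not GLM.jl's internals): Cholesky solves the normal equations X'X β = X'y, and forming X'X squares the condition number of the design matrix, while QR factorizes X directly and avoids that squaring.

using LinearAlgebra

x = 13.5 .+ 1e-8 .* (1:20)   # tiny spread around a large offset, as in the dataset above
X = [ones(20) x]             # design matrix for y ~ 1 + x
y = 200 .+ 1e-7 .* (1:20)

β_qr = qr(X) \ y                            # QR works on X itself and stays accurate
try
    cholesky(Symmetric(X' * X)) \ (X' * y)  # may throw PosDefException
catch err
    @show err
end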

Would it make sense for the logic to be "try Cholesky; if it errors, try QR"? I.e., so that the user wouldn't have to do anything?

So far we have aimed for type stability, and that wouldn't be possible with the approach you are suggesting, since the factorization type stored in the returned model would then depend on runtime values. I'd prefer that we just provide an easy way to switch to QR, and we could also consider suggesting the QR version in the error message thrown when the Cholesky fails.
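For later readers: newer GLM.jl releases do expose this switch as a method keyword; verify against the documentation of your installed version before relying on it.

lm(@formula(y ~ x), df; method=:qr)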

Hi, sorry for reviving this. I am having a similar issue with LinearBinaryClassifier, and I am really lost as to why. The error I get when I try to fit the machine (I am using MLJ) is the same as described here:

ERROR: LoadError: TaskFailedException:
PosDefException: matrix is not positive definite; Cholesky factorization failed.

My dataset has 100 rows and 3500 columns, and all features are of similar magnitude. The minimum value across the entire X array is -0.016700 and the maximum is 0.354000, so I wouldn't say the ranges are small compared to the absolute values.

Any idea of why this could be happening?

EDIT: I think I know what the reason might be now. Since my X dataset contains infrared absorbances at several (3500) wavenumbers, some pairs of wavenumbers happen to have exactly the same absorbance for all samples, making them essentially the same variable. This has been solved in a few other issues by setting the third argument of lm() to true (see the sketch after the snippet below).

However, since my model is embedded inside a pipeline, my implementation is to wrap the pipeline in a machine and use fit!().

Is there a way of allowing collinearity in this context?

using MLJ

@load LinearBinaryClassifier pkg=GLM
LinearBinaryClassifierPipe = @pipeline(Standardizer(),
                                       OneHotEncoder(drop_last = true),
                                       LinearBinaryClassifier())
LogisticModel = machine(LinearBinaryClassifierPipe, X, yc)
fit!(LogisticModel)
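To make the collinearity point above concrete, here is a minimal sketch using GLM.jl directly, outside MLJ; the dropcollinear keyword is the later spelling of the old third positional argument, and whether the plain call throws depends on the GLM.jl version:

using DataFrames, GLM

df2 = DataFrame(x1 = randn(100), y = randn(100))
df2.x2 = df2.x1   # an identical column, like two wavenumbers with the same absorbance

# With collinearity handling enabled, the redundant column is pivoted out
# instead of the Cholesky factorization failing
m = lm(@formula(y ~ x1 + x2), df2; dropcollinear = true)
coef(m)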

@casasgomezuribarri You have many more variables than observations, so it doesn't make sense to fit a GLM. You probably want to apply some kind of dimensionality-reduction algorithm first, like partial least squares.
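As a hedged sketch of what that could look like in the same pipeline style (PCA from MultivariateStats stands in as the reduction step, since it has a readily loadable MLJ model; the partial least squares suggestion above would need its own MLJ wrapper, and maxoutdim = 20 is an arbitrary illustrative choice):

using MLJ

@load PCA pkg=MultivariateStats
@load LinearBinaryClassifier pkg=GLM

ReducedPipe = @pipeline(Standardizer(),
                        PCA(maxoutdim = 20),
                        LinearBinaryClassifier())
mach = machine(ReducedPipe, X, yc)
fit!(mach)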