Propagate missing values from data when returning vectors

nalimilan opened this issue 10 months ago · comments

Milan Bouchet-Valat commented 10 months ago

When a model has been fitted on data containing missing values that were dropped, predict(model), residuals(model), fitted(model) and response(model) should return a vector of the same length as the data, with missing in the corresponding entries. It would make sense to change this before releasing GLM 2.0, even if a more general fix is implemented later in StatsModels (JuliaStats/StatsModels.jl#202).
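The length mismatch described here can be reproduced with a small sketch. It assumes GLM.jl 1.x and DataFrames.jl are installed; the toy data and the variable names (`df`, `m`, `n_rows` and so on) are made up for illustration.

```julia
using DataFrames, GLM

# Toy data (made up): the third row has a missing predictor.
df = DataFrame(x = [1.0, 2.0, missing, 4.0, 5.0],
               y = [1.2, 1.9, 3.1, 4.2, 4.8])

# Rows with missing values are dropped before fitting.
m = lm(@formula(y ~ x), df)

n_rows    = nrow(df)               # rows in the original data
n_used    = Int(nobs(m))           # rows actually used for fitting
len_pred  = length(predict(m))     # follows the rows used, not the data
len_resid = length(residuals(m))   # same

println((n_rows, n_used, len_pred, len_resid))
```

With this data the returned vectors should have one entry per complete row (4), so they cannot be attached to the 5-row `df` without further bookkeeping.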

See JuliaStats/MixedModels.jl#650 for a similar issue in MixedModels.

Alex Arslan · Answer 1 · Fri Aug 25 2023 07:47:43 GMT+0800 (China Standard Time)

Would we do the same for modelmatrix? If not, the dimensions of response and modelmatrix would disagree, which seems pretty awful. If so, crossmodelmatrix is modelmatrix(model)' * modelmatrix(model), which becomes pretty useless if modelmatrix includes missing values...
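A minimal sketch in plain Julia of why a cross-product built from a padded model matrix is of little use: a single missing entry propagates through every sum it takes part in. The matrix `X` is made up for illustration and is not produced by GLM.

```julia
# Made-up 3×2 "model matrix" with an intercept column and one missing entry.
X = Union{Missing, Float64}[1.0 2.0;
                            1.0 missing;
                            1.0 4.0]

# Entry (i, j) of X'X is the sum of products of columns i and j.
padded_cross = [sum(X[:, i] .* X[:, j]) for i in 1:2, j in 1:2]

# Keeping only complete rows gives a usable cross-product.
complete = [!any(ismissing, row) for row in eachrow(X)]
Xc = Float64.(X[complete, :])
complete_cross = Xc' * Xc

println(padded_cross)
println(complete_cross)
```

Every entry of `padded_cross` that involves the second column is `missing`, while `complete_cross` is an ordinary `Float64` matrix.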

Alex Arslan · Answer 2 · Fri Aug 25 2023 07:55:42 GMT+0800 (China Standard Time)

This also means that response(model) et al. won't have length nobs(model) (for unweighted models).

Bogumił Kamiński · Answer 3 · Mon Aug 28 2023 00:04:14 GMT+0800 (China Standard Time)

A related issue (if a separate issue should be opened for this, please let me know, but I think this should be resolved along with the original issue, with a consistent approach):

julia> lm(@formula(x1~lag(x2)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64

(the issue is that, if I understand things correctly, lagging is applied after rows with missing values have been dropped)

More generally e.g.:

julia> f(x) = x > 0.5 ? 1 : missing
f (generic function with 1 method)

julia> lm(@formula(x1~f(x2)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64
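The ordering problem can be illustrated without GLM. The sketch below is plain Julia and only mimics the suspected sequence (drop missing rows first, transform afterwards); it is not the actual StatsModels code path, and the vector `x2` is made up.

```julia
f(x) = x > 0.5 ? 1 : missing

x2 = [0.2, 0.9, missing, 0.7]

# Step 1: rows with missing values are dropped.
kept = collect(skipmissing(x2))

# Step 2: the transformation runs afterwards and introduces new missing values.
transformed = f.(kept)

# Step 3: building a Float64 column from the result then fails.
err = try
    convert(Vector{Float64}, transformed)
    nothing
catch e
    e
end

println(transformed)
println(typeof(err))
```

Because the new missing value appears only after the complete-case filter has run, nothing removes it before the conversion to `Float64`.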

Milan Bouchet-Valat · Answer 4 · Mon Aug 28 2023 01:13:04 GMT+0800 (China Standard Time)

I think that's JuliaStats/StatsModels.jl#202. It's a bit harder to fix.

Milan Bouchet-Valat · Answer 5 · Mon Aug 28 2023 01:19:00 GMT+0800 (China Standard Time)

> Would we do the same for modelmatrix? If not, the dimensions of response and modelmatrix would disagree, which seems pretty awful. If so, crossmodelmatrix is modelmatrix(model)' * modelmatrix(model), which becomes pretty useless if modelmatrix includes missing values...

Good point. We could say that functions related to the fitting procedure, like modelmatrix and crossmodelmatrix, never include missing values. I admit that's a bit ad hoc, but it's what R does when using na.exclude (i.e. it only pads residuals and predicted values with missing values).
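As a rough sketch of what na.exclude-style padding would involve, the helper below scatters values computed on the complete rows back into a vector as long as the original data. `pad_with_missing` and the `used` mask are hypothetical names introduced for illustration; neither is part of GLM.jl or StatsModels.jl, and the numbers are made up.

```julia
# Hypothetical helper, not part of GLM.jl or StatsModels.jl.
# `used[i]` is true when row i of the original data was used for fitting.
function pad_with_missing(values::AbstractVector{T},
                          used::AbstractVector{Bool}) where {T}
    length(values) == count(used) ||
        throw(DimensionMismatch("`values` needs one entry per `true` in `used`"))
    out = Vector{Union{Missing, T}}(missing, length(used))
    out[used] = values   # fill only the rows that were used
    return out
end

used   = [true, true, false, true, true]   # row 3 was dropped
resid  = [0.1, -0.2, 0.3, -0.2]            # made-up residuals for the 4 used rows
padded = pad_with_missing(resid, used)

println(padded)
```

The fitting-related quantities (model matrix, cross-product) would stay untouched under this scheme; only the per-observation outputs are padded.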

Bogumił Kamiński · Answer 6 · Mon Aug 28 2023 02:16:29 GMT+0800 (China Standard Time)

> It's a bit harder to fix.

I agree that it is probably harder, but my point was that we should also keep this case in mind when developing a solution (as it might be affected by that requirement).

Alex Arslan · Answer 7 · Tue Aug 29 2023 01:17:35 GMT+0800 (China Standard Time)

> functions related to the fitting procedure

That arguably includes response, residuals, and fitted... probably everything but predict. "Related to the fitting procedure" may also depend on the kind of model and wouldn't necessarily be the same between GLM and another statistical modeling package.

IMO the current behavior that these functions return the objects/quantities as actually used by the model is the right one, and the case of including any missing values that were present in the input is better handled separately. I actually think that the existing predict(model, data), where data is the original input that includes missing values, is a sensible solution to getting back fitted values that include missing: you're asking for the model's output when given that input, which is inherently different from what the model used for fitting (in the presence of missing values).
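The predict(model, data) route can be sketched as follows, assuming GLM.jl 1.x and DataFrames.jl and the same kind of made-up toy data as elsewhere in this thread. The behavior shown (a `missing` prediction for rows with a missing predictor) is what I would expect from current releases, not something guaranteed by this discussion.

```julia
using DataFrames, GLM

# Toy data (made up): the third row has a missing predictor.
df = DataFrame(x = [1.0, 2.0, missing, 4.0, 5.0],
               y = [1.2, 1.9, 3.1, 4.2, 4.8])

m = lm(@formula(y ~ x), df)

# Predictions for the rows used in fitting only.
p_fit = predict(m)

# Predictions for the original data, one entry per input row.
p_data = predict(m, df)

println(length(p_fit))
println(p_data)
```

Here `p_data` lines up with the rows of `df`, so it can be stored as a new column, while `p_fit` keeps describing only what the model used for fitting.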