JuliaStats / GLM.jl

Generalized linear models in Julia

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Propagate missing values from data when returning vectors

nalimilan opened this issue · comments

When a model was fitted on data containing missing values which have been dropped, predict(model), residuals(model), fitted(model) and response(model) should return a vector of the same length as the data with missing for corresponding entries. It would make sense to change this before releasing GLM 2.0, even if a more general fix is implemented later in StatsModels (JuliaStats/StatsModels.jl#202).

See JuliaStats/MixedModels.jl#650 for a similar issue in MixedModels.

Would we do the same for modelmatrix? If not, the dimensions of response and modelmatrix would disagree, which seems pretty awful. If so, crossmodelmatrix is modelmatrix(model)' * modelmatrix(model), which becomes pretty useless if modelmatrix includes missing values...

This also means that response(model) et al. won't have length nobs(model) (for unweighted models).

A related issue (if a separate issue should be opened for this please let me know, but I think this should be revolved along with the original issue with a consistent approach):

julia> lm(@formula(x1~lag(x2)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64

(the issue is that, if I understand things correctly, lagging is applied after dropping of rows with missing is performed)

More generally e.g.:

julia> f(x) = x > 0.5 ? 1 : missing
f (generic function with 1 method)

julia> lm(@formula(x1~f(x2)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64

I think that's JuliaStats/StatsModels.jl#202. It's a bit harder to fix.

Would we do the same for modelmatrix? If not, the dimensions of response and modelmatrix would disagree, which seems pretty awful. If so, crossmodelmatrix is modelmatrix(model)' * modelmatrix(model), which becomes pretty useless if modelmatrix includes missing values...

Good point. We could say that functions related to the fitting procedure, like modelmatrix and crossmodelmatrix never include missing values. But I admit that's a bit ad hoc, but that's what R does when using na.exclude (i.e. only pad residuals and predicted values with missing values).

It's a bit harder to fix.

I agree that it is probably harder, but my point was that we should keep also this case in mind when developing a solution (as it might be affected by that requirement).

functions related to the fitting procedure

That arguably includes response, residuals, and fitted... probably everything but predict. "Related to the fitting procedure" may also depend on the kind of model and wouldn't necessarily be the same between GLM and another statistical modeling package.

IMO the current behavior that these functions return the objects/quantities as actually used by the model is the right one, and the case of including any missing values that were present in the input is better handled separately. I actually think that the existing predict(model, data), where data is the original input that includes missing values, is a sensible solution to getting back fitted values that include missing: you're asking for the model's output when given that input, which is inherently different from what the model used for fitting (in the presence of missing values).