Propagate missing values from data when returning vectors
nalimilan opened this issue
When a model was fitted on data containing missing values which have been dropped, `predict(model)`, `residuals(model)`, `fitted(model)` and `response(model)` should return a vector of the same length as the data with `missing` for corresponding entries. It would make sense to change this before releasing GLM 2.0, even if a more general fix is implemented later in StatsModels (JuliaStats/StatsModels.jl#202).
See JuliaStats/MixedModels.jl#650 for a similar issue in MixedModels.
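The padding requested above can be sketched in plain Julia, independent of GLM internals. This is a minimal illustration under stated assumptions, not how GLM or StatsModels implement it: `pad_with_missing` and `complete` are hypothetical names, and the numbers are made up.

```julia
# Hypothetical helper (not part of GLM or StatsModels): expand a vector computed
# on the complete rows back to the length of the original data.
function pad_with_missing(values::AbstractVector{T}, complete::AbstractVector{Bool}) where {T}
    count(complete) == length(values) ||
        throw(DimensionMismatch("number of complete rows must match length(values)"))
    out = Vector{Union{Missing, T}}(missing, length(complete))
    out[complete] = values   # rows that were dropped stay `missing`
    return out
end

# Illustrative data: row 3 has a missing predictor and would be dropped before fitting.
x = [1.0, 2.0, missing, 4.0]
complete = .!ismissing.(x)

# Stand-in for values computed on the three complete rows (made-up numbers).
fitted_complete = [1.1, 1.9, 4.2]

padded = pad_with_missing(fitted_complete, complete)
```

Here `padded` has the same length as `x`, with `missing` in the position of the dropped row, which is the shape the issue asks `fitted(model)` and friends to return.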
Would we do the same for `modelmatrix`? If not, the dimensions of `response` and `modelmatrix` would disagree, which seems pretty awful. If so, `crossmodelmatrix` is `modelmatrix(model)' * modelmatrix(model)`, which becomes pretty useless if `modelmatrix` includes missing values...
This also means that `response(model)` et al. won't have length `nobs(model)` (for unweighted models).
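To make the concern concrete, here is a small sketch in plain Julia with a made-up matrix. The helper `crossprod` is hypothetical and written out explicitly (it computes the same entries as `A' * A` for a real matrix) so that the propagation of `missing` is easy to follow; it is not GLM's `crossmodelmatrix`.

```julia
# Hypothetical helper: entry (i, j) is the sum over rows of column i times column j.
crossprod(A) = [sum(A[:, i] .* A[:, j]) for i in axes(A, 2), j in axes(A, 2)]

# Made-up 4×2 model matrix (intercept + one predictor); row 3 is padded with `missing`.
X_padded = Union{Missing, Float64}[1.0 1.0; 1.0 2.0; missing missing; 1.0 4.0]

# Every entry sums over all rows, so a single missing row turns each entry into `missing`.
XtX_padded = crossprod(X_padded)

# The matrix restricted to complete rows, i.e. what would actually enter the fit.
complete = [!any(ismissing, row) for row in eachrow(X_padded)]
X = Float64.(X_padded[complete, :])
XtX = crossprod(X)
```

The padded matrix has four rows while only three rows are complete, which is the length mismatch with `nobs(model)` mentioned above.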
A related issue (if a separate issue should be opened for this, please let me know, but I think this should be resolved along with the original issue using a consistent approach):
julia> lm(@formula(x1~lag(x2)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64
(The issue, if I understand things correctly, is that lagging is applied after rows with `missing` have been dropped.)
More generally, e.g.:
julia> f(x) = x > 0.5 ? 1 : missing
f (generic function with 1 method)
julia> lm(@formula(x1~f(x2)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64
I think that's JuliaStats/StatsModels.jl#202. It's a bit harder to fix.
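The failure mode can be mimicked outside the formula machinery. The following is a minimal sketch, assuming (as the comment above suggests) that rows with `missing` are dropped first and the function in the formula is applied afterwards. The data and the three steps are illustrative only and are not StatsModels internals.

```julia
# Same function as in the comment above: returns `missing` for small inputs.
f(x) = x > 0.5 ? 1 : missing

# Made-up column with one missing entry.
x2 = [0.9, missing, 0.2, 0.7]

# Step 1: drop rows with missing values in the raw data.
x2_complete = collect(skipmissing(x2))

# Step 2: apply the transformation afterwards; this creates a new `missing`.
transformed = f.(x2_complete)

# Step 3: building a Float64 column fails, because nothing removes the new `missing`.
err = try
    convert(Vector{Float64}, transformed)
    nothing
catch e
    e
end
```

In this sketch `err` ends up holding a `MethodError`, the same error type as in the REPL output above.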
> Would we do the same for `modelmatrix`? If not, the dimensions of `response` and `modelmatrix` would disagree, which seems pretty awful. If so, `crossmodelmatrix` is `modelmatrix(model)' * modelmatrix(model)`, which becomes pretty useless if `modelmatrix` includes missing values...
Good point. We could say that functions related to the fitting procedure, like `modelmatrix` and `crossmodelmatrix`, never include missing values. I admit that's a bit ad hoc, but it's what R does when using `na.exclude` (i.e. only pad residuals and predicted values with missing values).
> It's a bit harder to fix.
I agree that it is probably harder, but my point was that we should also keep this case in mind when developing a solution (as the solution might be affected by that requirement).
> functions related to the fitting procedure
That arguably includes `response`, `residuals`, and `fitted`... probably everything but `predict`. "Related to the fitting procedure" may also depend on the kind of model and wouldn't necessarily be the same between GLM and another statistical modeling package.
IMO the current behavior, in which these functions return the objects/quantities as actually used by the model, is the right one, and the case of including any missing values that were present in the input is better handled separately. I actually think that the existing `predict(model, data)`, where `data` is the original input that includes missing values, is a sensible solution to getting back fitted values that include missing: you're asking for the model's output when given that input, which is inherently different from what the model used for fitting (in the presence of missing values).