find_data.svyglm implementation causing problems with missing data
tzoltak opened this issue
If any variable used in a model has missing values, the model is estimated on a subset of the supplied data, restricted to the observations without missing values. However, the find_data() S3 method for svyglm models returns the whole dataset, with no such subsetting applied. This causes trouble in margins(); see design in margins.svyglm.
## load package
library(margins)
library(survey)
data(api)
## code goes here
dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
m <- svyglm(growth ~ target, dstrat)
margins(m, design = dstrat)
## session info for your system
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
[5] LC_TIME=Polish_Poland.1250
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] survey_3.36 survival_2.44-1.1 Matrix_1.2-17 margins_0.3.25
[5] prediction_0.3.14
loaded via a namespace (and not attached):
[1] MASS_7.3-51.4 compiler_3.6.0 DBI_1.0.0 tools_3.6.0
[5] splines_3.6.0 data.table_1.12.2 lattice_0.20-38 mitools_2.4
A possible solution is to change line 109 in "find_data.R" to:
data <- model.frame(model)
since model.frame() returns exactly the data on which the model was estimated (and works well with svyglm models). Incidentally, it may also reduce memory usage because, in contrast to model[["data"]], it returns only the columns that are used in the model.
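The difference between the two accessors can be illustrated without the survey package. The following is only a minimal sketch on a plain glm fit; the data frame `d` and its column `unused` are made up for illustration, and the sketch assumes that svyglm objects behave the same way here, as described above.

```r
# Toy data (made up): rows 2 and 3 each contain a missing value
d <- data.frame(
  y = c(1, 2, NA, 4, 6, 5),
  x = c(1, NA, 3, 4, 5, 6),
  unused = letters[1:6]
)
fit <- glm(y ~ x, data = d)

full <- fit[["data"]]     # the data exactly as passed to glm()
used <- model.frame(fit)  # only the rows and columns used in estimation

nrow(full)   # all rows, including those with missing values
nrow(used)   # complete cases only
names(used)  # only the variables that appear in the formula
```

Comparing nrow(full) with nrow(used) shows the mismatch that find_data() currently passes on to margins().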
(However, making margins() work in this situation also requires resolving design in margins.svyglm itself.)
I looked a little closer at "find_data.R", specifically at find_data.default, and now understand why it worked for glm but not for svyglm: find_data.svyglm omits the part that handles na.action. Handling of subset should be there as well (though without the second step of searching the parent frame). So I think find_data.svyglm should look like this:
find_data.svyglm <- function(model, ...) {
    data <- model[["data"]]
    # handle subset
    if (!is.null(model[["call"]][["subset"]])) {
        subs <- try(eval(model[["call"]][["subset"]], data), silent = TRUE)
        if (inherits(subs, "try-error")) {
            subs <- TRUE
            warning("'find_data()' cannot locate variable(s) used in 'subset'")
        }
        data <- data[subs, , drop = FALSE]
    }
    # handle na.action
    if (!is.null(model[["na.action"]])) {
        data <- data[-model[["na.action"]], , drop = FALSE]
    }
    data
}
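The order of the two steps matters, because the indices stored in na.action are positions within the data after subset has been applied. Below is a minimal sketch that checks this on a plain glm fit rather than on svyglm. The helper name `restrict_to_estimation_rows` and the toy data (including the `keep` column) are hypothetical and used only for illustration.

```r
# Hypothetical helper mirroring the two steps above: subset first, then na.action
restrict_to_estimation_rows <- function(model) {
  data <- model[["data"]]
  subset_expr <- model[["call"]][["subset"]]
  if (!is.null(subset_expr)) {
    data <- data[eval(subset_expr, data), , drop = FALSE]
  }
  if (!is.null(model[["na.action"]])) {
    # na.action holds row positions relative to the already subsetted data
    data <- data[-model[["na.action"]], , drop = FALSE]
  }
  data
}

# Toy data (made up): row 1 is excluded by subset, rows 2 and 3 contain NAs
d <- data.frame(
  y = c(1, 2, NA, 4, 6, 5, 7, 9),
  x = c(1, NA, 3, 4, 5, 6, 7, 8),
  keep = c(FALSE, rep(TRUE, 7))
)
fit <- glm(y ~ x, data = d, subset = keep)

kept <- restrict_to_estimation_rows(fit)

# The recovered rows should match the rows of the model frame
identical(rownames(kept), rownames(model.frame(fit)))
```

Unlike model.frame(), this approach keeps every column of the original data, which may matter if later steps need variables that are not part of the formula.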