leeper / prediction

Tidy, Type-Safe 'prediction()' Methods

Home Page: https://cran.r-project.org/package=prediction

find_data.svyglm implementation causing problems with missing data

tzoltak opened this issue

When variables used in a model contain missing values, the model is estimated on the subset of the supplied data that has no missing values in those variables. However, the find_data() S3 method for svyglm models returns the whole dataset, without applying that subsetting. This causes trouble in margins(); see "design in margins.svyglm".

## load package
library(margins)
library(survey)
data(api)

## code goes here
dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
m <- svyglm(growth ~ target, dstrat)
margins(m, design = dstrat)

## session info for your system
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250   
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
[5] LC_TIME=Polish_Poland.1250    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] survey_3.36       survival_2.44-1.1 Matrix_1.2-17     margins_0.3.25   
[5] prediction_0.3.14

loaded via a namespace (and not attached):
[1] MASS_7.3-51.4     compiler_3.6.0    DBI_1.0.0         tools_3.6.0      
[5] splines_3.6.0     data.table_1.12.2 lattice_0.20-38   mitools_2.4    

A solution could be to change line 109 in "find_data.R" to:

data <- model.frame(model)

because model.frame() returns exactly the data on which the model was estimated (and it works well with svyglm models). It may also help reduce memory usage because, in contrast to model[["data"]], it returns only the columns that are used in the model.
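The difference between the two accessors can be seen in a minimal sketch. It deliberately uses base R's glm() on a made-up data frame instead of svyglm(), so that it runs without the survey package; the data frame `d` and its columns are assumptions for illustration only.

```r
# Toy data (hypothetical): rows 2 and 3 have a missing value in a model variable,
# and column z is not used in the model at all.
d <- data.frame(
  y = c(1, 2, NA, 4, 5, 6),
  x = c(1, NA, 3, 4, 5, 7),
  z = letters[1:6]
)

fit <- glm(y ~ x, data = d)

# The stored data is the full input: all rows, all columns.
nrow(fit[["data"]])          # 6
names(fit[["data"]])         # "y" "x" "z"

# The model frame holds only the rows and columns used in estimation.
mf <- model.frame(fit)
nrow(mf)                     # 4
names(mf)                    # "y" "x"

# na.action records the positions of the dropped rows.
as.integer(fit[["na.action"]])   # 2 3
```

The row counts diverge as soon as any model variable contains a missing value, which is the mismatch described above.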

(However, making margins() work in this situation also requires resolving "design in margins.svyglm" itself.)

I looked a little closer at "find_data.R", specifically find_data.default, and now understand why this works for glm but not for svyglm: find_data.svyglm omits the part that handles na.action. Handling of subset should be there as well (though without the second step of searching the parent frame). So I think find_data.svyglm should look like this:

find_data.svyglm <- function(model, ...) {
    data <- model[["data"]]
    # handle subset
    if (!is.null(model[["call"]][["subset"]])) {
        subs <- try(eval(model[["call"]][["subset"]], data), silent = TRUE)
        if (inherits(subs, "try-error")) {
            subs <- TRUE
            warning("'find_data()' cannot locate variable(s) used in 'subset'")
        }
        data <- data[subs, , drop = FALSE]
    }
    # handle na.action
    if (!is.null(model[["na.action"]])) {
        data <- data[-model[["na.action"]], , drop = FALSE]
    }
    data
}
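As a rough check of the logic above, the same two steps (subset first, then na.action) can be applied to a base R glm() fit and compared against model.frame(). This is only a sketch: the helper name `estimation_rows()` and the toy data are hypothetical, glm() stands in for svyglm() so that the survey package is not required, and the sketch omits the try()/warning fallback for an unresolvable subset.

```r
# Hypothetical helper mirroring the proposed steps: subset, then na.action.
estimation_rows <- function(model) {
  data <- model[["data"]]
  # handle subset: evaluate the subset expression inside the stored data
  subset_call <- model[["call"]][["subset"]]
  if (!is.null(subset_call)) {
    keep <- eval(subset_call, data)
    data <- data[keep, , drop = FALSE]
  }
  # handle na.action: its indices are positions within the subsetted data
  if (!is.null(model[["na.action"]])) {
    data <- data[-model[["na.action"]], , drop = FALSE]
  }
  data
}

# Toy data (hypothetical): rows 2 and 3 contain missing values.
d <- data.frame(
  y = c(1, 2, NA, 4, 5, 6),
  x = c(1, NA, 3, 4, 5, 7),
  z = letters[1:6]
)

# Row 1 is removed by subset; rows 2 and 3 are removed by na.action.
fit <- glm(y ~ x, data = d, subset = z != "a")

rows <- estimation_rows(fit)
rownames(rows)                 # "4" "5" "6"
rownames(model.frame(fit))     # "4" "5" "6"
```

The order of the two steps matters: na.action stores positions relative to the data that remain after subsetting, so applying it first would drop the wrong rows.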