Suggestion for data sets with missing data

Question

Irazall opened this issue 3 years ago

Chris-Gabriel Islam commented 3 years ago

Dear developers,

First, thanks for this great package. I am using it for my PhD thesis. I found some behaviour that is very counterintuitive and suggest changing it. This "bug" occurred while I was running a logistic regression on data with missing values. My McFadden R^2 fell, and it took me a while to figure out why. I have attached a reproducible example. It shows that the R^2 differs depending on whether the model is fitted on a data set that contains missing values or on the same data set after removing them.
This is counterintuitive because glm() does not distinguish between these two data sets: it already drops all observations with missing data. So the McFadden R^2 should not change either. Mathematically, the difference arises because observations with missing data are handled differently in the full model and in the empty (intercept-only) model. I therefore suggest applying complete.cases() before calculating the log-likelihood of the two models, which would make the result more intuitive.
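To make the argument concrete, here is a sketch of the formula. It assumes that the full model is fitted on the complete cases only (index set C), while the empty model is refitted on every row where the response is observed (index set R, with C a subset of R). The set names and the per-observation log-likelihood terms are notation introduced here for illustration and are not taken from the package.

```latex
% McFadden's pseudo R^2 in its standard form
R^2_{\mathrm{McF}} = 1 - \frac{\ln L(M_{\mathrm{full}})}{\ln L(M_{\mathrm{null}})}

% If the two models are fitted on different rows (assumption: C \subseteq R),
% the log-likelihoods are sums over different index sets:
R^2_{\mathrm{McF}}
  = 1 - \frac{\sum_{i \in C} \ell_i(\hat{\beta})}{\sum_{i \in R} \ell_i(\hat{\beta}_0)},
  \qquad \ell_i \le 0 .

% Every extra term in the denominator makes it more negative, so the ratio
% shrinks and R^2 is pushed upwards even if the predictors explain nothing.
```

Under this assumption the two values only agree when C = R, that is, when both models see the same rows.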
Let me know what you think about this suggestion. Thank you in advance!

Please find my reproducible example below. I hope it is done correctly, as I am new to reprex.

# Delete environment
rm(list = ls())

# Package names
packages <- c("ISLR", "blorr")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Packages loading
invisible(lapply(packages, library, character.only = TRUE))

# set seed for reproducibility
set.seed(176)

# remove columns not needed for regression
dataset <- subset(Smarket, select = -c(Year, Today))

# define function that creates NAs and execute it
createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}
dataset <- createNAs(dataset, 0.1)

# do first regression without complete cases
glm.fit1 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = dataset, family = binomial)
blr_rsq_mcfadden(glm.fit1)
#> [1] 0.4616006

# do second regression with complete cases
dataset <- dataset[complete.cases(dataset), ]
glm.fit2 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = dataset, family = binomial)
blr_rsq_mcfadden(glm.fit2)
#> [1] 0.003519264

# NOTE THAT THERE IS A DIFFERENCE BETWEEN THE TWO MC FADDEN R^2!
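
The suggestion can be sketched in base R as follows. This is only an illustration of the idea, not blorr's implementation: `mcfadden_same_rows` is a hypothetical helper, and the simulated data frame `d` stands in for the Smarket example so that the snippet runs without extra packages. It refits the intercept-only model on `model.frame(fit)`, which contains exactly the rows that glm() kept after dropping missing values.

```r
# Hypothetical helper (not part of blorr): McFadden's R^2 where the
# intercept-only model is refitted on exactly the rows glm() used.
mcfadden_same_rows <- function(fit) {
  mf <- model.frame(fit)  # rows that survived the NA handling in glm()
  null_fit <- glm(update(formula(fit), . ~ 1), data = mf, family = family(fit))
  1 - as.numeric(logLik(fit)) / as.numeric(logLik(null_fit))
}

# Simulated data (an assumption for illustration), with NAs in the predictors
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 * d$x1 - 0.25 * d$x2))
d$x1[sample.int(n, 50)] <- NA
d$x2[sample.int(n, 50)] <- NA

# Same model, once on the data with NAs and once on the complete cases
fit_na <- glm(y ~ x1 + x2, data = d, family = binomial)
fit_cc <- glm(y ~ x1 + x2, data = d[complete.cases(d), ], family = binomial)

r2_na <- mcfadden_same_rows(fit_na)
r2_cc <- mcfadden_same_rows(fit_cc)
```

With this approach both calls are based on the same observations, so `r2_na` and `r2_cc` should agree, which is the behaviour I would expect from blr_rsq_mcfadden as well.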

Aravind Hebbali · Answer 1 · Tue Jun 01 2021 11:51:34 GMT+0800 (China Standard Time)

Hi @Irazall

Thank you very much for bringing this to our attention. Based on your suggestion, we have decided to review the blorr API using data sets with missing data and to fix any bugs that this uncovers.