IQSS / Zelig

A statistical framework that serves as a common interface to a large range of models

Home Page: http://zeligproject.org

Tobit expected values incorrect?

izahn opened this issue

Consider these two models (from the tobit vignette)

library(survival)  # survreg(), Surv(), and the tobin data
library(Zelig)     # zelig(), setx(), sim()

m1 <- survreg(Surv(durable, durable>0, type='left') ~ age + quant,
              data=tobin, dist='gaussian')

z.out <- zelig(durable ~ age + quant, model = "tobit", data = tobin)

coefs and vcov are the same:

all.equal(vcov(z.out)[[1]], vcov(m1))
## TRUE
all.equal(coef(z.out), coef(m1))
## TRUE

but the expected values are very different:

predict(m1,
        newdata = data.frame(lapply(model.frame(m1)[-1], mean)),
        se.fit = TRUE)
## $fit
##        1 
## -2.067679 
##
## $se.fit
##        1 
## 1.962933
ev <- sim(setx(z.out))$get_qi(qi = "ev")
c(mean = mean(ev), sd = sd(ev))
##      mean        sd 
## 1.5536944 0.6302974 

Do we expect these to be so different? If so, why?
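One thing that may be worth checking first is which scale each number is on. The sketch below assumes only the survival package and the `m1` fit from above. It checks whether `predict()` on a gaussian `survreg` fit returns the linear predictor, i.e. the latent mean E[Y*], and then plugs that value into the standard formula for the mean of a normal variable left-censored at 0. The names `nd`, `lp`, `sigma` and `ey` are just illustrative, and this is not a claim about what Zelig does internally.

```r
library(survival)

# Same model as m1 above.
m1 <- survreg(Surv(durable, durable > 0, type = "left") ~ age + quant,
              data = tobin, dist = "gaussian")

# Covariates set to their sample means, as in the predict() call above.
nd <- data.frame(lapply(model.frame(m1)[-1], mean))

# Linear predictor computed by hand: intercept + age * b_age + quant * b_quant.
lp <- sum(coef(m1) * c(1, unlist(nd)))

# If this is TRUE, predict() is reporting the latent mean E[Y*].
all.equal(as.numeric(predict(m1, newdata = nd)), lp)

# Mean of max(Y*, 0) when Y* ~ N(lp, sigma^2): the censored-scale E[Y].
sigma <- m1$scale
ey <- lp * pnorm(lp / sigma) + sigma * dnorm(lp / sigma)
c(latent = lp, censored = ey)
```

If the two printed values differ in the same direction as the `predict()` and `sim()` results, a latent versus censored scale mismatch would be a natural first suspect.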

Before digging into this further, is this issue adequately addressed by #315?

Thanks for the info. I'll take a look.

For reference:

Zelig/R/model-tobit.R

Lines 148 to 163 in 0f4e1f1

ztobit$methods(
  qi = function(simparam, mm) {
    Coeff <- simparam$simparam %*% t(mm)
    SD <- simparam$simalpha
    alpha <- simparam$simalpha
    lambda <- dnorm(Coeff / SD) / (pnorm(Coeff / SD))
    ev <- pnorm(Coeff / SD) * (Coeff + SD * lambda)
    pv <- ev
    pv <- matrix(nrow = nrow(ev), ncol = ncol(ev))
    for (j in 1:ncol(ev)) {
      pv[, j] <- rnorm(nrow(ev), mean = ev[, j], sd = SD)
      pv[, j] <- pmin(pmax(pv[, j], .self$below), .self$above)
    }
    return(list(ev = ev, pv = pv))
  }
)
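The `ev` line in the quoted method has the form pnorm(mu/sigma) * (mu + sigma * lambda), which is the usual expression for the mean of a normal variable left-censored at 0. As written, it uses only `Coeff` and `SD`, not `.self$below` or `.self$above`. A quick way to convince oneself of what that formula measures is to compare it with a brute-force simulation. The sketch below uses base R only; `censored_mean` is a hypothetical helper, not a Zelig function, and the values of `mu` and `sigma` are made up for illustration.

```r
# Hypothetical helper (not part of Zelig): mean of max(Y*, 0) for Y* ~ N(mu, sigma^2).
censored_mean <- function(mu, sigma) {
  z <- mu / sigma
  # Algebraically the same as pnorm(z) * (mu + sigma * dnorm(z) / pnorm(z)),
  # but it avoids the 0/0 that the ratio gives when pnorm(z) underflows.
  mu * pnorm(z) + sigma * dnorm(z)
}

set.seed(1)
mu <- -2      # illustrative latent mean, not an estimate from this thread
sigma <- 5    # illustrative scale, not an estimate from this thread

y_star <- rnorm(1e6, mean = mu, sd = sigma)  # latent draws
y <- pmax(y_star, 0)                         # left-censor at 0

c(latent_mc        = mean(y_star),
  censored_mc      = mean(y),
  censored_formula = censored_mean(mu, sigma))
```

The formula and the simulated censored mean should agree closely, and both can sit well above a negative latent mean. So a positive `ev` next to a negative `predict()` value is not contradictory on its own.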

This is a hasty test to see whether the issue is that Zelig reports Y rather than Y*. The example and the procedure for calculating E[Y] are taken directly from: https://stats.stackexchange.com/a/149529

# Create data 
N = 10
f = rep(c("s1","s2","s3","s4","s5","s6","s7","s8"),N)
fcoeff = rep(c(-1,-2,-3,-4,-3,-5,-10,-5),N)
set.seed(100) 
x = rnorm(8*N)+1
beta = 5
epsilon = rnorm(8*N,sd = sqrt(1/5))
y.star = x*beta+fcoeff+epsilon ## latent response
y = y.star 
y[y<0] <- 0 ## censored response
test_data <- data.frame(y = y, zero = 0, x = x, f = f)

# Estimate with AER -------------------------------------------------------
library(AER)
fit <- tobit(y~ 0 + x + f, data = test_data)

# E[Y*] ---------------
mean(predict(fit))

## [1] 0.9823243

# E[Y] -----------
mu <- fitted(fit)
sigma <- fit$scale
p0 <- pnorm(mu/sigma)
lambda <- function(x) dnorm(x)/pnorm(x)
ey0 <- mu + sigma * lambda(mu/sigma)
ey <- p0 * ey0
c(mean = mean(ey, na.rm = TRUE), sd = sd(ey, na.rm = TRUE))

##    mean       sd 
## 2.544692 3.354458 

# Estimate with Zelig -------------------------------------------------------
zfit <- zelig(y~ 0 + x + f, data = test_data, model = "tobit", cite = FALSE)

# E[Y*] ---------------
mean(unlist(predict(zfit)))

## [1] 0.9823243

# E[Y] ???? -----------
ev <- sim(setx(zfit))$get_qi(qi = "ev")
c(mean = mean(ev), sd = sd(ev))

##     mean        sd 
## 0.6255104 0.1359505 

Note that in this example:

mean(y)

## [1] 2.51392

sd(y)

## [1] 3.353757

These are almost identical to the E[Y] mean and sd obtained with Achim's procedure from Stack Exchange, but very different from Zelig's. The ev from Zelig is also fairly different from the E[Y*] from predict.
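One caveat when reading this comparison: `mean(ey)` averages E[Y | x_i] over all observations, while `sim(setx(zfit))` evaluates the quantity at a single covariate profile (setx defaults to central values). Because the censored mean is a convex function of the latent mean, the two need not agree even if both are computed correctly. Likewise, `sd(ey)` measures spread across observations, while `sd(ev)` measures spread across simulation draws. The base R sketch below illustrates the first point with made-up numbers; `censored_mean` is a hypothetical helper, and none of this establishes what actually causes the gap seen here.

```r
# Hypothetical helper: mean of max(Y*, 0) for Y* ~ N(mu, sigma^2).
censored_mean <- function(mu, sigma) mu * pnorm(mu / sigma) + sigma * dnorm(mu / sigma)

set.seed(2)
sigma <- 0.5
mu_i <- rnorm(80, mean = 1, sd = 4)  # made-up latent means, one per observation

# Average of the per-observation censored means.
avg_of_ev <- mean(censored_mean(mu_i, sigma))

# Censored mean evaluated once, at the average latent mean.
ev_at_avg <- censored_mean(mean(mu_i), sigma)

c(avg_of_ev = avg_of_ev, ev_at_avg = ev_at_avg)
```

The first number is never smaller than the second (Jensen's inequality), and the gap grows with the spread of the latent means relative to sigma.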

@cchoirat am I missing something about what Zelig is trying to achieve, or have I misinterpreted the results of this test?