IQSS / Zelig

Consider these two models (from the tobit vignette)

m1 <- survreg(Surv(durable, durable>0, type='left') ~ age + quant,
              data=tobin, dist='gaussian')

z.out <- zelig(durable ~ age + quant, model = "tobit", data = tobin)

coefs and vcov are the same:

all.equal(vcov(z.out)[[1]], vcov(m1))
## TRUE
all.equal(coef(z.out), coef(m1))
## TRUE

but the expected values as very different

predict(m1,
        newdata = data.frame(lapply(model.frame(m1)[-1], mean)),
        se.fit = TRUE)
## $fit
##        1 
## -2.067679 
##
## $se.fit
##        1 
## 1.962933
ev <- sim(setx(z.out))$get_qi(qi = "ev")
c(mean = mean(ev), sd = sd(ev))
##      mean        sd 
## 1.5536944 0.6302974

Do we expect these to be so different? If so, why?

We wrap 'AER' ( https://stats.stackexchange.com/questions/149091/censored-regression-in-r). We should take a closer look, but it might boil down to predicting Y vs Y*.

…

On Thu, Nov 9, 2017 at 8:59 PM, Ista Zahn ***@***.***> wrote: Consider these two models (from the tobit vignette) m1 <- survreg(Surv(durable, durable>0, type='left') ~ age + quant, data=tobin, dist='gaussian') z.out <- zelig(durable ~ age + quant, model = "tobit", data = tobin) coefs and vcov are the same: all.equal(vcov(z.out)[[1]], vcov(m1)) ## TRUE all.equal(coef(z.out), coef(m1)) ## TRUE but the expected values as very different predict(m1, newdata = data.frame(lapply(model.frame(m1)[-1], mean)), se.fit = TRUE) ## $fit ## 1 ## -2.067679 ## ## $se.fit ## 1 ## 1.962933 ev <- sim(setx(z.out))$get_qi(qi = "ev") c(mean = mean(ev), sd = sd(ev)) ## mean sd ## 1.5536944 0.6302974 Do we expect these to be so different? If so, why? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#314>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AD-wZS4P2Hguu997yScPFP3RsTzaoAuGks5s064HgaJpZM4QY9mo> .

Before digging into this further, is this issue adequately addressed by #315?

No, expected values for tobit models are not as documented. Either the documentation or the implimentation is wrong. I suspect the implimentation. On Nov 11, 2017 5:05 AM, "Christopher Gandrud" <notifications@github.com> wrote: Before digging into this further, is this issue adequately addressed by #315 <#315>? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#314 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AATm0zy2X2d595XJKx1i4H9FIImg5mP2ks5s1XGBgaJpZM4QY9mo> .

Thanks for the info. I'll take a look.

For reference:

Zelig/R/model-tobit.R

Lines 148 to 163 in 0f4e1f1

    
           ztobit$methods( 
        
             qi = function(simparam, mm) { 
        
               Coeff <- simparam$simparam %*% t(mm) 
        
               SD <- simparam$simalpha 
        
               alpha <- simparam$simalpha 
        
               lambda <- dnorm(Coeff / SD) / (pnorm(Coeff / SD)) 
        
               ev <- pnorm(Coeff / SD) * (Coeff + SD * lambda) 
        
               pv <- ev 
        
               pv <- matrix(nrow = nrow(ev), ncol = ncol(ev)) 
        
               for (j in 1:ncol(ev)) { 
        
                 pv[, j] <- rnorm(nrow(ev), mean = ev[, j], sd = SD) 
        
                 pv[, j] <- pmin(pmax(pv[, j], .self$below), .self$above) 
        
               } 
        
               return(list(ev = ev, pv = pv)) 
        
             } 
        
           )

This is a hasty test to see if the Y, rather than Y* is the issue. The example and Y calculation procedure is taken directly from: https://stats.stackexchange.com/a/149529

# Create data 
N = 10
f = rep(c("s1","s2","s3","s4","s5","s6","s7","s8"),N)
fcoeff = rep(c(-1,-2,-3,-4,-3,-5,-10,-5),N)
set.seed(100) 
x = rnorm(8*N)+1
beta = 5
epsilon = rnorm(8*N,sd = sqrt(1/5))
y.star = x*beta+fcoeff+epsilon ## latent response
y = y.star 
y[y<0] <- 0 ## censored response
test_data <- data.frame(y = y, zero = 0, x = x, f = f)

# Estimate with AER -------------------------------------------------------
library(AER)
fit <- tobit(y~ 0 + x + f, data = test_data)

# E[Y*] ---------------
mean(predict(fit))

## [1] 0.9823243

# E[Y] -----------
mu <- fitted(fit)
sigma <- fit$scale
p0 <- pnorm(mu/sigma)
lambda <- function(x) dnorm(x)/pnorm(x)
ey0 <- mu + sigma * lambda(mu/sigma)
ey <- p0 * ey0
c(mean = mean(ey, na.rm = T), sd = sd(ey, na.rm = T))

##    mean       sd 
## 2.544692 3.354458 

# Estimate with Zelig -------------------------------------------------------
zfit <- zelig(y~ 0 + x + f, data = test_data, model = "tobit", cite = FALSE)

# E[Y*] ---------------
mean(unlist(predict(zfit)))

## [1] 0.9823243

ev <- sim(setx(zfit))$get_qi(qi = "ev")
c(mean = mean(ev), sd = sd(ev))

# E[Y] ???? -----------
ev <- sim(setx(zfit))$get_qi(qi = "ev")
c(mean = mean(ev), sd = sd(ev))

##     mean        sd 
## 0.6255104 0.1359505

Note that in this example:

mean(y)

## [1] 2.51392

sd(y)

## [1] 3.353757

which is almost identical to the E[Y] using Achim's procedure from Stack Exchange, but very different from Zelig. The ev from Zelig is also fairly different from the E[Y*] from predict.

@cchoirat am I missing something about what Zelig is trying to achieve (or misinterpreted the results from this test)?

	ztobit$methods(
	qi = function(simparam, mm) {
	Coeff <- simparam$simparam %*% t(mm)
	SD <- simparam$simalpha
	alpha <- simparam$simalpha
	lambda <- dnorm(Coeff / SD) / (pnorm(Coeff / SD))
	ev <- pnorm(Coeff / SD) * (Coeff + SD * lambda)
	pv <- ev
	pv <- matrix(nrow = nrow(ev), ncol = ncol(ev))
	for (j in 1:ncol(ev)) {
	pv[, j] <- rnorm(nrow(ev), mean = ev[, j], sd = SD)
	pv[, j] <- pmin(pmax(pv[, j], .self$below), .self$above)
	}
	return(list(ev = ev, pv = pv))
	}
	)

Tobit expected values incorrect?