Discrepancy between relogit results in Stata and R
christophergandrud opened this issue · comments
Original comment at: #88 (comment)
Sorry @retrography, I should have asked earlier: can you provide the data and R code you used to create #88 (comment)?
After some digging, I don't think the large weighting discrepancy has to do with the matrix inversion procedure.
Here are a number of examples using the mid dataset included with Zelig:
In Stata:
. relogit conflict major contig power maxdem mindem years, wc(0.003430204)
Corrected logit estimates Number of obs = 3126
------------------------------------------------------------------------------
| Robust
conflict | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
major | 1.672177 .2775353 6.03 0.000 1.128218 2.216137
contig | 4.016402 .2288064 17.55 0.000 3.56795 4.464855
power | .2883612 .414412 0.70 0.487 -.5238715 1.100594
maxdem | .0662866 .0191863 3.45 0.001 .0286822 .1038911
mindem | -.0814281 .0298655 -2.73 0.006 -.1399634 -.0228928
years | -.1170734 .013316 -8.79 0.000 -.1431723 -.0909745
_cons | -6.618889 .3164648 -20.92 0.000 -7.239149 -5.99863
------------------------------------------------------------------------------
vs. the two procedures in Zelig (https://github.com/IQSS/Zelig/blob/master/R/model-relogit.R#L259):
## Zelig 5.1-3 implementation
## Qdiag <- lm.influence(lm(y ~ X - 1, weights = W), do.coef = FALSE)$hat / W
z.out1 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years,
data = mid, model = "relogit", tau = 1042/303772, cite = FALSE,
case.control = "weighting")
summary(z.out1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.97463 1.13971 -5.242 1.59e-07
major 1.69910 0.77375 2.196 0.0281
contig 3.87763 0.73955 5.243 1.58e-07
power 0.32808 1.05690 0.310 0.7562
maxdem 0.06009 0.05088 1.181 0.2376
mindem -0.04737 0.07342 -0.645 0.5188
years -0.09634 0.04709 -2.046 0.0408
## Old implementation
## Qdiag <- diag(X%*%solve(t(X)%*%diag(W)%*%X)%*%t(X))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.97463 1.13971 -5.242 1.59e-07
major 1.69910 0.77375 2.196 0.0281
contig 3.87763 0.73955 5.243 1.58e-07
power 0.32808 1.05690 0.310 0.7562
maxdem 0.06009 0.05088 1.181 0.2376
mindem -0.04737 0.07342 -0.645 0.5188
years -0.09634 0.04709 -2.046 0.0408
Caveat: I'm not sure if the latter is equivalent to the Stata routine. But the following tests provide more evidence that this isn't the cause of the major discrepancy.
Likely root of the problem
There do appear to be two major differences between Zelig and Stata. The first is the calculation of what in Zelig (and the 2001 ISQ article, eq 11) is denoted xi (https://github.com/IQSS/Zelig/blob/master/R/model-relogit.R#L263):
xi <- 0.5 * Qdiag * ((1 + w0) * pihat - w0)
This is used to bias-correct the coefficients (https://github.com/IQSS/Zelig/blob/master/R/model-relogit.R#L265).
In Stata (and the 2001 ISQ article, eq. 11) it's calculated with (on relogit.ado line 101):
gen `ksi'= .5 * (`sum') * [(1+`wt1')*`pi'-`wt1']
Where w0 = (1 - tau) / (1 - ybar) and w1 (wt1 in Stata) is found with tau / ybar.
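For reference, a minimal R sketch of the two weights and the two xi variants discussed here. tau is the value used in this thread; ybar = 1042/3126 is an assumption (it presumes all 1042 events are in the 3126-row sample), and pihat and Qdiag are made-up stand-ins, not values from the mid fit.

```r
# Population fraction of events (value used in this thread)
tau <- 1042 / 303772
# ASSUMPTION: sample fraction of events, i.e. mean(mid$conflict)
ybar <- 1042 / 3126

# Case-control weights
w1 <- tau / ybar              # weight for events (wt1 in relogit.ado)
w0 <- (1 - tau) / (1 - ybar)  # weight for non-events

# Hypothetical stand-ins for the fitted probabilities and diag(Q)
pihat <- c(0.05, 0.30, 0.80)
Qdiag <- c(0.002, 0.004, 0.003)

# The two variants of the correction term compared in this issue
xi_w0 <- 0.5 * Qdiag * ((1 + w0) * pihat - w0)  # Zelig 5.1-3
xi_w1 <- 0.5 * Qdiag * ((1 + w1) * pihat - w1)  # Stata relogit

print(cbind(xi_w0, xi_w1))
```

With rare events w1 is far below 1 while w0 is above 1, so the two variants can differ a lot for the same pihat and Qdiag.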
Using w1 (i.e. wt1) instead of w0 in Zelig produces:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.61889 1.13971 -5.808 6.34e-09
major 1.67218 0.77375 2.161 0.0307
contig 4.01640 0.73955 5.431 5.61e-08
power 0.28836 1.05690 0.273 0.7850
maxdem 0.06629 0.05088 1.303 0.1926
mindem -0.08143 0.07342 -1.109 0.2674
years -0.11707 0.04709 -2.486 0.0129
Now the coefficients are almost identical between Stata and Zelig. Of course, the standard errors are still noticeably different. This is because Stata relogit uses robust standard errors by default. If these are not used:
. relogit conflict major contig power maxdem mindem years, wc(0.003430204) norobust
Warning: Traditional variance estimates do not make sense
with the wc() option
Corrected logit estimates Number of obs = 3126
------------------------------------------------------------------------------
conflict | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
major | 1.672177 .7720263 2.17 0.030 .1590337 3.185321
contig | 4.016402 .737902 5.44 0.000 2.570141 5.462664
power | .2883612 1.054539 0.27 0.785 -1.778498 2.35522
maxdem | .0662866 .0507657 1.31 0.192 -.0332122 .1657855
mindem | -.0814281 .0732608 -1.11 0.266 -.2250166 .0621603
years | -.1170734 .0469903 -2.49 0.013 -.2091727 -.0249741
_cons | -6.618889 1.13717 -5.82 0.000 -8.847701 -4.390077
------------------------------------------------------------------------------
Now the standard errors and coefficients are almost identical (using the lm.influence matrix inversion routine).
Open issues
- Why is xi <- 0.5 * Qdiag * ((1 + w0) * pihat - w0) used in Zelig rather than the Stata equivalent xi <- 0.5 * Qdiag * ((1 + w1) * pihat - w1)? Should we switch to the latter as this matches eq. 11 in the 2001 ISQ article and Stata relogit?
- Given the warning produced by Stata relogit when estimating weighted correction without robust standard errors (Warning: Traditional variance estimates do not make sense with the wc() option) and that Zelig doesn't implement robust standard errors, does it make sense to allow weighted correction in Zelig? Or should robust standard errors always be returned for relogit with weighted correction?
I haven't got the time to go back to my data and make a sample for you yet, but it seems you have already made headway on the issue. These are very interesting findings that will probably make me go and revise all the figures in my paper...
One thing I have noticed is that King & Zeng (2001) insist that weighting is the better method when the data is large and there is a possibility of model misspecification. So, I always picked wc by default, and I was surprised when I moved to Zelig and the default was pc. I did not even notice that the robust standard errors were not implemented in Zelig. This warrants a clear explanation in the documentation at least.
I'm not a statistician, but apparently it is not very difficult to implement the robust standard errors either: https://stats.stackexchange.com/questions/117052/replicating-statas-robust-option-in-r
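As a rough illustration of what such a robust (sandwich) estimate involves, here is a base-R sketch for a plain logit on simulated data. The data, the variable names, and the HC1-style n / (n - k) factor are assumptions for illustration; this is not a claim about the exact small-sample factor relogit.ado applies, and it leaves out the case-control weights.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))  # simulated outcome, illustration only
fit <- glm(y ~ x, family = binomial)

X <- model.matrix(fit)
p <- fitted(fit)
k <- ncol(X)

# "Bread": inverse of X' diag(p * (1 - p)) X, the model-based covariance
bread <- solve(crossprod(X, X * (p * (1 - p))))
# "Meat": sum of outer products of the per-observation scores
scores <- X * (y - p)
meat <- crossprod(scores)

# HC1-style sandwich with an n / (n - k) small-sample factor (assumed)
vc_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_model <- sqrt(diag(vcov(fit)))
se_robust <- sqrt(diag(vc_hc1))
print(cbind(model = se_model, robust = se_robust))
```

This is intended to mirror what vcovHC(fit, type = "HC1") from the sandwich package computes for a glm, though that equivalence is worth checking on your own fit.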
I confirm that with your modification and the robust standard errors explained in the above link, the results from Stata and R are almost identical:
Original Zelig (1):
> relogit(clo ~ col + fam_sim + desc_sim + age_reused + size_reused + indegree_reused + activity_reused, data = x$sample, tau = x$tau, case.control = "weighting") %>% summary
Call:
relogit(formula = clo ~ col + fam_sim + desc_sim + age_reused +
size_reused + indegree_reused + activity_reused, data = x$sample,
tau = x$tau, case.control = "weighting")
Deviance Residuals:
Min 1Q Median 3Q Max
-0.91454 -0.00535 -0.00394 -0.00301 0.57550
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.78191 1.13907 -10.343 < 0.0000000000000002 ***
colTRUE 4.37017 1.18009 3.703 0.000213 ***
fam_simTRUE 0.43286 0.75625 0.572 0.567063
desc_sim 5.92926 1.17094 5.064 0.000000411 ***
age_reused -0.69419 0.45799 -1.516 0.129584
size_reused 0.35852 0.18196 1.970 0.048795 *
indegree_reused 0.46649 0.37749 1.236 0.216544
activity_reused 0.06543 0.20958 0.312 0.754889
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 202.62 on 586537 degrees of freedom
Residual deviance: 163.95 on 586530 degrees of freedom
AIC: 32.584
Number of Fisher Scoring iterations: 15
Zelig with the suggested modification --- but no robust (2):
> summary(x$regs$zelig.wc)
Call:
relogit(formula = clo ~ col + fam_sim + desc_sim + age_reused +
size_reused + indegree_reused + activity_reused, data = sample,
tau = itr$tau, case.control = "weighting")
Deviance Residuals:
Min 1Q Median 3Q Max
-0.73181 -0.00449 -0.00329 -0.00245 0.58745
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.33958 0.71156 -17.342 < 0.0000000000000002 ***
colTRUE 4.29274 1.18009 3.638 0.000275 ***
fam_simTRUE 0.52282 0.75625 0.691 0.489356
desc_sim 5.76395 1.17094 4.923 0.000000854 ***
age_reused -0.76846 0.45799 -1.678 0.093363 .
size_reused 0.30820 0.18196 1.694 0.090297 .
indegree_reused 0.40657 0.37749 1.077 0.281466
activity_reused 0.06756 0.20958 0.322 0.747172
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 202.62 on 586537 degrees of freedom
Residual deviance: 163.95 on 586530 degrees of freedom
AIC: 32.584
Number of Fisher Scoring iterations: 15
Zelig with modification and robust estimation (3):
> coeftest(x$regs$zelig.wc, vcov = vcovHC(x$regs$zelig.wc, "HC1"))
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.339579 0.098383 -125.4236 < 0.00000000000000022 ***
colTRUE 4.292744 0.284716 15.0773 < 0.00000000000000022 ***
fam_simTRUE 0.522820 0.089550 5.8383 0.000000005274 ***
desc_sim 5.763953 0.241278 23.8892 < 0.00000000000000022 ***
age_reused -0.768462 0.088255 -8.7073 < 0.00000000000000022 ***
size_reused 0.308204 0.020832 14.7944 < 0.00000000000000022 ***
indegree_reused 0.406568 0.094058 4.3225 0.000015423808 ***
activity_reused 0.067561 0.045451 1.4865 0.1372
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Stata (4):
. relogit clo col fam_sim desc_sim age_reused size_reused indegree_reused activity_reused, wc(0.0000142020456521824) level(99)
Corrected logit estimates Number of obs = 586538
------------------------------------------------------------------------------
| Robust
clo | Coef. Std. Err. z P>|z| [99% Conf. Interval]
-------------+----------------------------------------------------------------
col | 4.292744 .2847105 15.08 0.000 3.559379 5.02611
fam_sim | .52282 .0895486 5.84 0.000 .292158 .753482
desc_sim | 5.763953 .2412736 23.89 0.000 5.142473 6.385433
age_reused | -.7684625 .0882528 -8.71 0.000 -.9957867 -.5411382
size_reused | .3082039 .020832 14.79 0.000 .2545441 .3618636
indegree_r~d | .4065681 .0940557 4.32 0.000 .1642968 .6488395
activity_r~d | .0675615 .0454501 1.49 0.137 -.0495102 .1846332
_cons | -11.92239 .1782095 -66.90 0.000 -12.38142 -11.46335
------------------------------------------------------------------------------
The question is whether line 263 has to be left the way it is in Zelig, or should it be modified to align with the Stata version?
By the way the standard error of the constant is half that of Stata after robust estimation (and Stata's z-value for constant is half of the one from Zelig with robust standard errors). Which ones to retain?
Note: I forgot to mean-center the variables in (2) and (3), hence the difference in the size of the intercept. Otherwise they would be identical.
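A small sketch of why centering only moves the intercept, using a plain logit on simulated data (all names and values here are made up for illustration):

```r
set.seed(2)
n <- 400
x <- rnorm(n, mean = 5)
y <- rbinom(n, 1, plogis(-4 + 0.7 * x))  # simulated outcome, illustration only

xc <- x - mean(x)  # mean-centered covariate

fit_raw <- glm(y ~ x, family = binomial)
fit_centered <- glm(y ~ xc, family = binomial)

# Slopes agree; the centered intercept is the raw intercept shifted by
# slope * mean(x)
print(rbind(raw = coef(fit_raw), centered = coef(fit_centered)))
print(coef(fit_raw)[1] + coef(fit_raw)[2] * mean(x))
```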
OK, I found it in the paper. See King & Zeng (2001), Page 147, Formula 11. They clearly mention that it is W1 that we use in the calculation of ksi. The Stata version seems to have the right implementation.
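For reference, the correction term as implemented in the Stata line quoted earlier in this thread, written out in LaTeX. The symbols are chosen here to mirror the code ($Q_{ii}$ for Qdiag, $\hat{\pi}_i$ for pihat); the paper's own notation may differ slightly.

```latex
\xi_i = 0.5\, Q_{ii}\,\bigl[(1 + w_1)\,\hat{\pi}_i - w_1\bigr],
\qquad
w_1 = \frac{\tau}{\bar{y}},
\qquad
w_0 = \frac{1 - \tau}{1 - \bar{y}}
```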
Thanks for digging into this some more.
I need to discuss why the changes were made (both to W1 and removing robust standard errors from Zelig). I know Gary's thinking about the latter has changed, but not sure if the reason for the change is consistent here.
Again, I think it is necessary that we add all the reasoning behind the changes to the documentation so that the users know what they are doing. The documentation still refers to the 2001 paper and I have included parts of the explanations in the paper in my thesis, while now I find out that the implementation does not correspond anymore to what I have cited! Rather problematic if you think about it... I'm looking forward to Gary's response on this, as I will be submitting the thesis and the related paper very soon.
Thanks for your great support!
Returned the coefficient correction to ISQ 2001, eq 11 in 76eed2d
Still thinking about the implementation of robust standard errors, namely should this just be for relogit with weighting or should it be a more general feature.
ef042ad also applies the robust standard errors for summary such that they almost match the Stata version:
Model:
Call:
relogit(formula = cbind(conflict, 1 - conflict) ~ major + contig +
power + maxdem + mindem + years, data = as.data.frame(.),
tau = 0.00343020423212146, bias.correct = TRUE, case.control = "weighting")
Deviance Residuals:
Min 1Q Median 3Q Max
-0.94933 -0.04958 -0.02173 0.19019 0.48253
Coefficients:
Estimate Std. Error (robust) z value Pr(>|z|)
(Intercept) -6.61889 0.31748 -20.848 < 2e-16 ***
major 1.67218 0.27842 6.006 1.9e-09 ***
contig 4.01640 0.22954 17.498 < 2e-16 ***
power 0.28836 0.41574 0.694 0.487925
maxdem 0.06629 0.01925 3.444 0.000573 ***
mindem -0.08143 0.02996 -2.718 0.006572 **
years -0.11707 0.01336 -8.764 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 143.116 on 3125 degrees of freedom
Residual deviance: 91.178 on 3119 degrees of freedom
AIC: 27.041
Number of Fisher Scoring iterations: 10
compared to Stata:
. relogit conflict major contig power maxdem mindem years, wc(0.003430204)
Corrected logit estimates Number of obs = 3126
------------------------------------------------------------------------------
| Robust
conflict | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
major | 1.672177 .2775353 6.03 0.000 1.128218 2.216137
contig | 4.016402 .2288064 17.55 0.000 3.56795 4.464855
power | .2883612 .414412 0.70 0.487 -.5238715 1.100594
maxdem | .0662866 .0191863 3.45 0.001 .0286822 .1038911
mindem | -.0814281 .0298655 -2.73 0.006 -.1399634 -.0228928
years | -.1170734 .013316 -8.79 0.000 -.1431723 -.0909745
_cons | -6.618889 .3164648 -20.92 0.000 -7.239149 -5.99863
Still need to implement for sim
Should now be sorted in b3d3e59 including for sim
@christophergandrud I know this issue is already closed and I don't want to reopen it. But I really have to thank you for verifying and rectifying the issue.
No worries @retrography. Thanks for reporting and all of your work on this!
@christophergandrud There is an issue here. If I call relogit through the zelig function, both the robust standard errors and the odds ratios work. If I call relogit directly, neither works.
Unfortunately stargazer doesn't work with the output from the generic zelig function, so relogit is still needed:
> stargazer(itr$regs$zelig.wc,type="text")
Error in envRefInferField(x, what, getClass(class(x)), selfEnv) :
‘result’ is not a valid field or method name for reference class “Zelig-relogit”
Hm, I'll look into it.
In the meantime, please use from_zelig_model, e.g.:
library(stargazer)
library(Zelig)
data(mid)
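z.out1 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years,
                data = mid, model = "relogit", tau = 1042/303772)
# Extract the underlying fitted model and pass it to stargazer
# (nested call, so no pipe operator needs to be loaded)
stargazer(from_zelig_model(z.out1))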
@christophergandrud No, it actually doesn't work, because it returns the results from the relogit function, so you get the same old standard errors. See below.
model <-
  zelig(
    fun ~ col + fam_sim + desc_sim + age_reused + size_reused + indegree_reused + activity_reused,
    data = sample,
    tau = itr$tau,
    model = "relogit",
    case.control = "weighting",
    bias.correct = TRUE,
    cite = FALSE
  )
Let's summarize:
> summary(model)
Model:
Call:
relogit(formula = cbind(fun, 1 - fun) ~ col + fam_sim + desc_sim +
age_reused + size_reused + indegree_reused + activity_reused,
data = as.data.frame(.), tau = 9.16999274066485e-05, bias.correct = TRUE,
case.control = "weighting")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.17312 -0.00791 -0.00635 -0.00570 0.81701
Coefficients:
Estimate Std. Error (robust) z value Pr(>|z|)
(Intercept) -10.59231 0.03964 -267.243 < 2e-16 ***
colTRUE 2.41716 0.18265 13.234 < 2e-16 ***
fam_simTRUE 0.11727 0.04957 2.366 0.017984 *
desc_sim 3.38383 0.18966 17.841 < 2e-16 ***
age_reused -0.14672 0.02141 -6.854 7.16e-12 ***
size_reused -0.08268 0.02185 -3.784 0.000154 ***
indegree_reused 1.13638 0.01675 67.825 < 2e-16 ***
activity_reused 0.14772 0.01431 10.320 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1791.6 on 948704 degrees of freedom
Residual deviance: 1306.6 on 948697 degrees of freedom
AIC: 190.5
Number of Fisher Scoring iterations: 13
While:
> summary(from_zelig_model(model))
Call:
relogit(formula = cbind(fun, 1 - fun) ~ col + fam_sim + desc_sim +
age_reused + size_reused + indegree_reused + activity_reused,
data = as.data.frame(.), tau = 9.16999274066485e-05, bias.correct = TRUE,
case.control = "weighting")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.17312 -0.00791 -0.00635 -0.00570 0.81701
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.59231 0.21980 -48.190 < 2e-16 ***
colTRUE 2.41716 0.40564 5.959 2.54e-09 ***
fam_simTRUE 0.11727 0.22208 0.528 0.5975
desc_sim 3.38383 0.74293 4.555 5.25e-06 ***
age_reused -0.14672 0.08579 -1.710 0.0872 .
size_reused -0.08268 0.08930 -0.926 0.3545
indegree_reused 1.13638 0.07614 14.925 < 2e-16 ***
activity_reused 0.14772 0.06719 2.199 0.0279 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1791.6 on 948704 degrees of freedom
Residual deviance: 1306.6 on 948697 degrees of freedom
AIC: 190.5
Number of Fisher Scoring iterations: 13
Yes, you don't get the odds ratios, that doesn't work yet. You have to manually add them to the output of from_zelig_model(z.out1).
@christophergandrud The problem is not the odds ratios. Compare the two results above. What is missing is the robust standard errors. I am actually using the from_zelig_model function for the second one.
Yes, none of it is working. I won't have time to get to this for a while, so in the meantime if you want to use this function with a separate package you would have to extract the basic estimation and hack your way to the rest. Sorry, I should have been clearer :)
@christophergandrud That's what I am doing right now.
Sorry for the inconvenience and thanks for reporting these issues!
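One possible shape for the "extract the basic estimation and hack your way to the rest" workaround, as a sketch only. coef_table is a hypothetical helper, not part of Zelig, and the plain glm fit below merely stands in for the model returned by from_zelig_model().

```r
# Hypothetical helper: rebuild a z-test coefficient table from a fitted model
# and any covariance matrix (defaults to the model-based one).
coef_table <- function(fit, vc = vcov(fit)) {
  est <- coef(fit)
  se <- sqrt(diag(vc))
  z <- est / se
  p <- 2 * pnorm(-abs(z))
  cbind(Estimate = est, "Std. Error" = se, "z value" = z, "Pr(>|z|)" = p)
}

# Stand-in for the extracted model: a plain logit on simulated data
set.seed(3)
d <- data.frame(x = rnorm(300))
d$y <- rbinom(300, 1, plogis(-0.5 + d$x))
fit <- glm(y ~ x, data = d, family = binomial)

# With the default covariance this reproduces summary(fit)$coefficients
print(coef_table(fit))

# With a robust covariance (needs the sandwich package, not run here):
# coef_table(fit, vc = sandwich::vcovHC(fit, type = "HC1"))
```

The resulting standard-error column could then be handed to stargazer through its se argument, which takes a list of numeric vectors; whether that fits a given table layout is left to the reader to check.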