IQSS / Zelig

A statistical framework that serves as a common interface to a large range of models

Home Page: http://zeligproject.org

Discrepancy between relogit results in Stata and R

christophergandrud opened this issue

Original comment at: #88 (comment)

Sorry @retrography I should have asked earlier, can you provide the data and R code you used to create: #88 (comment)

After some digging, I don't think the large weighting discrepancy has to do with the matrix inversion procedure.

Here are a number of examples using the mid dataset included with Zelig:

In Stata:

. relogit conflict major contig power maxdem mindem years, wc(0.003430204)


Corrected logit estimates                             Number of obs =     3126

------------------------------------------------------------------------------
             |               Robust
    conflict |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       major |   1.672177   .2775353     6.03   0.000     1.128218    2.216137
      contig |   4.016402   .2288064    17.55   0.000      3.56795    4.464855
       power |   .2883612    .414412     0.70   0.487    -.5238715    1.100594
      maxdem |   .0662866   .0191863     3.45   0.001     .0286822    .1038911
      mindem |  -.0814281   .0298655    -2.73   0.006    -.1399634   -.0228928
       years |  -.1170734    .013316    -8.79   0.000    -.1431723   -.0909745
       _cons |  -6.618889   .3164648   -20.92   0.000    -7.239149    -5.99863
------------------------------------------------------------------------------

vs. the two procedures in Zelig (https://github.com/IQSS/Zelig/blob/master/R/model-relogit.R#L259):

## Zelig 5.1-3 implementation
## Qdiag <- lm.influence(lm(y ~ X - 1, weights = W), do.coef = FALSE)$hat / W

z.out1 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years,
                data = mid, model = "relogit", tau = 1042/303772, cite = FALSE,
                case.control = "weighting")
summary(z.out1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.97463    1.13971  -5.242 1.59e-07
major        1.69910    0.77375   2.196   0.0281
contig       3.87763    0.73955   5.243 1.58e-07
power        0.32808    1.05690   0.310   0.7562
maxdem       0.06009    0.05088   1.181   0.2376
mindem      -0.04737    0.07342  -0.645   0.5188
years       -0.09634    0.04709  -2.046   0.0408
## Old implementation
## Qdiag <- diag(X%*%solve(t(X)%*%diag(W)%*%X)%*%t(X))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.97463    1.13971  -5.242 1.59e-07
major        1.69910    0.77375   2.196   0.0281
contig       3.87763    0.73955   5.243 1.58e-07
power        0.32808    1.05690   0.310   0.7562
maxdem       0.06009    0.05088   1.181   0.2376
mindem      -0.04737    0.07342  -0.645   0.5188
years       -0.09634    0.04709  -2.046   0.0408

Caveat: I'm not sure if the latter is equivalent to the Stata routine. But the following tests provide more evidence that this isn't the cause of the major discrepancy.
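The two Qdiag formulas quoted above can be checked against each other numerically. The following is a minimal sketch on simulated data: X, y, and W are made up for illustration and are not the mid data. The hat values of a weighted least-squares fit, divided by the weights, should equal the diagonal of X (X'WX)^-1 X', so the two versions should agree up to floating-point error. This says nothing about whether either one matches the Stata routine.

```r
# Simulated inputs (illustrative only, not the mid dataset)
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))  # design matrix with an intercept column
y <- rnorm(n)                      # arbitrary response; hat values do not depend on it
W <- runif(n, 0.2, 2)              # strictly positive weights

# Zelig 5.1-3 style: hat values of the weighted fit, rescaled by the weights
qdiag_hat <- lm.influence(lm(y ~ X - 1, weights = W), do.coef = FALSE)$hat / W

# Older style: explicit matrix inversion
qdiag_direct <- diag(X %*% solve(t(X) %*% diag(W) %*% X) %*% t(X))

# Compare the two (names dropped so only the values are compared)
all.equal(unname(qdiag_hat), qdiag_direct)
```

The lm.influence route avoids forming the n-by-n matrix diag(W), which matters for samples with hundreds of thousands of rows.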

Likely root of the problem

There do appear to be two major differences between Zelig and Stata. The first is the calculation of what in Zelig (and the 2001 ISQ article, eq 11) is denoted xi (https://github.com/IQSS/Zelig/blob/master/R/model-relogit.R#L263):

xi <- 0.5 * Qdiag * ((1 + w0) * pihat - w0)

This is used to bias-correct the coefficients (https://github.com/IQSS/Zelig/blob/master/R/model-relogit.R#L265).

In Stata (and the 2001 ISQ article, eq. 11) it's calculated with (on relogit.ado line 101):

gen `ksi'= .5 * (`sum') * [(1+`wt1')*`pi'-`wt1']

Here w0 = (1 - tau) / (1 - ybar) and w1 (wt1 in Stata) = tau / ybar, where tau is the population fraction of events and ybar is the fraction of events in the sample.
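To make the difference between the two weights concrete, here is a small sketch of the weight and xi calculations. The helper names relogit_weights and xi_eq11 are hypothetical and are not part of Zelig. The value of ybar assumes that all 1042 events are among the 3126 sampled rows, which the thread does not state explicitly.

```r
# Hypothetical helpers, not Zelig functions
relogit_weights <- function(tau, ybar) {
  list(
    w1 = tau / ybar,              # weight for events (wt1 in relogit.ado)
    w0 = (1 - tau) / (1 - ybar)   # weight for non-events
  )
}

# xi as in ISQ 2001 eq. 11 and relogit.ado: uses w1, not w0
xi_eq11 <- function(Qdiag, pihat, w1) {
  0.5 * Qdiag * ((1 + w1) * pihat - w1)
}

tau  <- 1042 / 303772   # population fraction of events, from the thread
ybar <- 1042 / 3126     # ASSUMPTION: sample fraction of events
w <- relogit_weights(tau, ybar)
w

# Effect of swapping the weights on xi for one made-up observation
xi_eq11(Qdiag = 0.01, pihat = 0.2, w1 = w$w1)  # eq. 11 version
xi_eq11(Qdiag = 0.01, pihat = 0.2, w1 = w$w0)  # what results if w0 is used instead
```

With a rare event, w1 is close to zero while w0 is above one, so substituting one for the other changes the size of the correction noticeably.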

Using w1 (i.e. wt1) instead of w0 in Zelig produces:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.61889    1.13971  -5.808 6.34e-09
major        1.67218    0.77375   2.161   0.0307
contig       4.01640    0.73955   5.431 5.61e-08
power        0.28836    1.05690   0.273   0.7850
maxdem       0.06629    0.05088   1.303   0.1926
mindem      -0.08143    0.07342  -1.109   0.2674
years       -0.11707    0.04709  -2.486   0.0129

Now the coefficients are almost identical between Stata and Zelig. Of course, the standard errors are still noticeably different. This is because Stata relogit uses robust standard errors by default. If these are not used:

. relogit conflict major contig power maxdem mindem years, wc(0.003430204) norobust

Warning: Traditional variance estimates do not make sense
with the wc() option


Corrected logit estimates                             Number of obs =     3126

------------------------------------------------------------------------------
    conflict |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       major |   1.672177   .7720263     2.17   0.030     .1590337    3.185321
      contig |   4.016402    .737902     5.44   0.000     2.570141    5.462664
       power |   .2883612   1.054539     0.27   0.785    -1.778498     2.35522
      maxdem |   .0662866   .0507657     1.31   0.192    -.0332122    .1657855
      mindem |  -.0814281   .0732608    -1.11   0.266    -.2250166    .0621603
       years |  -.1170734   .0469903    -2.49   0.013    -.2091727   -.0249741
       _cons |  -6.618889    1.13717    -5.82   0.000    -8.847701   -4.390077
------------------------------------------------------------------------------

Now the standard errors and coefficients are almost identical (using the lm.influence matrix inversion routine).

Open issues

  • Why is xi <- 0.5 * Qdiag * ((1 + w0) * pihat - w0) used in Zelig rather than the Stata-equivalent xi <- 0.5 * Qdiag * ((1 + w1) * pihat - w1)? Should we switch to the latter, since it matches eq. 11 in the 2001 ISQ article and Stata relogit?

  • Given the warning produced by Stata relogit when estimating weighted correction without robust standard errors (Warning: Traditional variance estimates do not make sense with the wc() option) and that Zelig doesn't implement robust standard errors, does it make sense to allow weighted correction in Zelig? Or should robust standard errors always be returned for relogit with weighted correction?

I haven't had the time to go back to my data and make a sample for you yet, but it seems you have already made headway on the issue. These are very interesting findings that will probably make me go back and revise all the figures in my paper...

One thing I have noticed is that King & Zeng (2001) insist that weighting is the better method when the data is large and there is a possibility of model misspecification. So, I always picked wc by default, and I was surprised when I moved to Zelig and the default was pc. I did not even notice that the robust standard errors were not implemented in Zelig. This warrants a clear explanation in the documentation at least.

I'm not a statistician, but apparently it is not very difficult to implement the robust standard errors either: https://stats.stackexchange.com/questions/117052/replicating-statas-robust-option-in-r
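For reference, an HC1-type sandwich covariance for a weighted logit can be written in a few lines of base R. This is only a sketch under stated assumptions: robust_vcov_hc1 is a hypothetical helper, the data are simulated, and the n / (n - k) finite-sample factor is the HC1 convention, which may not be exactly the factor Stata applies.

```r
# Hypothetical helper: HC1 sandwich covariance for a glm fit (base R only)
robust_vcov_hc1 <- function(fit) {
  X <- model.matrix(fit)
  n <- nrow(X)
  k <- ncol(X)
  # Working weights times working residuals give the per-observation
  # score factor w_i * (y_i - mu_i) for the canonical logit link
  score_resid <- weights(fit, type = "working") * residuals(fit, type = "working")
  scores <- X * score_resid            # row i is x_i scaled by its score factor
  bread <- summary(fit)$cov.unscaled   # (X' W X)^-1
  meat <- crossprod(scores)            # sum of outer products of the scores
  (n / (n - k)) * bread %*% meat %*% bread
}

# Simulated example with case-control style weights (illustrative only)
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))
w <- ifelse(y == 1, 0.1, 1.4)
# Non-integer weights trigger a harmless warning with family = binomial
fit <- suppressWarnings(glm(y ~ x, family = binomial, weights = w))

V_robust <- robust_vcov_hc1(fit)
cbind(classical = sqrt(diag(vcov(fit))), robust = sqrt(diag(V_robust)))
```

The same construction is what the sandwich package's vcovHC(fit, type = "HC1") computes for a glm, so in practice that package is the more convenient route.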

I confirm that with your modification and the robust standard errors explained in the above link, the results from Stata and R are almost identical:

Original Zelig (1):

> relogit(clo ~ col + fam_sim + desc_sim + age_reused + size_reused + indegree_reused + activity_reused, data = x$sample, tau = x$tau, case.control = "weighting") %>% summary

Call:
relogit(formula = clo ~ col + fam_sim + desc_sim + age_reused + 
    size_reused + indegree_reused + activity_reused, data = x$sample, 
    tau = x$tau, case.control = "weighting")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.91454  -0.00535  -0.00394  -0.00301   0.57550  

Coefficients:
                 Estimate Std. Error z value             Pr(>|z|)    
(Intercept)     -11.78191    1.13907 -10.343 < 0.0000000000000002 ***
colTRUE           4.37017    1.18009   3.703             0.000213 ***
fam_simTRUE       0.43286    0.75625   0.572             0.567063    
desc_sim          5.92926    1.17094   5.064          0.000000411 ***
age_reused       -0.69419    0.45799  -1.516             0.129584    
size_reused       0.35852    0.18196   1.970             0.048795 *  
indegree_reused   0.46649    0.37749   1.236             0.216544    
activity_reused   0.06543    0.20958   0.312             0.754889    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 202.62  on 586537  degrees of freedom
Residual deviance: 163.95  on 586530  degrees of freedom
AIC: 32.584

Number of Fisher Scoring iterations: 15

Zelig with the suggested modification --- but no robust (2):

> summary(x$regs$zelig.wc)

Call:
relogit(formula = clo ~ col + fam_sim + desc_sim + age_reused + 
    size_reused + indegree_reused + activity_reused, data = sample, 
    tau = itr$tau, case.control = "weighting")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.73181  -0.00449  -0.00329  -0.00245   0.58745  

Coefficients:
                 Estimate Std. Error z value             Pr(>|z|)    
(Intercept)     -12.33958    0.71156 -17.342 < 0.0000000000000002 ***
colTRUE           4.29274    1.18009   3.638             0.000275 ***
fam_simTRUE       0.52282    0.75625   0.691             0.489356    
desc_sim          5.76395    1.17094   4.923          0.000000854 ***
age_reused       -0.76846    0.45799  -1.678             0.093363 .  
size_reused       0.30820    0.18196   1.694             0.090297 .  
indegree_reused   0.40657    0.37749   1.077             0.281466    
activity_reused   0.06756    0.20958   0.322             0.747172    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 202.62  on 586537  degrees of freedom
Residual deviance: 163.95  on 586530  degrees of freedom
AIC: 32.584

Number of Fisher Scoring iterations: 15

Zelig with modification and robust estimation (3):

> coeftest(x$regs$zelig.wc, vcov = vcovHC(x$regs$zelig.wc, "HC1"))

z test of coefficients:

                  Estimate Std. Error   z value              Pr(>|z|)    
(Intercept)     -12.339579   0.098383 -125.4236 < 0.00000000000000022 ***
colTRUE           4.292744   0.284716   15.0773 < 0.00000000000000022 ***
fam_simTRUE       0.522820   0.089550    5.8383        0.000000005274 ***
desc_sim          5.763953   0.241278   23.8892 < 0.00000000000000022 ***
age_reused       -0.768462   0.088255   -8.7073 < 0.00000000000000022 ***
size_reused       0.308204   0.020832   14.7944 < 0.00000000000000022 ***
indegree_reused   0.406568   0.094058    4.3225        0.000015423808 ***
activity_reused   0.067561   0.045451    1.4865                0.1372    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Stata (4):

. relogit clo col fam_sim desc_sim age_reused size_reused indegree_reused activity_reused, wc(0.0000142020456521824) level(99)

Corrected logit estimates                             Number of obs =   586538

------------------------------------------------------------------------------
             |               Robust
         clo |      Coef.   Std. Err.      z    P>|z|     [99% Conf. Interval]
-------------+----------------------------------------------------------------
         col |   4.292744   .2847105    15.08   0.000     3.559379     5.02611
     fam_sim |     .52282   .0895486     5.84   0.000      .292158     .753482
    desc_sim |   5.763953   .2412736    23.89   0.000     5.142473    6.385433
  age_reused |  -.7684625   .0882528    -8.71   0.000    -.9957867   -.5411382
 size_reused |   .3082039    .020832    14.79   0.000     .2545441    .3618636
indegree_r~d |   .4065681   .0940557     4.32   0.000     .1642968    .6488395
activity_r~d |   .0675615   .0454501     1.49   0.137    -.0495102    .1846332
       _cons |  -11.92239   .1782095   -66.90   0.000    -12.38142   -11.46335
------------------------------------------------------------------------------

The question is whether line 263 has to be left the way it is in Zelig, or should it be modified to align with the Stata version?

By the way the standard error of the constant is half that of Stata after robust estimation (and Stata's z-value for constant is half of the one from Zelig with robust standard errors). Which ones to retain?

Note: I forgot to mean-center the variables in (2) and (3), hence the difference in the size of the intercept. Otherwise they would be identical.

OK, I found it in the paper. See King & Zeng (2001), Page 147, Formula 11. They clearly mention that it is W1 that we use in the calculation of ksi. The Stata version seems to have the right implementation.

Thanks for digging into this some more.

I need to discuss why the changes were made (both to W1 and removing robust standard errors from Zelig). I know Gary's thinking about the latter has changed, but not sure if the reason for the change is consistent here.

Again, I think it is necessary that we add all the reasoning behind the changes to the documentation so that the users know what they are doing. The documentation still refers to the 2001 paper and I have included parts of the explanations in the paper in my thesis, while now I find out that the implementation does not correspond anymore to what I have cited! Rather problematic if you think about it... I'm looking forward to Gary's response on this, as I will be submitting the thesis and the related paper very soon.

Thanks for your great support!

Returned the coefficient correction to ISQ 2001, eq 11 in 76eed2d

Still thinking about the implementation of robust standard errors, namely should this just be for relogit with weighting or should it be a more general feature.

ef042ad also applies the robust standard errors for summary such that they almost match the Stata version:

Model: 

Call:
relogit(formula = cbind(conflict, 1 - conflict) ~ major + contig + 
    power + maxdem + mindem + years, data = as.data.frame(.), 
    tau = 0.00343020423212146, bias.correct = TRUE, case.control = "weighting")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.94933  -0.04958  -0.02173   0.19019   0.48253  

Coefficients:
            Estimate Std. Error (robust) z value Pr(>|z|)    
(Intercept) -6.61889             0.31748 -20.848  < 2e-16 ***
major        1.67218             0.27842   6.006  1.9e-09 ***
contig       4.01640             0.22954  17.498  < 2e-16 ***
power        0.28836             0.41574   0.694 0.487925    
maxdem       0.06629             0.01925   3.444 0.000573 ***
mindem      -0.08143             0.02996  -2.718 0.006572 ** 
years       -0.11707             0.01336  -8.764  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 143.116  on 3125  degrees of freedom
Residual deviance:  91.178  on 3119  degrees of freedom
AIC: 27.041

Number of Fisher Scoring iterations: 10

compared to Stata:

. relogit conflict major contig power maxdem mindem years, wc(0.003430204)


Corrected logit estimates                             Number of obs =     3126

------------------------------------------------------------------------------
             |               Robust
    conflict |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       major |   1.672177   .2775353     6.03   0.000     1.128218    2.216137
      contig |   4.016402   .2288064    17.55   0.000      3.56795    4.464855
       power |   .2883612    .414412     0.70   0.487    -.5238715    1.100594
      maxdem |   .0662866   .0191863     3.45   0.001     .0286822    .1038911
      mindem |  -.0814281   .0298655    -2.73   0.006    -.1399634   -.0228928
       years |  -.1170734    .013316    -8.79   0.000    -.1431723   -.0909745
       _cons |  -6.618889   .3164648   -20.92   0.000    -7.239149    -5.99863
------------------------------------------------------------------------------

Still need to implement for sim

Should now be sorted in b3d3e59 including for sim

@christophergandrud I know this issue is already closed and I don't want to reopen it. But I really have to thank you for verifying and rectifying the issue.

No worries @retrography. Thanks for reporting and all of your work on this!

@christophergandrud There is an issue here. If I call relogit through the zelig function, the robust standard errors and the odds ratios work. If I call relogit directly, neither works.

Unfortunately stargazer doesn't work with the output from the generic zelig function, so relogit is still needed:

> stargazer(itr$regs$zelig.wc,type="text")
Error in envRefInferField(x, what, getClass(class(x)), selfEnv) : 
  ‘result’ is not a valid field or method name for reference class “Zelig-relogit”

Hm, I'll look into it.

In the meantime, please use from_zelig_model, e.g.:

library(stargazer)
library(Zelig)
library(magrittr)  # provides %>%

data(mid)
z.out1 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years,
                data = mid, model = "relogit", tau = 1042/303772)

from_zelig_model(z.out1) %>% stargazer()

@christophergandrud No, it actually doesn't work, because it returns the results from the relogit function, so you get the same old standard errors. See below.

  model <- 
    zelig(
      fun ~ col + fam_sim + desc_sim + age_reused + size_reused + indegree_reused + activity_reused,
      data = sample,
      tau = itr$tau,
      model = "relogit",
      case.control = "weighting",
      bias.correct = TRUE,
      cite = FALSE
    )

Let's summarize:

> summary(model)
Model: 

Call:
relogit(formula = cbind(fun, 1 - fun) ~ col + fam_sim + desc_sim + 
    age_reused + size_reused + indegree_reused + activity_reused, 
    data = as.data.frame(.), tau = 9.16999274066485e-05, bias.correct = TRUE, 
    case.control = "weighting")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.17312  -0.00791  -0.00635  -0.00570   0.81701  

Coefficients:
                 Estimate Std. Error (robust)  z value Pr(>|z|)    
(Intercept)     -10.59231             0.03964 -267.243  < 2e-16 ***
colTRUE           2.41716             0.18265   13.234  < 2e-16 ***
fam_simTRUE       0.11727             0.04957    2.366 0.017984 *  
desc_sim          3.38383             0.18966   17.841  < 2e-16 ***
age_reused       -0.14672             0.02141   -6.854 7.16e-12 ***
size_reused      -0.08268             0.02185   -3.784 0.000154 ***
indegree_reused   1.13638             0.01675   67.825  < 2e-16 ***
activity_reused   0.14772             0.01431   10.320  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1791.6  on 948704  degrees of freedom
Residual deviance: 1306.6  on 948697  degrees of freedom
AIC: 190.5

Number of Fisher Scoring iterations: 13

While:

> summary(from_zelig_model(model))
Call:
relogit(formula = cbind(fun, 1 - fun) ~ col + fam_sim + desc_sim + 
    age_reused + size_reused + indegree_reused + activity_reused, 
    data = as.data.frame(.), tau = 9.16999274066485e-05, bias.correct = TRUE, 
    case.control = "weighting")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.17312  -0.00791  -0.00635  -0.00570   0.81701  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -10.59231    0.21980 -48.190  < 2e-16 ***
colTRUE           2.41716    0.40564   5.959 2.54e-09 ***
fam_simTRUE       0.11727    0.22208   0.528   0.5975    
desc_sim          3.38383    0.74293   4.555 5.25e-06 ***
age_reused       -0.14672    0.08579  -1.710   0.0872 .  
size_reused      -0.08268    0.08930  -0.926   0.3545    
indegree_reused   1.13638    0.07614  14.925  < 2e-16 ***
activity_reused   0.14772    0.06719   2.199   0.0279 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1791.6  on 948704  degrees of freedom
Residual deviance: 1306.6  on 948697  degrees of freedom
AIC: 190.5

Number of Fisher Scoring iterations: 13

Yes, you don't get the odds ratios, that doesn't work yet. You have to manually add them to the output of from_zelig_model(z.out1).

@christophergandrud The problem is not the odds ratios. Compare the two results above. What is missing is the robust standard errors. I am actually using the from_zelig_model function for the second one.

Yes, none of it is working. I won't have time to get to this for a while, so in the meantime, if you want to use this function with a separate package, you would have to extract the basic estimation and hack your way to the rest. Sorry, I should have been clearer :)
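One way to "hack your way to the rest" is to rebuild the coefficient table by hand from the extracted fit and whatever covariance matrix you want to report. The sketch below is an assumption about how that could look, not Zelig functionality: coef_table is a hypothetical helper, the fit is a plain glm on simulated data standing in for the object returned by from_zelig_model, and vcov(fit) is only a placeholder default for V.

```r
# Hypothetical helper: coefficient table from a fit and a covariance matrix V
coef_table <- function(fit, V = vcov(fit)) {
  est <- coef(fit)
  se <- sqrt(diag(V))
  z <- est / se
  data.frame(
    estimate = est,
    std_error = se,
    z = z,
    p = 2 * pnorm(-abs(z)),   # two-sided normal p-value
    odds_ratio = exp(est)     # odds ratios added manually
  )
}

# Stand-in for the extracted model (simulated data, illustrative only)
set.seed(7)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-1 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

# Pass a robust covariance matrix as V to get robust standard errors
coef_table(fit)
```

For stargazer specifically, the se argument accepts a list of custom standard-error vectors, so the std_error column computed from a robust V can be supplied there alongside the extracted model.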

@christophergandrud That's what I am doing right now.

Sorry for the inconvenience and thanks for reporting these issues!