benbhansen-stats / propertee

Prognostic Regression Offsets with Propagation of ERrors, for Treatment Effect Estimation (IES R305D210029).

Home Page: https://benbhansen-stats.github.io/propertee/


model-based vcovDA's with absorption of small blocks

benthestatistician opened this issue · comments

commented

Consider the situation where lmitt.f() has absorbed block effects, including those of potentially small blocks, and we wish to present model-based SEs. Stock meat matrix estimators may lack consistency, for the same reason that in the Neyman-Scott problem maximum likelihood estimates of variance lack consistency. Perhaps the problem can be remedied with a degrees-of-freedom adjustment, perhaps a stock adjustment of this type. Or perhaps it calls for something new.
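To make the Neyman-Scott analogy concrete, here's a small simulation (hypothetical, not propertee code): with one pair per block, the maximum likelihood estimate of the error variance after absorbing block means converges to half the true variance, and a degrees-of-freedom correction repairs it.

```r
# Illustrative Neyman-Scott simulation: one pair per block, so the number of
# incidental block parameters grows with the sample. The variance MLE (RSS
# divided by n) after absorbing block means is inconsistent by a factor of 2.
set.seed(20230412)
n_pairs <- 5000
sigma2  <- 1
block_mu <- rnorm(n_pairs)                       # incidental block parameters
y1 <- block_mu + rnorm(n_pairs, sd = sqrt(sigma2))
y2 <- block_mu + rnorm(n_pairs, sd = sqrt(sigma2))
# Absorbing the block effect leaves residuals (y1 - ybar, y2 - ybar);
# the MLE divides the residual sum of squares by n = 2 * n_pairs.
ybar <- (y1 + y2) / 2
mle  <- sum((y1 - ybar)^2 + (y2 - ybar)^2) / (2 * n_pairs)
mle        # close to sigma2 / 2 = 0.5, not 1
mle * 2    # the d.f.-corrected estimate recovers sigma2
```

The factor of 2 here is the pairs special case of the block-size corrections discussed below.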

A primary challenge is to articulate meat matrix estimates $\hat{B}$ that are consistent for $B$. Let $\hat{B}(\tau)$ and ${B}(\tau)$ be estimated and true meat matrices with effect parameters held fixed at $\tau$. I conjecture that there are scale corrections to entries of the stock $\hat{B}(\tau)$, perhaps but not necessarily the same across the matrix, leading to a corrected estimator $\tilde{B}(\tau)$ for which $\mathbb{E}[\tilde{B}(\tau)] = B(\tau)$ (from which consistency of $\tilde{B}(\hat\tau)$ should follow under weak additional assumptions).

If all blocks are known to be small, some econometricians & statisticians recommend "clustering on" the block, as discussed in Cameron and Miller (2015, §III.B, p. 330), in tandem with certain degrees-of-freedom adjustments. We might consider methods that do this for small blocks but not for large ones. A part of the method would then have to be a rule for deciding what counts as a large block and what counts as a small one. I don't think it would make sense to spend time on this before first having a crack at the primary problem, namely estimating covariance against precisely the design articulated by the user, with units of assignment as clusters.
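For a concrete point of comparison outside propertee, "clustering on the block" can be sketched with off-the-shelf tools via the sandwich package (the data-generating code and variable names here are illustrative, not from any of the papers cited):

```r
# Sketch: compare a block-clustered SE to a heteroskedasticity-robust SE
# for a paired design with block fixed effects fit as dummies.
library(sandwich)
set.seed(1)
n_blocks <- 50
dat <- data.frame(
  b = factor(rep(seq_len(n_blocks), each = 2)),  # pairs as blocks
  a = rep(c(0, 1), n_blocks)                     # one treated unit per pair
)
dat$y <- rnorm(n_blocks)[dat$b] + 0.5 * dat$a + rnorm(nrow(dat))
fit <- lm(y ~ b + a, dat)
se_block <- sqrt(vcovCL(fit, cluster = dat$b)["a", "a"])  # cluster on block
se_hc1   <- sqrt(vcovHC(fit, type = "HC1")["a", "a"])     # unit-level robust
c(block_clustered = se_block, hc1 = se_hc1)
```

The question in this issue is what the default should do when blocks are absorbed rather than fit as dummies, and when blocks are a mix of small and large.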

commented

Some connections:

  • #85 introduced a user option for clustering on block, as some sources recommend for paired data. In contrast, this issue regards our default block accommodations.
  • In Cameron and Miller's (2015) discussion of the clustering on the block approach, they criticize Stata's absorb option as leading to inconsistent estimates of variance. From p.331:
    [screenshot: Cameron and Miller (2015), p. 331]
commented

Here's a paper addressing closely related issues that appears to be making its way through the meat-grinder of the econ publication process:
deChaisemartinRamirez-Cuellar22, level of clustering paired experiments.pdf

commented

@xinhew0708 points out that the upsilon estimating function as currently described in estfun_DA.tex (as of 12 Apr '23/ f2c412c) isn't something one can compute from observed data:
[screenshot: the upsilon estimating function as written in estfun_DA.tex]

This is an issue both for model- and design-based calculations.

commented

The problem pointed out by Xinhe and noted in my last comment admits a straightforward solution (0a90c07):
[screenshot: the corrected estimating function]

Returning to this, I've done some work based off the Freedman note linked earlier. I've derived a stratum-level adjustment factor to the maximum likelihood estimator of $B$ that makes the scaled estimator unbiased for the covariance matrix. Screenshots are below:
[three screenshots of the derivation, 15 June 2023]

commented

Very nice! You don't happen already to have looked into whether this correction appears in the literature in some form, do you? If not, I'd suggest putting that on your list. If so, I'll be interested to hear what you found, here or offline.

It may be that the matched pairs setting enables a simpler and cleaner story than would be possible for regression analyses with covariates, as well as block effects and/or absorption. If so, that is a conceptual/theoretical advantage of propertee/flexida's separation of covariance modeling and comparative regression that merits discussion in and of itself. For instance, I would think the R/stats people who read and review for software-oriented journals would be interested to hear that a project undertaken to promote software modularity wound up enabling a nice simplification on the math-stats side.

This adjustment looks to be one of the bias corrections mentioned in Section 6B of Cameron and Miller (2015) and in Section 2 of Bell and McCaffrey (2002), but neither paper extends it to stratified samples. The Cameron and Miller paper cites Bester, Conley, and Hansen (2011), who show that the t-statistic based on the adjusted estimator converges to a t-distribution. Mostly Harmless Econometrics by Angrist and Pischke, also cited by Cameron and Miller, doesn't address this adjustment exactly, but Chapter 8 relatedly discusses cluster-robust SEs.

Propertee's HC1 estimator was never rejecting in my simulations, due to an issue with absorb that I haven't yet pinned down. The code below recreates the issue: model 1 is an lm that fits the fixed effects and treatment effect in the same model. The sandwich package's HC1 estimator provides correct estimates because the fixed effects use up degrees of freedom. Model 2 is an lmitt model that fits a prior covariance adjustment model for the fixed effects and uses our cov_adj machinery to propagate uncertainty. Manually multiplying the HC0 estimate by 2, the scaling factor needed to correctly reduce the degrees of freedom, also provides correct estimates. Model 3 is an lmitt model that uses absorb. Its standard errors are much larger than the other two's, indicating some bug.

library(propertee)

n_strata <- 10
n_obs_per_strata <- 2
n <- n_strata * n_obs_per_strata
stratum_mus <- runif(n_strata, -2, 2)
var_eps <- 0.25
tau <- 0

id <- seq_len(n)
b <- factor(rep(seq_len(n_strata), each = n_obs_per_strata), levels = seq_len(n_strata))
a <- rep(rep(c(0, 1), each = n_obs_per_strata / 2), n_strata)
eps <- rnorm(n)
y <- rep(stratum_mus, each = n_obs_per_strata) + a * tau + eps
dat <- data.frame(id = id, b = b, a = a, y = y)

des <- rct_design(a ~ unitid(id) + block(b), dat)

> # model 1
> lm_fe_mod <- lm(y ~ b + a, dat, weights = ate(des))
> sandwich::sandwich(lm_fe_mod, meat. = sandwich::meatCL, type = "HC1",
+                    cadjust = FALSE)["a", "a"]
[1] 0.1310492
> 
> # model 2
> prior_fe_mod <- lm(y ~ b, dat)
> lmitt_abs_mod1 <- lmitt(y ~ 1, design = des, dat, weights = ate(des),
+                         offset = cov_adj(prior_fe_mod, by = "id"))
> vcovDA(lmitt_abs_mod1, type = "HC0")["a.", "a."] * 2
[1] 0.1306862
> 
> # model 3
> lmitt_abs_mod2 <- lmitt(y ~ 1, design = des, dat, absorb = TRUE, weights = ate(des))
> vcovDA(lmitt_abs_mod2, type = "HC1")["a.", "a."]
[1] 0.7273186

@benthestatistician @josherrickson

It's pretty clear, upon second thought, that this is due to not de-meaning the outcome variable. I added a line to lmitt that does so and get:

> vcovDA(lmitt_abs_mod2, type = "HC1")["a.", "a."]
[1] 0.1379465

On the i110_absorb branch, we get an HC1 variance estimate for model 3 that is equivalent to the variance estimate for the fixed effects model:

> set.seed(5150)
> n_strata <- 10
> n_obs_per_strata <- 2
> n <- n_strata * n_obs_per_strata
> stratum_mus <- runif(n_strata, -2, 2)
> var_eps <- 0.25
> tau <- 0
> 
> id <- seq_len(n)
> b <- factor(rep(seq_len(n_strata), each = n_obs_per_strata), levels = seq_len(n_strata))
> a <- rep(rep(c(0, 1), each = n_obs_per_strata / 2), n_strata)
> eps <- rnorm(n)
> y <- rep(stratum_mus, each = n_obs_per_strata) + a * tau + eps
> dat <- data.frame(id = id, b = b, a = a, y = y)
> 
> des <- rct_design(a ~ unitid(id) + block(b), dat)
> 
> # model 1
> lm_fe_mod <- lm(y ~ b + a, dat, weights = ate(des))
> sandwich::sandwich(lm_fe_mod, meat. = sandwich::meatCL, type = "HC1",
+                    cadjust = TRUE)["a", "a"]
[1] 0.3974276
> 
> # model 3
> lmitt_abs_mod2 <- lmitt(y ~ 1, design = des, dat, absorb = TRUE, weights = ate(des))
> vcovDA(lmitt_abs_mod2, type = "HC1")["a.", "a."]
[1] 0.3974276
commented

Good sleuthing about the simulation anomaly!

Question: I take it that "de-meaning" refers to the end of this block of R/lmitt.R, as added in 0cc8adb on branch i110_absorb:

  if (absorb) {
    mm <- apply(mm, 2, areg.center, as.factor(blocks), lm.call$weights)
    data$y <- areg.center(data$y, as.factor(blocks), lm.call$weights)
  }
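For concreteness, here's a hypothetical stand-in for the weighted group-centering step (the real helper is propertee's areg.center; this sketch only illustrates the idea, and all names are mine). By Frisch-Waugh-Lovell, regressing the centered outcome on the centered treatment indicator reproduces the point estimate from the dummy-variable fixed effects fit:

```r
# Illustrative weighted group centering: subtract the weighted block mean
# from each observation (a stand-in for areg.center, not its actual code).
wt_center <- function(x, g, w = rep(1, length(x))) {
  mu <- tapply(w * x, g, sum) / tapply(w, g, sum)  # weighted block means
  x - mu[as.character(g)]
}

set.seed(2)
g <- factor(rep(1:20, each = 3))
a <- rep(c(0, 1, 1), 20)
y <- rnorm(20)[g] + 0.3 * a + rnorm(60)
yc <- wt_center(y, g)
ac <- wt_center(a, g)
fit_dummies  <- lm(y ~ g + a)     # block effects fit as dummies
fit_absorbed <- lm(yc ~ 0 + ac)   # block effects absorbed by centering
coef(fit_dummies)[["a"]]
coef(fit_absorbed)[["ac"]]        # same point estimate
```

The point estimates agree exactly; the open question in this thread is what the centering implies for the variance estimate.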

Is this centering justified by the same reasoning that leads to your small-block d.f. adjustment @jwasserman2? It calls for some form of justification, given that the block means can't be handled via M-estimation in the way other covariance model coefficients are. If it is:

  1. I'd like to underscore that it isn't clear otherwise how it can be motivated. I.e., justifying this is a nice benefit of that argument you've developed.
  2. Should the y's be stratum-centered within the lmitt model, or should that occur in some later place more specific to covariance estimation? I'm not sure. Perhaps the answer would come into focus along with the answer to 3 below.
  3. As the estimating function $\acute{\psi}(\vec{z}; \tau, \mathbf{p}, \beta)$ is currently written (displayed eqn 10/eq:upsilondef), the y's have grand mean centering but not block mean centering. We should update that definition to reflect appropriate centering, with justification from your argument. I'd like this to be a part of the same branch, so that we get the justification as well as the code when we merge back into main.

Yep, de-meaning refers to the line you've quoted. My justification is based on equation 5.1.4 on p. 223 of Mostly Harmless Econometrics, which states that accounting for fixed effects also means block-centering $y$. I think I see your point about how this causes an issue with our approach: in the model-based setting, we incur no variance when estimating the block mean propensities, but there will be variance associated with estimating these block mean outcomes. Therefore, we can't just make a degrees-of-freedom reduction to our otherwise consistent variance estimate; the fixed-dimension argument we have been making would fail and consistency would be violated (correct me if I'm misreading).

I will work with this further to see if there are adjustments we can make to the variance estimate that don't involve estimating the block mean outcomes.

commented

I don't think the calculation needs a new or additional adjustment. My understanding of the argument you were making during our meeting is that it justifies the adjustment you made, but not in the usual M-estimation way of dealing with estimated parameters. Rather, it seems to me that you used block-mean centering as a transient device with which to get an unbiased estimate of within-block variance. If that's correct, then what remains to do is to explain how psi functions involving estimated block means and taking the usual M-estimation form give rise to legitimate B matrices and submatrices, i.e. random functions $\hat{B}(\alpha_0, \beta, \rho)$ satisfying $E[\hat{B}(\alpha_0, \beta, \rho)] = {B}(\alpha_0, \beta, \rho)$ that do not have block mean parameters $\alpha_1, \ldots$ among their arguments.

Yes, block-centering of the outcomes is justified by providing unbiased estimates when we make the correction coded up in that branch, although I want to check the scenario when stratum sizes aren't equal (say, a mix of pairs and triplets). It could be that the correct scaling factor is the stratum-specific $n_{i}/(n_{i}-1)$, but my intuition from Equation 10 on page 7 of the original Neyman-Scott paper leads me to believe it might be a factor calculated across strata, such as $n / (n-n_{strata})$. The latter reduces to the former in the case of equal stratum sizes, but the latter makes more sense to me generally given that the variance estimate is an average across all observations.
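The two candidate scaling factors can be checked with quick arithmetic (not propertee code): with equal stratum sizes they agree, and with a mix of pairs and triplets they diverge.

```r
# Equal stratum sizes m: n / (n - n_strata) = S*m / (S*m - S) = m / (m - 1),
# so the cross-strata factor reduces to the stratum-specific one.
m <- 2; S <- 10
n <- m * S
stopifnot(all.equal(n / (n - S), m / (m - 1)))

# Mixed sizes (5 pairs and 5 triplets): the two candidates differ.
sizes <- c(rep(2, 5), rep(3, 5))
n <- sum(sizes); S <- length(sizes)
n / (n - S)          # single cross-strata factor: 25/15
sizes / (sizes - 1)  # stratum-specific factors: 2 and 1.5
```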

Addressing a question you had further up the chain, though I haven't yet run simulations with covariance adjustment, my intuition would be that the residualized outcomes would need to be block-centered.

If that's correct, then what remains to do is to explain how psi functions involving estimated block means and taking the usual M-estimation form give rise to legitimate B matrices and submatrices, i.e. random functions $\hat{B}(\alpha_{0}, \beta, \rho)$ satisfying $E[\hat{B}(\alpha_{0}, \beta, \rho)]=B(\alpha_{0}, \beta, \rho)$ that do not have block mean parameters among their arguments.

As to this point, I will need to spend more time determining how we might do this.

commented

I think we can engineer stratumwise contributions to $\hat{B}(\alpha_{0}, \beta, \rho)$ that are stratumwise unbiased. I expect that this will call for stratum-specific scaling factors, contra Neyman and Scott. (To speculate, perhaps they were bending themselves to address audiences that would expect a single cross-stratum correction.) In fact, I'll wager lunch (if you win) against coffee (if I win) on $n_{b}/(n_{b}-1)$.

This reminds me that I've alluded on several occasions to the U statistic representation of covariance. Here is a post about the U statistic representation of the sample variance; I expect there are straightforward generalizations to covariance, to weighted variance and to weighted covariance, although I keep not getting around to working them out. The relevance to our current discussion is that these demonstrate that it's possible to represent the sample covariance without using the sample mean at all. In practice I think we should stick with the estimating equations we've got, modulo tuning of scaling factors; but in thinking about those random functions $\hat{B}(\alpha_{0}, \beta, \rho)$ that are not permitted access to block-specific intercepts $\alpha_1$, ..., $\alpha_S$, it may be helpful to know that our blockwise contributions to cross-products of estimating functions may appear to involve $\hat{\alpha}_1$, ..., $\hat{\alpha}_S$ but have algebraically equivalent expressions that do not.

^this is a very helpful comment! I'll get back here with findings on stratum-wise unbiased estimates and what the scaling factors should look like 😄

commented

Here's my derivation of a U statistic type representation for weighted sample variances.
scratch.pdf
Latex source, for good measure:

Let there be observations $x_{1}, x_{2}, \ldots, x_{n}$ and weights $w_{1}, \ldots, w_{n}$. Without loss of generality, $\sum_{i}w_{i}=1$. 
\begin{align*}
  \sum_{i\neq j}\frac{1}{2}w_{i}w_{j}(x_{i}-x_{j})^{2} &= \sum_{i\neq j}\frac{1}{2}(w_{i}x_{i}^{2}w_{j} + w_{j}x_{j}^{2}w_{i} - 2w_{i}x_{i}w_{j}x_{j})\\
                                                       &= \sum_{i}\frac{1}{2}w_{i}(1-w_{i})x_{i}^{2} + \sum_{j}\frac{1}{2}w_{j}(1-w_{j})x_{j}^{2}-\sum_{i\neq j}w_{i}x_{i}w_{j}x_{j}\\
   &= \sum_{i}w_{i}(1-w_{i})x_{i}^{2} -\sum_{i\neq j}w_{i}x_{i}w_{j}x_{j}.
\end{align*}
As $(\sum_{i}w_{i}x_{i})^{2}= \sum_{i}w_{i}^{2}x_{i}^{2} + \sum_{i\neq j}w_{i}x_{i}w_{j}x_{j}$, this gives
\begin{align*}
  \sum_{i\neq j}\frac{1}{2}w_{i}w_{j}(x_{i}-x_{j})^{2}  &= \sum_{i}w_{i}x_{i}^{2} - (\sum_{i}w_{i}x_{i})^{2} \\
  &= \sum_{i}w_{i}x_{i}^{2} -2 (\sum_{i}w_{i}x_{i})^{2} + (\sum_{i}w_{i}x_{i})^{2}\\
                                                        &= \sum_{i}w_{i}x_{i}^{2} -2 \sum_{i}w_{i}x_{i}(\sum_{j}w_{j}x_{j}) + \sum_{i}w_{i}(\sum_{j}w_{j}x_{j})^{2}\\
  &= \sum_{i}w_{i}[x_{i} - (\sum_{j}w_{j}x_{j})]^{2}\\
  &= \operatorname{Var}_{w}(x).
\end{align*}
Similarly, I think,
\begin{equation*}
  \sum_{i\neq j}\frac{1}{2}w_{i}w_{j}(x_{i}-x_{j})(y_{i} - y_{j}) = \sum_{i}w_{i}x_{i}y_{i} - (\sum_{i}w_{i}x_{i}) (\sum_{i}w_{i}y_{i}) = \operatorname{Cov}_{w}(x,y).
\end{equation*}
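As a sanity check on the derivation, the identities can be verified numerically with a brute-force double loop (variable names are mine):

```r
# Numerical check: the pairwise U-statistic forms equal the weighted
# variance and weighted covariance when the weights sum to 1.
set.seed(3)
n <- 7
x <- rnorm(n); y <- rnorm(n)
w <- runif(n); w <- w / sum(w)    # normalize so sum(w) == 1

u_var <- 0
for (i in seq_len(n)) for (j in seq_len(n)) if (i != j)
  u_var <- u_var + 0.5 * w[i] * w[j] * (x[i] - x[j])^2
var_w <- sum(w * (x - sum(w * x))^2)
all.equal(u_var, var_w)    # TRUE

u_cov <- 0
for (i in seq_len(n)) for (j in seq_len(n)) if (i != j)
  u_cov <- u_cov + 0.5 * w[i] * w[j] * (x[i] - x[j]) * (y[i] - y[j])
cov_w <- sum(w * x * y) - sum(w * x) * sum(w * y)
all.equal(u_cov, cov_w)    # TRUE
```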

If $i$ and $j$ are both in stratum $r$ and $\psi_{i}$ and $\psi_{j}$ both take the form in equation 9, I don't think $\alpha_{r}$ cancels out in the case that $a_{i}\neq a_{j}$. We would have the following:

[screenshot of the blockwise cross-product terms, 29 September 2023]

where the first non-zero term corresponds to the stratum indicator element of the estimating equation, and the other two correspond to the treatment assignment indicators $a_{i}$ and $a_{j}$, respectively. Random functions based on this U-statistic representation would then still be functions of $\alpha_{r}$, unless I'm mistaken here.

commented

Ack, you're quite right. 😖 But, good catch! 👏

commented

If the nuisance parameter doesn't admit of a neat elimination when calculating B matrices, we can still address it by specifically taking it into account. Looking into this a bit, it seems a little messy but probably quite feasible; also I think it will reveal implications for A matrices as well as Bs.

To adjust estimating equations for absorption of block effects into y's, we need to allow a limited dependence of observation's estimating equation contribution on other observations in the block. Using $\mathbf{z}_r$ to denote the matrix of y's, x's and a's for all observations in block $r$, write (NB: edited)
\begin{multline*}
\psi_{i}(\mathbf{z}_{r}; \alpha_{0}, \beta, \tau) = [\![i \in \bigcup \mathcal{Q}, b_{i}=r]\!] \tilde{w}_{i}\times \\
\left( \begin{array}{c}
{}[y_{i}[a_{i}] - g(\vec{x}_{i}\beta) - \tau_{1} - \alpha_{0} - \hat{\alpha}_{r}(\mathbf{z}_{r}, \alpha_{0}, \beta, \tau)][\![a_{i}=1]\!]\\
\vdots \\
{}[y_{i}[a_{i}] - g(\vec{x}_{i}\beta) - \tau_{k} - \alpha_{0} - \hat{\alpha}_{r}(\mathbf{z}_{r}, \alpha_{0}, \beta, \tau)][\![a_{i}=k]\!]
\end{array} \right)
\end{multline*}
with $\hat{\alpha}_{r}(\mathbf{z}_{r}, \alpha_{0}, \beta, \tau)$ defined implicitly by
\begin{equation*}
\sum_{i}[\![i \in \bigcup \mathcal{Q}, b_{i}=r]\!] \tilde{w}_{i} [y_{i}[a_{i}] - g(\vec{x}_{i}\beta) - \tau_{a_{i}} - \alpha_{0} - \hat{\alpha}_{r}(\mathbf{z}_{r}, \alpha_{0}, \beta, \tau)] = 0;
\end{equation*}
or, to absorb block effects before estimating treatment effects, define $\hat{\alpha}_{r}(\mathbf{z}_{r}, \alpha_{0}, \beta)$ implicitly by
\begin{equation*}
\sum_{i}[\![i \in \bigcup \mathcal{Q}, b_{i}=r]\!] \tilde{w}_{i} [y_{i}[a_{i}] - g(\vec{x}_{i}\beta) - \alpha_{0} - \hat{\alpha}_{r}(\mathbf{z}_{r}, \alpha_{0}, \beta)] = 0
\end{equation*}
and set
\begin{multline*}
\acute{\psi}_{i}(\mathbf{z}_{r}; \alpha_{0}, \beta, \tau) = [\![i \in \bigcup \mathcal{Q}, b_{i}=r]\!] \tilde{w}_{i}\times \\
\left( \begin{array}{c}
{}[y_{i}[a_{i}] - g(\vec{x}_{i}\beta) - \tau_{1} - \alpha_{0} - \hat{\alpha}_{r}(\mathbf{z}_{r}, \alpha_{0}, \beta)]([\![a_{i}=1]\!] - p_{1r})\\
\vdots \\
{}[y_{i}[a_{i}] - g(\vec{x}_{i}\beta) - \tau_{k} - \alpha_{0} - \hat{\alpha}_{r}(\mathbf{z}_{r}, \alpha_{0}, \beta)]([\![a_{i}=k]\!] - p_{kr})
\end{array} \right).
\end{multline*}
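Since each implicit equation is linear in $\hat{\alpha}_{r}$, it has a closed form: $\hat{\alpha}_{r}$ is simply the weighted mean of the block's residuals. A minimal sketch (all names illustrative):

```r
# The estimating equation sum_i w_i * (r_i - alpha_r) = 0, with r_i standing
# for the residual y_i - g(x_i beta) - tau_{a_i} - alpha_0, is linear in
# alpha_r, so alpha_r-hat is the weighted block mean of the residuals.
set.seed(4)
w <- runif(5)                  # block weights (illustrative)
r <- rnorm(5)                  # block residuals (illustrative)
alpha_hat <- sum(w * r) / sum(w)
sum(w * (r - alpha_hat))       # zeroes the estimating equation, up to rounding
```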

The non-independence of $\psi_i$ and $\psi_j$ that's induced by $\hat{\alpha}_r$ is sufficiently narrow and specific that it still seems quite feasible to compute gradients (& therefore $A_{21}$ and $A_{22}$ estimates) and covariances (& thus $B_{21}$ and $B_{22}$ estimates).

commented

How are standard errors being calculated by the Stata command we've borrowed the concept of an absorb option from, @josherrickson? @jwasserman2 and I could use some hints on figuring out what existing accepted practices are, with what justifications. I went looking and was led to the "Stata commands for one-level fixed effects model" section of the 2012 survey by Dan McCaffrey and coauthors. But at the time that survey was written, it seems, neither xtreg nor areg offered standard errors of any type [p.412].

  1. I assume that the commands in question do now offer a standard error?
  2. Do they offer a clustered standard error option, or any other route to sandwich SEs?
  3. Do the help pages offer any references that might explain their SE calculations, or other hints as to what d.f. or other normalizing corrections they may be including?
  4. Is inspection of source code an option for resolving ambiguities, should we be left with some? Or is that not a thing in Stata?
commented

Couple notes, following some hints @josherrickson shared elsewhere (thanks Josh):

  • Per the Stata manual's areg documentation, it's the xtreg, fe command that's appropriate when the # of blocks increases with the sample size.
  • The xtreg page itself doesn't elaborate on this point, but discusses multiple options for SEs. The default seems to be a stock OLS-type SE.
  • Multiple sandwich calculations are offered, and are invoked when clustering is declared. The _robust documentation suggests that these give an ordinary cross-product meat matrix with df = (# of clusters) - (# of strata). There are also HC2-type calculations.
commented

A few more questions, the answers to which may call for a look at Stata source code:

  1. In xtreg, fe with absorption and default (classical OLS) SEs, what's the d.f. used for estimating $\sigma^2$ -- n, n-1, n minus the number of levels of the fixed effect variable, or something else?
  2. I think it allows for a clustering variable that's nested within the fixed effect variable. If this isn't mistaken, then I have the same question about d.f. being used when there's a clustering variable: is it the number of clusters, the number of clusters minus 1 or the number of clusters minus the number of levels of the fixed effects variable that was absorbed? (The piece of the _robust documentation I mentioned in my last comment, about robust SEs in sampling calculations, suggested to me that it's the last of these. But the use there of the sampling-specific term "strata" introduces some ambiguity.)

Without clusters, it looks like xtreg, fe uses the regular OLS df of $n - (p + 1)$ where $p$ is the $p$ that would be obtained from regress with dummies.

For the clustering one, I need to dig into it more. I'll get back to you.

commented

In case it's helpful, my remaining question (no. 2 above) is really a question about whether what's written on the linked page (p. 14) of Stata's p_robust.pdf/"Robust variance estimation" can be taken to describe xtreg, fe with clusters and absorption, and if so, whether the absorbed fixed effects contribute to d.f. in the usual way. Since you've confirmed that without clusters, absorbed fixed effects do contribute to d.f. in the usual way under xtreg, fe, it seems likely that the answer is yes, particularly as the robust variance estimation documentation doesn't draw distinctions between xtreg, fe and anything else.

Sorry, thanks for flagging this.

The main code is:

	if "`clopt'" != "" {
		local dfe = `df_r' - `df_a'
		ereturn scalar sigma_e	= sqrt(e(rss)/`dfe')
	}
	else {
		ereturn scalar sigma_e	= sqrt(e(rss)/e(df_r))
	}

clopt contains a non-empty string if there are clusters and/or robust SE's are requested; so the if block runs. df_r is the residual DF extracted from a call to regress:

	quietly _regress `y' `XVARS' [`weight'`exp'] if `touse'
	local df_r = e(df_r)

(e(df_r) is kinda the Stata equivalent of R's df.residual(mod).)

What exactly df_a is, is a bit trickier to nail down. _regress is a Stata internal function (kinda like R's primitives, e.g. sum), so I can't view its source. I ended up messing with xtreg, fe's code to print out df_a when first calculated, and in every example I've thrown at it, df_a is the number of clusters, minus 1.

commented

Thanks! Could it be that in the clustered case,

a. The residual d.f. encoded in df_r comes out of a regression of elements that have been summed up to the cluster level, so that it's something like no. clusters minus no. regressors -1; while
b. df_a gives back the number of fixed effects that have been absorbed (perhaps minus 1)?

This stretches what you're reporting to try to get it to line up with p.14 of p_robust.pdf. (a) aligns e(df_r) to $n_c-1$ on that page, whereas (b) pegs df_a to $L$, the number of strata.

Ben made note of a paper demonstrating that treatment effect estimates with fixed effects absorbed or included as dummies coincide with treatment effect estimates that both absorb fixed effects and model random effects. I've attached it to this comment:
raudenbush09, adaptive centering w random effects.pdf

commented

@josherrickson reports, after some tricky Stata code analysis:

df_r = #clusters - 1
df_a = #absorbed - 1