benbhansen-stats / propertee

Prognostic Regression Offsets with Propagation of ERrors, for Treatment Effect Estimation (IES R305D210029).

Home Page: https://benbhansen-stats.github.io/propertee/


subgroup-specific d.f. when subgroup estimates are requested

benthestatistician opened this issue · comments

commented

When lmitt(y ~ m) is requested for a categorical variable m, the numbers of observations that contribute to the estimates may differ considerably between subgroups. For each subgroup separately, vcovDA() should be given, or positioned to calculate, the number of clusters containing one or more members of the subgroup. With this information,

  1. If a subgroup-specific HC0 calculation returns a spurious 0 or near-0 due to too few clusters contributing to that subgroup's estimate, we can replace the spurious 0's with NA.
  2. We can provide HC1 calculations that better correct for d.f.
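As a rough sketch of the bookkeeping items 1 and 2 would require (hypothetical helper names, not propertee's API):

```r
# Hypothetical sketch, not propertee's API: count the clusters that
# contribute at least one observation to each level of a moderator m,
# then form subgroup-specific HC1-style d.f. factors g/(g - 1).
subgroup_cluster_counts <- function(m, cluster) {
  rowSums(table(m, cluster) > 0)
}

dat <- data.frame(m = rep(c("a", "b"), c(6, 2)),
                  cluster = c(1, 1, 2, 2, 3, 3, 4, 4))
g <- subgroup_cluster_counts(dat$m, dat$cluster)
# With only one contributing cluster, the correction is undefined: NA, not 0
hc1_factor <- ifelse(g > 1, g / (g - 1), NA_real_)
```

Here subgroup "a" draws on three clusters (factor 1.5), while subgroup "b" draws on one, so its factor is NA rather than a spurious finite value.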
commented

See this comment and preceding discussion in #116 thread.

  • @josherrickson would you be willing to assemble examples from that thread into one or more tests of the NA-ing behavior, tests that can be commented out for the time being but that we can have a pointer to in this thread? Also add informative messages if you like. At your convenience.
  • @jwasserman2 would you be willing to sketch out a potential solution within this thread? It seems to me that the fact that we permit agglomeration of u_o_a's into larger clusters during the vcovDA calculations puts some constraints on what sort of info can be stored and calculated where, and that you're probably the best positioned to lead on this point. At the same time, we've been discussing your interest in writing a software-oriented paper, and it strikes me that #2 above is the sort of thing that could play nicely1 in such a paper. At your convenience.

Happy to have responsibilities divided differently, of course.

Footnotes

  1. I don't know whether other packages execute d.f. calculations separately for separate subgroups, but it would be interesting to figure out for a lit review, and it's the sort of offering that could make software and an attending paper more interesting to readers.

Should the NA'ing out of variances happen in vcovDA? Or summary.DA? If in vcovDA, does this issue affect covariances as well?

commented

In vcovDA(), I'd think; and yes, the corresponding covariances should be NA'd out as well.

Or perhaps NaN'd:

vcov(lm(y~x, data=data.frame(y=1:2, x=0:1)))
            (Intercept)   x
(Intercept)         NaN NaN
x                   NaN NaN
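Those NaNs trace to a saturated fit: with two observations and two coefficients there are zero residual degrees of freedom, so the residual variance is 0/0.

```r
# Saturated fit: residual d.f. is 0, so the residual variance is 0/0 = NaN
fit <- lm(y ~ x, data = data.frame(y = 1:2, x = 0:1))
df.residual(fit)      # 0: two observations, two coefficients
summary(fit)$sigma    # NaN, the square root of 0/0
```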

In the previous issue you mentioned that we shouldn't guarantee that both lmitt.formula and lmitt.lm produce the same vcov matrix. Should both paths produce these NaNs?

I added tests. Per my last question, I added them for both lmitt.formula and lmitt.lm but can easily remove one if needed.

commented

Thanks @josherrickson. I agree that both paths should create NaNs, or NAs.

  • The case for this may be a bit stronger on the lmitt.formula() path than the lmitt.lm one, given that we've got more control of the former; it's not necessarily worth elaborate code contortions within the lmitt.lm path.
  • Also not sure whether it should be NaN or NA. NaN is what base::vcov() and sandwich::vcovHC() would report, but in their cases it may be because of a divide by 0. So it's ambiguous whether that's what the programmer meant to occur. Perhaps we would be signaling our intentions more clearly by populating those slots with NA_real_s.
  • For the record I see at least one other reference function sharing the current arguably buggy behavior:
sandwich::sandwich(lm(y~x, data=data.frame(y=1:2, x=0:1)))
##'            (Intercept) x
##' (Intercept)           0 0
##' x                     0 0
  • The case for this may be a bit stronger on the lmitt.formula() path than the lmitt.lm one, given that we've got more control of the former; it's not necessarily worth elaborate code contortions within the lmitt.lm path.

I wrote logic to check for degrees of freedom based on whether the absorbed_moderators slot is populated, which isn't on the lmitt.lm() path. Since we choose not to try to discern which variables are treatment and moderator variables in a user's lm, coding these degrees-of-freedom checks for lmitt.lm() would, I think, require considerable effort.

  • Also not sure whether it should be NaN or NA. NaN is what base::vcov() and sandwich::vcovHC() would report, but in their cases it may be because of a divide by 0. So it's ambiguous whether that's what the programmer meant to occur. Perhaps we would be signaling our intentions more clearly by populating those slots with NA_real_s.

The warning message I've written makes it clear what the issue is, but since we're choosing not to return a value, I think NA_real_ might be more appropriate.
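One practical difference worth noting: is.na() is TRUE for both values, but is.nan() singles out NaN, so downstream code can tell a deliberate NA_real_ apart from an accidental 0/0.

```r
# NA_real_ and NaN both count as missing, but only NaN is "not a number"
x <- c(NaN, NA_real_)
is.na(x)   # TRUE TRUE  -- both are treated as missing
is.nan(x)  # TRUE FALSE -- only the 0/0-style value is NaN
```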

This package, with the accompanying paper (Imbens and Kolesar, 2016), produces adjusted standard errors based on the standard error estimates proposed in Bell and McCaffrey (2002). Calculations for those standard errors are here.

I realize the functionality in the package I linked in the last comment isn't necessary--we're simply looking to prevent technically invalid errors, which my update in the linked PR should do. We can discuss exactly what my update is at our next meeting.

commented
  • Good to know you're thinking about connections to that Imbens and Kolesar work, @jwasserman2, even if it doesn't turn out to bear on the objective of this issue.
  • Noting that at present .check_df_moderator_estimates() uses NaN, not NA_real_. Which is a fine place to leave things, to be sure; but, to clarify, I do share your mild preference for NA_real_.

The test suite I wrote at the opening of this issue still has the NA'ing built in so that its tests pass - but when I remove that, the tests fail. Things that are really close to 0 aren't being NA'd out:

> vc
                     o_fac1:assigned(des) o_fac2:assigned(des) o_fac4:assigned(des)
o_fac1:assigned(des)         2.017051e-16         4.944069e-17         3.973054e-17
o_fac2:assigned(des)        -1.341016e-17         8.933167e-02         2.750844e-18
o_fac4:assigned(des)         4.208976e-17         2.750844e-18         3.994443e-02

from

vc <- vcov(damod)[5:7, 5:7]

Per the opening of the issue, these should be NA'd (or NaN'd). Does this imply the implementation in #127 didn't catch this case, or did the goal move enough that it's not the case anymore?
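A hypothetical post-processing sketch of the intended behavior (not propertee's actual internals): treat diagonal entries below a numerical tolerance as degenerate, and NA out their rows and columns.

```r
# Hypothetical sketch, not propertee's internals: NA out rows/columns of a
# vcov matrix whose diagonal entry is numerically zero (e.g. ~2e-16).
na_degenerate <- function(vc, tol = .Machine$double.eps^0.5) {
  degenerate <- diag(vc) < tol   # flag near-zero "variances" first
  vc[degenerate, ] <- NA_real_   # NA the variance row ...
  vc[, degenerate] <- NA_real_   # ... and the matching covariances
  vc
}

vc <- matrix(c(2e-16, -1.3e-17, 4.9e-17, 8.9e-02), nrow = 2)
na_degenerate(vc)
```

Whether a check like this belongs in vcovDA() itself or downstream is exactly the open question above; the sketch only illustrates the tolerance idea.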

(@benthestatistician I'd be curious what your windows machine returned after running lines 2004-2007, then 2020-2021 from the above file. I assume you're seeing NaN's per #136?)

It seems like it has something to do with updating the lmitt functionality. Why is the standard error for the intercept in cf here NaN'd? The test isn't expecting that. "z._o_fac1" is correctly NA'd, though, and the appropriate warning message is provided in the summary(damod) call.

It could also be due to updates I've made to estfun. I'll report back after I dive into it a bit more.

commented

Couple notes...

  1. We'll only roll our own subgroup-specific d.f.'s for the lmitt.formula() path, not for the lm() |> lmitt.lm() path. (Implicit in the above, saying it again here for ease of reference. 89179e6 adds a similar note to inline comments.)
  2. The Imbens-Kolesar (2016) paper and accompanying dfadjust package calculate d.f. in terms of the spectrum of certain matrices, rather than by counting observations or clusters. Is that calculation readily adaptable to presenting subgroup-specific d.f.? If so, that would be appealing to offer as a software feature. (Note that the IK calculation is adapted to HC2/CR2, so they wouldn't make sense for us to grab unless/until that paradigm becomes our default.)
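For reference, the spectral d.f. in that line of work is Satterthwaite-style: if lambda are the eigenvalues of the relevant contrast-specific matrix, df = (sum lambda)^2 / sum(lambda^2). A toy computation with made-up eigenvalues, just to show the formula:

```r
# Satterthwaite-style d.f. as used by Bell-McCaffrey/Imbens-Kolesar:
# the eigenvalues here are illustrative only, not from any real model
lambda <- c(0.5, 0.3, 0.2)
df_bm <- sum(lambda)^2 / sum(lambda^2)
df_bm   # about 2.63
```

Because the formula works off whatever matrix is attached to a given contrast, it seems plausible it could be evaluated per subgroup, which is the adaptability question above.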