jknowles / merTools

Convenience functions for working with merMod objects from lme4

marginalize over RE

bachlaw opened this issue

Jared,

Great work as usual. Given your response to the other issue, I take it that offering the re.form argument from predict.merMod just isn't workable at this time, to marginalize over random effects?

thanks,
Jonathan

Currently I don't think that is possible. However, I have been thinking about rewriting the REimpact function, which does this in a different but related way, to allow this.

Can you describe your use case in a bit more detail? Are you looking to get a prediction interval that averages across all the levels of the random effects but at a fixed set of values for the non-random effects? Is it most common that you would want to do this to an entire data frame at once and return (lower, fit, upper) for each row?

This might be a relatively easy fix and would solve a nagging problem I have about why REimpact does not seem to be as useful as I imagined.

Sure Jared.

A baseball example: say I want to know the likely contributions of a batter, pitcher, and catcher to a pitch getting called a strike. I set up three random-effect groups, one for the players at each position, and run a glmer. I then want to make predictions that leave out one of those groups, so I can get a sense of how an average performer in that category would affect the result and, by extension, the effect of the individual people in that category. So if it were batters I wanted to leave out, it would be predict.merMod with re.form = ~(1|pitcher) + (1|catcher). It amounts to a glorified coefficient extraction, but particularly when a transformation needs to be applied, it is very handy.

You could do the same thing in an educational setting with groups of, say, (1|teacher), (1|school), and (1|district). Let's say you wanted to make predictions using school and district only, to see what effect an average teacher would have versus the normal predictions with the teachers present. You could then easily compare the grouped results with and without the teachers to estimate the mean effect of individual teachers on the outcome, and see whether teacher variation is driving the quality of outcomes. (I think Gelman uses an example like this in his multilevel modeling book.)
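A rough sketch of this idea, using a built-in lme4 dataset with two crossed grouping factors standing in for the teacher/school groups described above (the mapping to teachers and schools is purely illustrative):

```r
library(lme4)

# Penicillin has two crossed grouping factors, plate and sample,
# standing in here for e.g. teacher and school
fm <- lmer(diameter ~ 1 + (1 | plate) + (1 | sample), data = Penicillin)

# Full predictions, conditioning on both sets of random effects
p_full <- predict(fm)

# Predictions using only the sample effects; plate is set to its
# mean (zero), analogous to predicting with an "average teacher"
p_avg <- predict(fm, re.form = ~ (1 | sample))

# The gap attributable to the plate-level effects
plate_contrib <- p_full - p_avg
```

Grouping `plate_contrib` by `plate` then recovers each plate's estimated contribution on the scale of the outcome.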

Does this make sense?

Thanks,
Jonathan

Thanks @bachlaw

That does make sense.

Can you tell me what you would expect it to look like in the case where you had a slope and an intercept at the random level, e.g. (1 + class_size | teacher) + (1 | district) + (1 + mean_test | school)? Would you want the ability to ignore the slope term (class_size)? What about the slopes for school when you are averaging over teachers?

After that, I think I can probably work something out and put this together relatively quickly...

@carlbfrederick Do you have an idea of how to solve this? I think I have seen about a dozen questions along these lines in the past few weeks, but I'm not sure how this is different from the output we provide for REsim.

I am going to code up an example in the branch margins and see what I can see...

@jknowles Hmmm...

Just trying to process here. Conceptually, the problem is trivial, no? The average teacher has a value of 0 for all random coefficients by definition. So, I think it is similar to REimpact (as much as I can remember it). That is how I would initially answer the question in the second paragraph of @bachlaw's most recent comment above: e.g., compare moving from the average teacher (or results "marginalized over" the teacher random effects) to a teacher at a given percentile of the distribution.

As a coding/usability issue, I am not sure whether the question is a new feature or just better or more sophisticated packaging of, and interaction with, REimpact. I will think about this over the next few days and check out the margins branch to see if any thoughts come to mind.

Put some links or comments with the other similar questions in some document in the branch too. This might help me see some of the commonality.

This came in an e-mail:

OK, let me describe my situation with a more concrete (but fake) example. Suppose that I have a dataset of SAT scores sampled from 24 high schools across the 6 states in the Mid-Atlantic region. I would like to show two things: 1) the average SAT score for each of the 6 states while holding average family income at the overall mean, and 2) the effect of family income within each state.

library('lme4')
mydata <- Penicillin
set.seed(10)
mydata$income <- rnorm(144)
mydata$school <- Penicillin$plate
mydata$state <- Penicillin$sample
mydata$SAT <- Penicillin$diameter

I can analyze the data “mydata” for each state separately. For example, for the first state:

m1 <- lm(SAT ~ income, data = mydata[mydata$state == 'A',])

and the following results

summary(m1)$coefficients

               Estimate Std. Error     t value     Pr(>|t|)
(Intercept) 25.15231111  0.2171386 115.8352824 3.805796e-32
income      -0.03788689  0.2176436  -0.1740776 8.633964e-01

would allow me to achieve the two goals 1) and 2) above, and plot the results similar to Figure 6 in the Gelman et al. paper.

However, we need to perform multiple comparison correction since the above linear regression analysis is done 6 times (for the 6 states). The Gelman et al. paper argues that multiple comparison correction can be better handled through one (instead of 6) analysis with a multilevel model:

fm <- lmer(SAT ~ income + (income|school) + (1|state), data=mydata)

and then we can somehow obtain the two effects and their standard errors in 1) and 2) for each state as shown in Figure 7 of Gelman et al. It’s not clear to me how they exactly pulled the posterior data for their Figure 7, but the algebra is shown on pages 196-197.
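One way to pull per-state estimates from a fit like this (a sketch only, not necessarily how Gelman et al. constructed their figure) is via coef(), which combines the fixed effects with each state's random intercept:

```r
library(lme4)

# Rebuild the fake SAT data from the example above
mydata <- Penicillin
set.seed(10)
mydata$income <- rnorm(144)
mydata$school <- Penicillin$plate
mydata$state <- Penicillin$sample
mydata$SAT <- Penicillin$diameter

fm <- lmer(SAT ~ income + (income | school) + (1 | state), data = mydata)

# One row per state: the fixed intercept shifted by that state's
# random intercept, plus the (fixed) income slope
state_coefs <- coef(fm)$state
state_coefs
```

Note this gives point estimates only; the standard errors asked about above are the harder part, which is where simulation-based intervals come in.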

There is also this fumbling attempt of mine on SO to answer a similar (I think) question:

http://stackoverflow.com/questions/35947309/predict-with-new-random-effects/42906945#42906945

@carlbfrederick You know REimpact does seem to have the kernel of the truth we are looking for.

I think I see the issue now... people seem to be asking for something along the lines of an aggregated predictInterval function. E.g., what is the average prediction, on the scale of the outcome variable, for some value of the fixed covariates for each school/state/etc.? This would allow one to look purely at the differences between schools for some set of schools.

Does that capture what you think the issue is?

If so the issues to address include:

  • Choosing appropriate fixed values: easily addressed by manipulating newdata.
  • Method of averaging/marginalizing: two ways to approach it, via bootstrapping/simulation (easier, but more computationally expensive) or algebraically (more difficult, especially to generalize, but more finesse).
  • How to accommodate models beyond 2 levels: the biggest potential pain.

What do you think?

Yes. I think that is correct. What confuses me is what values are they looking for in the effects they are not interested in? Do they truly want to sample across them, or just set them to zero?

A few people have referenced the lme4 method of allowing re.form=~0, etc. in the predict function...

I think we could get at something like that as the interface, and underneath do what you describe.
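The lme4 interface referenced above looks like this (using a built-in dataset for illustration):

```r
library(lme4)

fm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# Conditional predictions, using each subject's fitted random effects
p_cond <- predict(fm)

# Population-level predictions: all random effects set to zero,
# i.e. the prediction for an "average" subject
# (re.form = NA is equivalent)
p_pop <- predict(fm, re.form = ~0)
```

The open question is exactly this: predict gives only the point prediction with the random effects zeroed out, whereas the request here is for an interval that reflects the uncertainty from averaging over them.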

Bingo @bachlaw -- I think I get it now. Not sure what my block was, but this is something we can implement for sure.

@carlbfrederick -- I would have time to work on this later this week, but happy to let you take a crack at it first if you have availability/interest, otherwise I can ping you when I have something worth reviewing

@jknowles I have some free time too, but don't wait on me. I will work on my own branch and let you know if I stumble onto something promising.

@jknowles I have yet to get to this (work has a way of piling up and I am foolishly trying to add some automatic reporting to DEWS). I just wanted to check in and see if you were making any progress.

No, unfortunately @carlbfrederick I got busy right when I had started to sort this out. Looking like the sky is going to clear up here in mid-May -- if you haven't tackled it by then, I will certainly be able to dive in.

Hey @carlbfrederick did you ever tackle this? I have a bunch of imputation related changes I am pushing to the project, and tidying up some other issues. I'd like to shoot for a new release in time for the fall semester (when many HLM students will discover our package fresh) :-)

What do you think?

@jknowles I have some time coming up in the next few weeks. I don't know how your schedule will work out, but if you can wait until after then to get started, I will let you know after that first week in August to see if I have been able to work it in. What do you think?

That sounds good to me @carlbfrederick -- I am moving and finishing up some other stuff the end of July, so pushing out in late August / early September would be ideal for release time.

@jknowles I spent some time today digging back into predictInterval() and looking at how predict.merMod(..., re.form = ~0) works. We already do something similar with the which = "all" argument, no? That provides a prediction for each grouping variable individually, for all grouping variables together, and for no grouping variables.

I don't know that we need to change anything within predictInterval(); we could simply adjust REimpact to take a new argument about which grouping variables to set to zero (or not), and then combine the various results from a call to predictInterval(..., which = "all").

Does that make sense?

Or we could always tweak the which argument of predictInterval to allow users to specify multiple RE terms and that would add value to both.
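For reference, the existing which = "all" behavior looks roughly like this (a sketch; the exact output columns may differ across merTools versions):

```r
library(lme4)
library(merTools)

fm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

set.seed(101)
# which = "all" returns simulated intervals for the combined prediction
# plus each component: the fixed effects and each grouping factor
pi_all <- predictInterval(fm, newdata = sleepstudy[1:5, ],
                          which = "all", n.sims = 200)
head(pi_all)
```

Filtering this output by effect and aggregating is essentially the combining step described above.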

Yes @carlbfrederick your approach is what I was thinking we need to do. We also need to then document this feature in the predictInterval() documentation so users know how to use that feature.

Any chance you can take a stab at it? No problem if you can't -- just planning my work the next few weeks!

It sounds like you guys are actively working on implementing the functionality I'm looking for with predictInterval(). Please let me know if and when it is available for use. Thanks kindly.

Casey Justice
jusc@critfc.org

I've got a draft idea of this in the inst subfolder; it is almost working. Performance is not optimal and I am not sure what the final results should look like, but the REmargins function should get wrapped up sometime soon to make this functionality more easily available.

This was fixed with the most recent release; there is now an REmargins function that tackles this. Thanks for your reports @CJustice77 and @bachlaw