jknowles / merTools

Convenience functions for working with merMod objects from lme4

marginalize over RE

bachlaw opened this issue

Jared,

Great work as usual. Given your response to the other issue, I take it that offering the re.form argument from predict.merMod just isn't workable at this time, to marginalize over random effects?

thanks,
Jonathan

Currently I don't think that is possible. However, I have been thinking about rewriting the REimpact function, which does this in a different but related way, to allow this.

Can you describe your use case in a bit more detail? Are you looking to get a prediction interval that averages across all the levels of the random effects but at a fixed set of values for the non-random effects? Is it most common that you would want to do this to an entire data frame at once and return (lower, fit, upper) for each row?

This might be a relatively easy fix and would solve a nagging problem I have about why REimpact does not seem to be as useful as I imagined.

Sure Jared.

A baseball example: say I want to know the likely contributions of a batter, pitcher, and catcher to a pitch getting called a strike. I set up three random-effect groups, one for the players at each position, and run a glmer. I then want to make predictions that leave out one of those groups, so I can get a sense of how an average performer in that category would affect the result and, by extension, the effect of the individual people in that category. So if it were batters I wanted to leave out, it would be predict.merMod with re.form = ~(1|pitcher) + (1|catcher). It amounts to a glorified coefficient extraction, but particularly when a transformation needs to be applied, it is very handy.

You could do the same thing in an educational setting with groups of, say, (1|teacher), (1|school), and (1|district). Let's say you wanted to make predictions using school and district only, to see what effect an average teacher would have versus the normal predictions with the teachers present. You could then easily compare the grouped results with and without the teachers to estimate the mean effect of individual teachers on the outcome, and see whether teacher variation is driving the quality of outcomes. (I think Gelman uses an example like this in his multilevel modeling book.)
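A rough sketch of this idea, using a built-in lme4 dataset with two crossed grouping factors standing in for the teacher/school groups described above (the mapping to teachers and schools is purely illustrative):

```r
library(lme4)

# Penicillin has two crossed grouping factors, plate and sample,
# standing in here for e.g. teacher and school
fm <- lmer(diameter ~ 1 + (1 | plate) + (1 | sample), data = Penicillin)

# Full predictions, conditioning on both sets of random effects
p_full <- predict(fm)

# Predictions using only the sample effects; plate is set to its
# mean (zero), analogous to predicting with an "average teacher"
p_avg <- predict(fm, re.form = ~ (1 | sample))

# The gap attributable to the plate-level effects
plate_contrib <- p_full - p_avg
```

Grouping `plate_contrib` by `plate` then recovers each plate's estimated contribution on the scale of the outcome.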

Does this make sense?

Thanks,
Jonathan

Thanks @bachlaw

That does make sense.

Can you tell me what you would expect it to look like in the case where you had a slope and an intercept at the random level, e.g. (1 + class_size | teacher) + (1 | district) + (1 + mean_test | school)? Would you want the ability to ignore the slope term (class_size)? What about the slopes for school when you are averaging over teachers?

After that, I think I can probably work something out and put this together relatively quickly...

@carlbfrederick Do you have an idea of how to solve this? I think I have seen about a dozen questions along these lines in the past few weeks, but I'm not sure how this is different from the output we provide for REsim.

I am going to code up an example in the branch margins and see what I can see...

@jknowles Hmmm...

Just trying to process here. Conceptually, the problem is trivial, no? The average teacher has a value of 0 for all random coefficients by definition. So, I think it is similar to REimpact (as much as I can remember it). That is how I would initially answer the question in the second paragraph of @bachlaw's most recent comment above: e.g., compare moving from the average teacher (or results "marginalized over" the teacher random effects) to a teacher at a given percentile of the distribution.

As a coding/usability issue, I am not sure whether the question is a new feature or just better or more sophisticated packaging of, and interaction with, REimpact. I will think about this over the next few days and check out the margins branch to see if any thoughts come to mind.

Put some links or comments with the other similar questions in some document in the branch too. This might help me see some of the commonality.

This came in an e-mail:

OK, let me describe my situation with a more concrete (but fake) example. Suppose that I have a dataset of SAT scores sampled from 24 high schools across the 6 states in the Mid-Atlantic region. I would like to show two things: 1) the average SAT score for each of the 6 states while holding average family income at the overall mean, and 2) the effect of family income within each state.

library('lme4')
mydata <- Penicillin
set.seed(10)
mydata$income <- rnorm(144)
mydata$school <- Penicillin$plate
mydata$state <- Penicillin$sample
mydata$SAT <- Penicillin$diameter

I can analyze the data “mydata” for each state separately. For example, for the first state:

m1 <- lm(SAT ~ income, data = mydata[mydata$state == 'A',])

and the following results

summary(m1)$coefficients

               Estimate Std. Error     t value     Pr(>|t|)
(Intercept) 25.15231111  0.2171386 115.8352824 3.805796e-32
income      -0.03788689  0.2176436  -0.1740776 8.633964e-01

would allow me to achieve the two goals 1) and 2) above, and plot the results similar to Figure 6 in the Gelman et al. paper.

However, we need to perform multiple comparison correction since the above linear regression analysis is done 6 times (for the 6 states). The Gelman et al. paper argues that multiple comparison correction can be better handled through one (instead of 6) analysis with a multilevel model:

fm <- lmer(SAT ~ income + (income|school) + (1|state), data=mydata)

and then we can somehow obtain the two effects and their standard errors in 1) and 2) for each state as shown in Figure 7 of Gelman et al. It’s not clear to me how they exactly pulled the posterior data for their Figure 7, but the algebra is shown on pages 196-197.
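One way to pull per-state estimates from a fit like this (a sketch only, not necessarily how Gelman et al. constructed their figure) is via coef(), which combines the fixed effects with each state's random intercept:

```r
library(lme4)

# Rebuild the fake SAT data from the example above
mydata <- Penicillin
set.seed(10)
mydata$income <- rnorm(144)
mydata$school <- Penicillin$plate
mydata$state <- Penicillin$sample
mydata$SAT <- Penicillin$diameter

fm <- lmer(SAT ~ income + (income | school) + (1 | state), data = mydata)

# One row per state: the fixed intercept shifted by that state's
# random intercept, plus the (fixed) income slope
state_coefs <- coef(fm)$state
state_coefs
```

Note this gives point estimates only; the standard errors asked about above are the harder part, which is where simulation-based intervals come in.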

There is also this fumbling attempt of mine on SO to answer a similar (I think) question:

http://stackoverflow.com/questions/35947309/predict-with-new-random-effects/42906945#42906945

@carlbfrederick You know REimpact does seem to have the kernel of the truth we are looking for.

I think I see the issue now... people seem to be asking for something along the lines of an aggregated predictInterval function. E.g., what is the average prediction, on the scale of the outcome variable, for some value of the fixed covariates for each school/state/etc.? This would allow one to look purely at the differences between schools for some set of schools.

Does that capture what you think the issue is?

If so the issues to address include:

  • Choosing appropriate fixed values: easily addressed by manipulating newdata.
  • Method of averaging/marginalizing: two ways to approach it, via bootstrapping/simulation (easier, but more computationally expensive) or algebraically (more difficult, especially to generalize, but more finesse).
  • How to accommodate models beyond 2 levels: the biggest potential pain.

What do you think?

Yes. I think that is correct. What confuses me is what values are they looking for in the effects they are not interested in? Do they truly want to sample across them, or just set them to zero?

A few people have referenced the lme4 method of allowing re.form=~0, etc. in the predict function...

I think we could get at something like that as the interface, and underneath do what you describe.
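The lme4 interface referenced above looks like this (using a built-in dataset for illustration):

```r
library(lme4)

fm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# Conditional predictions, using each subject's fitted random effects
p_cond <- predict(fm)

# Population-level predictions: all random effects set to zero,
# i.e. the prediction for an "average" subject
# (re.form = NA is equivalent)
p_pop <- predict(fm, re.form = ~0)
```

The open question is exactly this: predict gives only the point prediction with the random effects zeroed out, whereas the request here is for an interval that reflects the uncertainty from averaging over them.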

Bingo @bachlaw -- I think I get it now. Not sure what my block was, but this is something we can implement for sure.

@carlbfrederick -- I would have time to work on this later this week, but happy to let you take a crack at it first if you have availability/interest, otherwise I can ping you when I have something worth reviewing

@jknowles I have some free time too, but don't wait on me. I will work on my own branch and let you know if I stumble onto something promising.

@jknowles I have yet to get to this (work has a way of piling up and I am foolishly trying to add some automatic reporting to DEWS). I just wanted to check in and see if you were making any progress.

No, unfortunately @carlbfrederick I got busy right when I had started to sort this out. Looking like the sky is going to clear up here in mid-May -- if you haven't tackled it by then, I will certainly be able to dive in.

Hey @carlbfrederick did you ever tackle this? I have a bunch of imputation related changes I am pushing to the project, and tidying up some other issues. I'd like to shoot for a new release in time for the fall semester (when many HLM students will discover our package fresh) :-)

What do you think?

@jknowles I have some time coming up in the next few weeks. I don't know how your schedule will work out, but if you can wait until after then to get started, I will let you know after that first week in August to see if I have been able to work it in. What do you think?

That sounds good to me @carlbfrederick -- I am moving and finishing up some other stuff the end of July, so pushing out in late August / early September would be ideal for release time.

@jknowles I spent some time today digging back into predictInterval() and looking at how predict.merMod(..., re.form = ~0) works. We already do something similar with the which = "all" argument, no? That provides a prediction for each grouping variable individually, for all grouping variables together, and for no grouping variables.

I don't know that we need to change anything within predictInterval(); we could simply adjust REimpact to take a new argument about which grouping variables to set to zero (or not), and then combine the various results from a call to predictInterval(..., which = "all").

Does that make sense?

Or we could always tweak the which argument of predictInterval to allow users to specify multiple RE terms and that would add value to both.
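For reference, the existing which = "all" behavior looks roughly like this (a sketch; the exact output columns may differ across merTools versions):

```r
library(lme4)
library(merTools)

fm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

set.seed(101)
# which = "all" returns simulated intervals for the combined prediction
# plus each component: the fixed effects and each grouping factor
pi_all <- predictInterval(fm, newdata = sleepstudy[1:5, ],
                          which = "all", n.sims = 200)
head(pi_all)
```

Filtering this output by effect and aggregating is essentially the combining step described above.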

Yes @carlbfrederick your approach is what I was thinking we need to do. We also need to then document this feature in the predictInterval() documentation so users know how to use that feature.

Any chance you can take a stab at it? No problem if you can't -- just planning my work the next few weeks!

It sounds like you guys are actively working on implementing the functionality I'm looking for with predictInterval(). Please let me know if and when it is available for use. Thanks kindly.

Casey Justice
jusc@critfc.org

I've got a draft idea of this in the inst subfolder; it is almost working. Performance is not optimal and I am not sure what the final results should look like, but the REmargins function should get wrapped up sometime soon to make this functionality more easily available.

This was fixed with the most recent release; there is now an REmargins function that tackles this. Thanks for your reports @CJustice77 and @bachlaw