bambinos / formulae

Formulas for mixed-effects models in Python

Home Page: https://bambinos.github.io/formulae/

Support unknown test time group categories

markgoodhead opened this issue

I just tried a model with a different train/test split for the fit() and predict() functionality and hit the following error in the predict() call in bambi:

raise ValueError(
f"The levels {', '.join(difference)} in '{self.name}' are not present in "
"the original data set."
)

For a lot of real-world use cases it's quite common for unknown categories to show up at inference time (e.g. new users, new customers, etc.), and it would be great to handle this natively in the predict() call, such that a 'default' prediction can be made based on the mean across all groups rather than a specific group.

I'm tempted to contribute this but I'm a little intimidated as I don't really know where to start! How I handle this in my own PyMC models is to extend the length of my group parameter vector by 1 and have the zeroth index be an 'unknown' category, so at predict time I can set all unseen categories equal to that category. Is there an analogous way to add this to formulae / bambi? @tomicapretto if you have any advice as to what I'd need to do to implement this, it'd be much appreciated!
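For reference, here is a minimal sketch of that reserved-index workaround, assuming a simple varying-intercept PyMC model; the data, variable names, and priors below are purely illustrative:

import numpy as np
import pymc as pm

# Illustrative training data: groups coded as integers 1..n_groups,
# with index 0 reserved for the "unknown" category.
n_groups = 3
group_idx = np.array([1, 2, 3, 1, 2])
y = np.array([1.2, 0.4, -0.3, 0.9, 0.1])

with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0)
    sigma_g = pm.HalfNormal("sigma_g", 1.0)
    # One extra entry at index 0 acts as the "unknown" group; it is only
    # constrained by the hierarchical prior, i.e. by the variation across groups.
    group_offset = pm.Normal("group_offset", 0.0, sigma_g, shape=n_groups + 1)
    pm.Normal("y_obs", mu + group_offset[group_idx], 1.0, observed=y)
    idata = pm.sample()

# At prediction time, map any unseen category to the reserved index 0.
new_group_idx = np.array([2, 5])  # group 5 was never observed
new_group_idx = np.where(new_group_idx <= n_groups, new_group_idx, 0)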

@markgoodhead thanks for opening this issue. I'm sorry I couldn't reply earlier. Let me try to give an answer to this issue.

You're right that Formulae does not allow the creation of design matrices for out-of-sample data that contains unseen categories. How to predict a quantity for an unseen category is not a trivial problem, and I don't think that formulae should be the place where we want to implement one of the many possible arbitrary choices. Each modeling library should make its own decisions in this regard.

One good reference is the R library brms. They have dealt with this problem earlier than us. Have a look at the documentation of the prepare_predictions function.

I copy two relevant arguments here:

  • allow_new_levels: A flag indicating if new levels of group-level effects are allowed (defaults to FALSE). Only relevant if newdata is provided.
  • sample_new_levels: Indicates how to sample new levels for grouping factors specified in re_formula. This argument is only relevant if newdata is provided and allow_new_levels is set to TRUE. If "uncertainty" (default), each posterior sample for a new level is drawn from the posterior draws of a randomly chosen existing level. Each posterior sample for a new level may be drawn from a different existing level such that the resulting set of new posterior draws represents the variation across existing levels. If "gaussian", sample new levels from the (multivariate) normal distribution implied by the group-level standard deviations and correlations. This option may be useful for conducting Bayesian power analysis or predicting new levels in situations where relatively few levels were observed in the old data. If "old_levels", directly sample new levels from the existing levels, where a new level is assigned all of the posterior draws of the same (randomly chosen) existing level.

This documentation implies that you can only predict new levels for variables that appear as grouping factors in group-specific effects. In other words, if you have a categorical variable x, you can predict a new level if your model uses it as y ~ (1|x) or y ~ (z|x), but not if it is used as a common effect such as y ~ x.

I agree with the behavior in brms. I think that it only makes sense to predict new groups when you're using it as a group-specific effect (because of the partial pooling effect, you assume all the other groups can give information about your new group). I also think that we should not allow predicting for new groups that are part of the common effects because that violates the assumptions you make when you put it as a main effect.
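To make the distinction concrete, here is a minimal sketch using formulae's design_matrices; the data are made up, and evaluate_new_data is assumed to be the out-of-sample entry point that Bambi relies on:

import pandas as pd
from formulae import design_matrices

# Illustrative data: "g" is a categorical variable, "z" a numeric predictor.
train = pd.DataFrame(
    {"y": [1.0, 2.0, 3.0, 4.0], "z": [0.1, 0.2, 0.3, 0.4], "g": ["a", "a", "b", "b"]}
)
test = pd.DataFrame({"z": [0.5], "g": ["c"]})  # level "c" was never seen at fit time

# As a group-specific effect, predicting for "c" is reasonable (partial pooling).
dm_group = design_matrices("y ~ z + (1|g)", train)
Z_new = dm_group.group.evaluate_new_data(test)  # should be allowed once this feature lands

# As a common effect, the proposal discussed here would keep rejecting unseen levels.
dm_common = design_matrices("y ~ z + g", train)
X_new = dm_common.common.evaluate_new_data(test)  # expected to raise ValueError under this proposal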

In summary,

  • We should implement something to allow Formulae to generate design matrices for group-specific effects considering unseen groups.
  • How we handle those unseen groups is not a problem that we should solve in Formulae. That should be handled on Bambi's side.
    • It won't be trivial either.

With all that said, do you want to work on it? I'm not sure where I would start, but if you want I could try to propose a solution and we could work through this together.

Thanks @tomicapretto - I agree with what you're saying here and I like the brms API, so I think mimicking this in Bambi would be a good feature (apologies if I raised this in the wrong repo; it's only because formulae is where I hit the actual error message).

It sounds like it'll be quite an involved feature to add that will require a broad understanding of the Bambi codebase (I've not really looked at any of the code to do with design matrix production/handling, for example), but I'll start digging into how that works on the Bambi end and see if I can work out what'd be required!

It needs work on both ends. First, some work here in Formulae so we can generate new design matrices for group-specific effects with unseen levels.

Then, some work on Bambi's side to get predictions. I think this part is going to be the trickiest. Our prediction code is quite messy right now. I have a proof of concept to improve it with xarray utilities, but I haven't had time to get to it yet.

A couple of places to look at

Perhaps I should hold off until after that prediction-code refactor (unless you think it'll be some time until you can tackle it)?

Hi there, is there any update on this? I came across a similar problem. One change I was just tinkering with on my end was to change eval_new_data_categoric to something like

def eval_new_data_categoric(self, x):
    # Union of the levels seen at fit time and the levels present in the new data.
    # Note: set operations do not preserve the original level order.
    new_data_levels = set(x)
    original_levels = set(self.levels)
    all_levels = list(original_levels.union(new_data_levels))

    # One row per level; rows for unseen levels remain all zeros.
    new_contrast_matrix = np.zeros((len(all_levels), self.contrast_matrix.matrix.shape[1]))
    for i, level in enumerate(all_levels):
        if level in original_levels:
            idx = self.levels.index(level)
            new_contrast_matrix[i] = self.contrast_matrix.matrix[idx]

    # Map each observation to its level (seen or unseen) and pick the matching row.
    idxs = pd.Categorical(x, categories=all_levels).codes
    return new_contrast_matrix[idxs]

Of course, this change wouldn't be suitable in Variable, since we probably wouldn't want to allow unseen levels for common effects. Mostly providing it for posterity. Happy to work on this if needed.

Hi @pnxenopoulos

You're more than welcome to work on this if you can. Regarding the solution shown, I think it is a good start but we also need to consider the order of the levels (since users can also modify this).

I think returning a vector of zeros for new levels makes sense, but I will keep thinking about it. I'm also considering how we would use this information in Bambi. This approach would require identifying which observations belong to a new group, and that identification could happen with something like (vector == 0).all()... Leaving this idea for posterity as well.
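For illustration, a tiny sketch of that detection idea; the array below is made up, and treating each observation as a row is an assumption about the layout:

import numpy as np

# Hypothetical group-specific design rows: one row per observation.
# An all-zeros row flags an observation whose group was not seen at fit time.
Z = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],  # unseen group
])

is_new_group = (Z == 0).all(axis=1)
print(is_new_group)  # [False False  True]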

And regarding your initial question, I'm not actively working on it right now but it's near the top of my to-do list. I anticipate having it done by the end of May 2023.

Closing since this is already implemented. See #95, #96, and #100. This is included in the release I'm making right now.

Feel free to leave questions about how it works @markgoodhead @pnxenopoulos

Long story short, you can make predictions with new groups. Even though it "works" when the categoric predictor is included as a fixed effect, it only makes sense when the categoric variable is the grouping variable of a random effect.
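For anyone landing here later, a minimal usage sketch on the Bambi side; the sample_new_groups keyword is assumed from Bambi's predict() API around this release, so check your version's docstring. The data and names are illustrative:

import bambi as bmb
import pandas as pd

# Illustrative data: the group "c" is absent from the training set.
train = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0], "group": ["a", "a", "b", "b"]})
test = pd.DataFrame({"group": ["a", "c"]})

model = bmb.Model("y ~ 1 + (1|group)", train)
idata = model.fit()

# Assumed keyword: draws group-specific effects for levels not present at fit time.
model.predict(idata, data=test, sample_new_groups=True)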