How to include structural zeros?
windisch opened this issue
What's the preferred way to model structural zeros in a `Formula`?
Assume the following toy example: I have a table of counts

|   | e | f |
|---|---|---|
| a | 1 | 0 |
| b | 2 | 3 |
| c | 4 | 0 |

given as a pandas dataframe as follows:
import pandas as pd

df = pd.DataFrame(
    data={
        'F1': ['a', 'a', 'b', 'b', 'c', 'c'],
        'F2': ['e', 'f', 'e', 'f', 'e', 'f'],
        'n': [1, 0, 2, 3, 4, 0],
    }
)
The combinations (a, f) and (c, f) are structural zeros. If I build the model matrix for `n ~ C(F1):C(F2)` on that data as follows
from formulaic import Formula
y, X = Formula('n ~ C(F1):C(F2)').get_model_matrix(df, ensure_full_rank=False)
then the corresponding variables `C(F1)[T.a]:C(F2)[T.f]` and `C(F1)[T.c]:C(F2)[T.f]` are columns of `X`. Is there a way to remove these parameters already in the formula? Is there another concept in `formulaic` to deal with this type of constraint?
Hi @windisch,
Apologies for the delay in my response. Life has been pretty hectic of late.
At present, there is no way to handle this in Formulaic (short of deleting these columns after the model matrix is created). Is there precedent for supporting this kind of transformation in other formula implementations? (This isn't a requisite for including it in Formulaic, but it does help to think through how others have solved this issue).
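For reference, the "delete the columns afterwards" workaround could look roughly like the sketch below. It assumes the model matrix behaves like a pandas dataframe. To keep the snippet runnable without Formulaic, it builds a stand-in indicator matrix with `pd.get_dummies`; the helper name `drop_columns_by_name` and the short column labels such as `a:f` are made up for illustration. With a real model matrix you would pass the column names from the question instead.

```python
import pandas as pd


def drop_columns_by_name(X: pd.DataFrame, names) -> pd.DataFrame:
    """Return a copy of X without the listed columns; fail loudly on unknown names."""
    names = list(names)
    unknown = [name for name in names if name not in X.columns]
    if unknown:
        raise KeyError(f"columns not in model matrix: {unknown}")
    return X.drop(columns=names)


df = pd.DataFrame({
    "F1": ["a", "a", "b", "b", "c", "c"],
    "F2": ["e", "f", "e", "f", "e", "f"],
    "n": [1, 0, 2, 3, 4, 0],
})

# Stand-in for the model matrix (NOT Formulaic output): one indicator
# column per (F1, F2) cell, labelled "a:e", "a:f", ...
X = pd.get_dummies(df["F1"] + ":" + df["F2"]).astype(float)

# Remove the two cells that are known to be structural zeros.
X_reduced = drop_columns_by_name(X, ["a:f", "c:f"])
print(list(X_reduced.columns))
```

Raising on unknown names is a deliberate choice here: a typo in a column label would otherwise silently leave the structural-zero parameter in the model.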
If we were to add support for this, I think the easiest approach would be to generate the matrix as is, and then remove any columns that are identically zero. This does mean that some unnecessary work is done, which is a little inelegant... but I'm not sure it makes sense to pass around richer metadata than this. Of course, that means it could just as easily be done outside formulaic too.
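Done outside Formulaic, removing identically-zero columns might look like the sketch below. One caveat: in the dataframe from the question the rows for (a, f) and (c, f) are present, so in an indicator-style matrix their columns each contain a one and are not identically zero. The sketch therefore drops those rows first. The matrix is again a hand-built stand-in (products of `pd.get_dummies` indicators), and the helper name `drop_all_zero_columns` and the `a:f`-style labels are illustrative. I would expect a real model matrix to behave the same way, but that is an assumption worth checking.

```python
import pandas as pd


def drop_all_zero_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns that contain at least one non-zero entry."""
    return X.loc[:, (X != 0).any(axis=0)]


df = pd.DataFrame({
    "F1": ["a", "a", "b", "b", "c", "c"],
    "F2": ["e", "f", "e", "f", "e", "f"],
    "n": [1, 0, 2, 3, 4, 0],
})

# Cells known to be structural zeros; their rows are left out of the data.
structural_zeros = {("a", "f"), ("c", "f")}
is_structural = [pair in structural_zeros for pair in zip(df["F1"], df["F2"])]
observed = df.loc[[not flag for flag in is_structural]]

# Stand-in interaction matrix: product of the per-factor indicator columns.
D1 = pd.get_dummies(observed["F1"]).astype(float)
D2 = pd.get_dummies(observed["F2"]).astype(float)
X = pd.DataFrame({f"{a}:{b}": D1[a] * D2[b] for a in D1.columns for b in D2.columns})

X_reduced = drop_all_zero_columns(X)
print(list(X_reduced.columns))
```

Unlike dropping columns by name, this needs no list of labels, but it cannot tell a structural zero from a cell that merely happens to be unobserved in the data.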
In an ideal world, what would you like to see done?