matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.

Allow formatting the categorical encoded variables

hguturu opened this issue · comments

Currently they get formatted as C({parameter})[T.{value}], or {parameter}[T.{value}] if it's already a string.
E.g.,

BinGrp = [0, 0, 0, 1, 1, 1]
becomes
   C(BinGrp)[T.0]  C(BinGrp)[T.1]
0               1               0
1               1               0
2               1               0
3               0               1
4               0               1
5               0               1

It would be nice if we could pass in a format string to get simpler names. E.g. BinGrp0, BinGrp1 if we pass in a format string like "{parameter}{value}"
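Until something like this is supported, one workaround is to rename the columns after the matrix has been built. The following is only a minimal sketch: `rename_encoded_columns` is a hypothetical helper (not part of formulaic), the regular expression assumes the `C(...)[T.value]` / `name[T.value]` naming shown above, and the example runs on a plain pandas DataFrame that stands in for the design matrix.

```python
import re

import pandas as pd

# Matches "C(BinGrp)[T.0]", "C(BinGrp, contr.treatment)[T.0]" and "BinGrp[T.0]".
# Assumption: column names follow the treatment-coding pattern shown above.
_ENCODED = re.compile(
    r"^(?:C\((?P<wrapped>[^,)]+)[^\[\]]*\)|(?P<bare>[^\[\]()]+))"
    r"\[T\.(?P<value>[^\]]+)\]$"
)


def rename_encoded_columns(frame, fmt="{parameter}{value}"):
    """Hypothetical helper: rename encoded columns using a format string."""

    def _rename(column):
        match = _ENCODED.match(str(column))
        if match is None:
            return column  # leave e.g. "Intercept" untouched
        parameter = (match.group("wrapped") or match.group("bare")).strip()
        return fmt.format(parameter=parameter, value=match.group("value"))

    return frame.rename(columns=_rename)


# Stand-in for the design matrix shown above.
design = pd.DataFrame(
    {"C(BinGrp)[T.0]": [1, 1, 1, 0, 0, 0], "C(BinGrp)[T.1]": [0, 0, 0, 1, 1, 1]}
)
renamed = rename_encoded_columns(design)
print(list(renamed.columns))  # → ['BinGrp0', 'BinGrp1']
```

Renaming after the fact only changes the labels; anything that later looks columns up by their original names would need the same mapping applied.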

Moved from #46 (comment)

I think it would be possible to easily add a format argument to the C() function; and the resulting formulae would look something like:

C(A, format="{variable}:{value}")

But presently the "variable" argument would be the entire C(A, format="{variable}:{value}"), not A. You could potentially fix this, but in principle you could encode A differently multiple times in the same formula... so I'm not sure yet whether this approach is worth pursuing.
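To make the problem concrete, here is a tiny sketch using plain `str.format`. It does not call formulaic at all, and the names `factor_expr` and `fmt` are illustrative only; it just shows what the label would look like if "variable" were filled with the whole factor expression rather than `A`.

```python
# Illustration only: plain string formatting, no formulaic involved.
factor_expr = 'C(A, format="{variable}:{value}")'  # the full factor as written
fmt = "{variable}:{value}"

# If "variable" is the entire expression, the format string is echoed back.
label = fmt.format(variable=factor_expr, value="a")
print(label)  # → C(A, format="{variable}:{value}"):a

# What the user presumably wants instead is the underlying variable name.
wanted = fmt.format(variable="A", value="a")
print(wanted)  # → A:a
```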

Can you suggest a syntax that would make sense for you so we can further evaluate this?

Good point. I was coming more from the perspective of having easier-to-handle variable names.

e.g.,

design = formulaic.model_matrix(["C(BinGrp, contr.treatment)"], all_phenotypes)

model = sm.OLS([1,2,3,1,2,3], design).fit()
model.summary()

model.t_test("C(BinGrp, contr.treatment)[T.1] - C(BinGrp, contr.treatment)[T.0]") # impressively works

But it is a little cumbersome to do.

Similarly if you had multiple encodings e.g.,

design = formulaic.model_matrix(["C(BinGrp, contr.treatment) + poly(BinGrp) + exp(BinGrp)"], all_phenotypes)
   C(BinGrp, contr.treatment)[T.0]  C(BinGrp, contr.treatment)[T.1]  poly(BinGrp)[1]  exp(BinGrp)
0                                1                                0        -0.408248     1.000000
1                                1                                0        -0.408248     1.000000
2                                1                                0        -0.408248     1.000000
3                                0                                1         0.408248     2.718282
4                                0                                1         0.408248     2.718282
5                                0                                1         0.408248     2.718282

Then I think you would specify a format for each one?

Using your suggested syntax:

C(BinGrp, contr.treatment) -> C(BinGrp, contr.treatment, format="{variable}:{value}")
poly(BinGrp) -> poly(BinGrp, format="poly_{variable}_{value}")
exp(BinGrp) -> exp(BinGrp, format="{variable}") # e.g. you just want the value transformed but keep the name (silly transform)


                      BinGrp:0                              BinGrp:1   poly_BinGrp_1    BinGrp
0                                1                                0        -0.408248     1.000000
1                                1                                0        -0.408248     1.000000
2                                1                                0        -0.408248     1.000000
3                                0                                1         0.408248     2.718282
4                                0                                1         0.408248     2.718282
5                                0                                1         0.408248     2.718282

If format is not provided it falls back to the default?

Hmmmm... adding format arguments to every method is not really viable (we are just proxying numpy methods, and this wouldn't work for aliasing variables outside of a function call). We could obviously wrap these methods, but I'm not convinced this is a good idea.

After reflecting more on this, I think sensible (non-mutually exclusive) ways forward might include:

  1. Adding support for format strings to categorical features to allow overriding the naming of columns combined with their levels. e.g.: C(X, fmt='{variable}.{level}')
  2. Adding an aliasing operator along the lines of y ~ ("my_name" := C(X, fmt='...'))
  3. Better documenting the existing aliasing functionality:
import pandas
from formulaic import model_matrix
from formulaic.transforms import C

# "y" is included so that the left-hand side of the formula below can be resolved
data = pandas.DataFrame({"y": [1, 2, 3], "X": ['a', 'b', 'c']})

my_var = C(data.X)
model_matrix("y ~ my_var", data)

I think I am leaning toward (1) and (3). I would consider implementing in-formula aliasing if there were enough demand for it... but remain unconvinced at present.

I wasn't aware of 3. I tried it and it almost works, but the value var is still formatted differently.
e.g.

my_var = C(data.X)
model_matrix("~ my_var", data)

   Intercept  my_var[T.b]  my_var[T.c]
0        1.0            0            0
1        1.0            1            0
2        1.0            0            1

But, I was digging into the code a little bit and I realized there may be a simple enough way to get what is desired (although perhaps not stable across versions due to not being a "blessed" API).

import pandas
from formulaic import model_matrix
import formulaic

data = pandas.DataFrame({"X": ['a', 'b', 'c']})
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~C(X)", data)

   Intercept  C(X).b  C(X).c
0        1.0       0       0
1        1.0       1       0
2        1.0       0       1

This is almost the desired output, but C(X) is still stored in the name. Perhaps there is a similar hack for this as well, if I can track down where the name is being set.
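One caveat with this hack: assigning to a class attribute changes it for every model matrix built afterwards in the same process. A minimal sketch of limiting the change to a block of code is below. `temporary_attr` is a hypothetical helper, and `StandInContrasts` is a stand-in class used so the example runs without formulaic; its format string is chosen only to mirror the column names seen above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_attr(obj, name, value):
    """Hypothetical helper: set obj.name to value, then restore it on exit."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Restore the previous value even if the body raises.
        setattr(obj, name, original)


class StandInContrasts:
    # Stand-in for the class patched above; mirrors the "name[T.field]" naming.
    FACTOR_FORMAT = "{name}[T.{field}]"


with temporary_attr(StandInContrasts, "FACTOR_FORMAT", "{name}.{field}"):
    inside = StandInContrasts.FACTOR_FORMAT.format(name="X", field="b")
outside = StandInContrasts.FACTOR_FORMAT.format(name="X", field="b")

print(inside)   # → X.b
print(outside)  # → X[T.b]

# With formulaic, the same pattern would presumably look like this
# (FACTOR_FORMAT is not an official API, so this may break between versions):
#   with temporary_attr(formulaic.transforms.contrasts.TreatmentContrasts,
#                       "FACTOR_FORMAT", "{name}.{field}"):
#       model_matrix("~C(X)", data)
```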

I could do

from formulaic.transforms import C
my_var = C(data.X)
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~my_var", data)

   Intercept  my_var.b  my_var.c
0        1.0         0         0
1        1.0         1         0
2        1.0         0         1

and that gets me exactly what is needed, but it requires knowing which variables in the formula are contrast-coded, which means parsing the formula.

By chance, is there a similar format constant I can play with to get the formatting needed without an official format support?

It already works when I don't explicitly ask for contrast coding, but only after converting the values to strings.

from formulaic.transforms import C
data.X = data.X.astype(str)
formulaic.transforms.contrasts.TreatmentContrasts.FACTOR_FORMAT = '{name}.{field}'
model_matrix("~ X", data)

   Intercept  X.b  X.c
0        1.0    0    0
1        1.0    1    0
2        1.0    0    1