matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.

Extending `formulaic` to work with other input types

bnjhng opened this issue · comments

Context:
My team uses patsy heavily. One aspect of patsy that makes it great for our use cases is that it isn't strict about its input types. For example, patsy works equally well whether it is given pd.DataFrame({"a": np.array([1,2,3])}) or {"a": np.array([1,2,3])}, and we have use cases that call for storing data as a dictionary of numpy arrays.
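
For concreteness, the equivalence we rely on looks roughly like this (a minimal illustration; the column name "a" and the formula are arbitrary):

import numpy as np
import pandas as pd
import patsy

data_as_dict = {"a": np.array([1, 2, 3])}
data_as_frame = pd.DataFrame(data_as_dict)

# patsy materializes the same formula from either container.
m_dict = patsy.dmatrix("a", data_as_dict)
m_frame = patsy.dmatrix("a", data_as_frame)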

Problem:
Currently, we are in the process of replacing patsy with formulaic in our workflow, and we encountered a problem: unlike patsy, formulaic throws an error if we pass in a dictionary of numpy arrays the way we did with patsy. The error we get is FormulaMaterializerNotFoundError: No materializer has been registered for input type 'builtins.dict'. However, we noticed that formulaic's docs mention:

  • extensible data input/output plugins, with implementations for:
    • input:
      • pandas.DataFrame
      • pyarrow.Table
    • output:
      • pandas.DataFrame
      • numpy.ndarray
      • scipy.sparse.CSCMatrix

So we suspect that the lack of support for builtins.dict isn't a fundamental limitation of formulaic but rather something we need to provide if we want to use formulaic the way we have been using patsy.
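
For reference, a minimal reproduction of what we are seeing (with the formulaic version we happen to be on; behaviour may of course differ in other releases):

>>> import numpy as np
>>> import formulaic
>>> formulaic.model_matrix("a", {"a": np.array([1, 2, 3])})
FormulaMaterializerNotFoundError: No materializer has been registered for input type 'builtins.dict'.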

Attempted solution:
We did some digging into materializers and were able to hack together something that works (i.e., passes all our existing unit tests) without really understanding what materializers are for or how they work.
In essence, we created a custom DictMaterializer and passed that into model_matrix like this (where data in this example is a dictionary of numpy arrays):

formulaic.model_matrix(spec, data, na_action="ignore", materializer=DictMaterializer)

And we defined DictMaterializer as follows:

from typing import Any, Dict, Sequence, Tuple
from formulaic.model_spec import ModelSpec
from interface_meta import override
from formulaic.materializers import PandasMaterializer

class DictMaterializer(PandasMaterializer):
    REGISTER_NAME = "builtins"
    REGISTER_INPUTS = ("builtins.dict", )
    REGISTER_OUTPUTS = ("builtins",)

    @override
    def _combine_columns(
        self, cols: Sequence[Tuple[str, Any]], spec: ModelSpec, drop_rows: Sequence[int]
    ) -> Dict:
        # Note: drop_rows is deliberately ignored here; we only ever call
        # model_matrix with na_action="ignore" (see reason 3 below).
        return {name: values for name, values in cols}

We recognize that our solution is probably not ideal for 3 reasons:

  1. We subclassed from PandasMaterializer instead of FormulaMaterializer. We did this to get something working quickly.
  2. We only overrode the _combine_columns method while ignoring all the other ones implemented by PandasMaterializer; again, _combine_columns was the only method we had to override to get our unit tests to pass.
  3. We currently only have use cases for na_action="ignore", and so in our implementation of _combine_columns the drop_rows input is completely ignored (see the sketch just below this list).
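
If we did want to honour drop_rows, then in addition to adding import numpy as np to the imports above, the _combine_columns above would become roughly the following. This is only a sketch based on our assumptions: we are treating drop_rows as positional indices of rows to exclude, and we have not checked whether the base materializer already removes those rows before this point in the pipeline.

    @override
    def _combine_columns(
        self, cols: Sequence[Tuple[str, Any]], spec: ModelSpec, drop_rows: Sequence[int]
    ) -> Dict:
        # Assumption: drop_rows holds positional row indices that have not
        # yet been removed from the incoming columns.
        drop = np.asarray(sorted(drop_rows), dtype=int)
        return {name: np.delete(np.asarray(values), drop) for name, values in cols}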

Help needed:
We are reaching out in this forum to see if we are on the right track, and if so, what we can do to get our solution to a more ideal state.
It would also be very helpful if we could get a summary/explanation of what materializers are for and how they are meant to work.

@matthewwardrop

Hi @bnjhng. Thanks for reaching out.

In terms of the general role of materialization, did you read the generic docs here: https://matthewwardrop.github.io/formulaic/guides/formulae/#materialization ?

In terms of the right solution to this, your approach isn't crazy! It likely misses some edge-cases, but it probably works in the majority of cases. I've been tossing up adding support for dicts like this by just casting them to a pandas DataFrame in the default pandas materializer. Is there a reason that you want to avoid this?
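
For clarity, the cast I have in mind is nothing more exotic than the following, here done manually at the call site (if this were added, it would instead live inside the pandas materializer):

import numpy as np
import pandas as pd
import formulaic

data = {"a": np.array([1, 2, 3])}
mm = formulaic.model_matrix("a", pd.DataFrame(data))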

In terms of the general role of materialization, did you read the generic docs here: https://matthewwardrop.github.io/formulaic/guides/formulae/#materialization ?

Thanks for linking it! Somehow I missed that section as I was going through the docs.

I've been tossing up adding support for dicts like this by just casting them to a pandas DataFrame in the default pandas materializer. Is there a reason that you want to avoid this?

We have found that very often, when performing operations that do not involve row indexing (i.e., the vast majority of our data transformations), working with a dictionary of numpy arrays has a speed advantage over a pandas DataFrame.
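
As a rough illustration of the kind of comparison that motivates this (not a rigorous benchmark; the numbers depend heavily on machine and workload):

import timeit
import numpy as np
import pandas as pd

a = np.random.rand(1_000_000)
as_dict = {"a": a}
as_frame = pd.DataFrame(as_dict)

# Element-wise transform with no row indexing involved.
print(timeit.timeit(lambda: np.log1p(as_dict["a"]), number=50))
print(timeit.timeit(lambda: np.log1p(as_frame["a"]), number=50))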

It likely misses some edge-cases, but it probably works in the majority of cases.

This is indeed our conclusion! The main limitation we have encountered so far is with categorical encoding. And we have isolated the main issue to the following:

>>> from formulaic.materializers.types import FactorValues
>>> import pandas as pd
>>> import numpy as np
>>> 
>>> # while this works as expected:
>>> print(pd.Series(FactorValues(pd.Series(["a", "b", "c"]))))
0    a
1    b
2    c
dtype: object

>>> # this doesn't give the expected results:
>>> print(pd.Series(FactorValues(np.array(["a", "b", "c"]))))
0     abc
1    bc
2     c
dtype: object

>>> # and this straight up errors out:
>>> print(pd.Series(FactorValues(np.array([1, 2, 3]))))
TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got FactorValues)

The docstring of FactorValues says that it is:

A convenience wrapper that surfaces a FactorValuesMetadata instance at
<object>.__formulaic_metadata__. This wrapper can otherwise wrap any
object and behaves just like that object.

But clearly, in the case of numpy.ndarray, the wrapper doesn't behave just like numpy.ndarray. One way to get the code above to work properly is to explicitly access the __wrapped__ attribute:

# both of the following work as expected:
>>> print(pd.Series(FactorValues(np.array(["a", "b", "c"])).__wrapped__))
0    a
1    b
2    c
dtype: object

>>> print(pd.Series(FactorValues(np.array([1, 2, 3])).__wrapped__))
0    1
1    2
2    3
dtype: int64

However, there are places in the formulaic code that simply do pandas.Series(data) instead of pandas.Series(data.__wrapped__), for example here and here.
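
In the meantime, the workaround we are experimenting with on our side is a tiny helper that unwraps before constructing the Series (a sketch; to_series is just a name we made up, and we are assuming every wrapper of interest exposes __wrapped__):

import pandas as pd

def to_series(values):
    # If the object is a FactorValues-style proxy exposing __wrapped__, hand
    # the underlying array to pandas rather than the proxy itself.
    return pd.Series(getattr(values, "__wrapped__", values))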

Would it be possible to fix this limitation with formulaic?