FEAT enable pandas output in `TableVectorizer` as a parameter

Vincent-Maladiere opened this issue a year ago

Problem Description

Although TableVectorizer was initially designed to ingest dataframes and output arrays, users may sometimes prefer to get dataframes back directly, e.g., to perform downstream Joiner/AggJoiner operations or simply for debugging.

This option is currently available with set_output(transform="pandas") via the SetOutputMixin inherited from TransformerMixin. However, most users won't know this option even exists.

Feature Description

Instead, I suggest adding a return_dataframe parameter to TableVectorizer's __init__ method.

- It would be set to False by default.
- Setting it to True would run this snippet in fit_transform:

if self.return_dataframe:
    self.set_output(transform="pandas")

Alternative Solutions

Document the set_output method extensively in skrub. But having to use it is the kind of extra complexity that people, beginners above all, complain about.

Jovan Stojanovic · Answer 1 · Thu Aug 31 2023 17:09:26 GMT+0800 (China Standard Time)

Interesting idea!

I do see that there is an increasing dichotomy in skrub between relying on dataframes or arrays, so this is an important discussion.

This is because we sit somewhere between projects like pandas and scikit-learn. Pipeline-wise, I usually see skrub as taking dataframes as input (e.g. from pandas) and returning arrays (for scikit-learn) for machine learning. This is what TableVectorizer does.

As is currently done in the examples, I would then use Joiners as a step before TableVectorizer, which returns numerical arrays for scikit-learn models. In this case, there is no need for dataframes as output.

Gael Varoquaux · Answer 2 · Thu Aug 31 2023 17:20:27 GMT+0800 (China Standard Time)

> This option is currently available with set_output(transform="pandas") via the SetOutputMixin inherited from TransformerMixin. However, most users won't know this option even exists. Instead, I suggest adding a return_dataframe parameter to TableVectorizer's __init__ method.

I'd rather not depart from the choices made in scikit-learn. We should instead document this feature better, which probably means using it more in our examples, among other things.

Vincent M · Answer 3 · Thu Aug 31 2023 17:31:03 GMT+0800 (China Standard Time)

@jovan-stojanovic I see many use cases (including examples in the AggJoiner) where running TableVectorizer first is a must.

@GaelVaroquaux, I agree with you on the consistency with scikit-learn. I still think set_output is confusing for newcomers who won't catch it in the docs, and an awful design pattern from a user perspective, IMHO.

Let's close this issue, then.

Gael Varoquaux · Answer 4 · Thu Aug 31 2023 17:43:21 GMT+0800 (China Standard Time)

> @GaelVaroquaux, I agree with you on the consistency with scikit-learn. I still think set_output is confusing for newcomers who won't catch it in the doc and an awful design pattern from user perspective, IMHO.

It's there for a variety of reasons:

- It can be enforced consistently in all the estimators that inherit from BaseEstimator. We feared that support across all the scikit-learn compatible libraries would be inconsistent.
- It can also be controlled via https://scikit-learn.org/stable/modules/generated/sklearn.set_config.html. Here the tension is between user-facing code (i.e. a data scientist using scikit-learn to analyse data) and library code (i.e. someone writing a library using scikit-learn).
- It opens the door to future evolution as the dataframe ecosystem changes and we can better support a variety of containers.

API choices in scikit-learn are made with a lot of care, and I hesitate to overrule them.

Vincent M · Answer 5 · Thu Aug 31 2023 17:49:54 GMT+0800 (China Standard Time)

Thanks for the clarification. I didn't have all these elements in mind.

I have a huge bias toward the user side and will always promote things that make life easier for regular data scientists. Ultimately, we do this for them. "User first and maintainers adapt", in a way.