skrub-data / skrub

Prepping tables for machine learning

Home Page: https://skrub-data.org/

FEAT enable pandas output in `TableVectorizer` as a parameter

Vincent-Maladiere opened this issue

Problem Description

Although TableVectorizer was initially designed to ingest dataframes and output arrays, users may sometimes prefer to get dataframes back directly, e.g., to perform downstream Joiner/AggJoiner operations or simply for debugging.

This option is currently available by calling set_output(transform="pandas"), which TableVectorizer gets from the SetOutputMixin that TransformerMixin inherits from. However, most users won't know this option even exists.

Feature Description

Instead, I suggest adding a return_dataframe parameter to TableVectorizer's __init__ method.

  • It would be set to False by default
  • Setting it to True would run this snippet in fit_transform:
if self.return_dataframe:
    self.set_output(transform="pandas")

Alternative Solutions

Document the set_output method extensively in skrub. But having to use it is the kind of extra complexity that people, beginners above all, complain about.

Additional Context

No response

Interesting idea!

I do see that there is an increasing dichotomy in skrub between relying on dataframes or arrays, so this is an important discussion.

This is because we sit somewhere between projects like pandas and scikit-learn. That is, pipeline-wise, I usually see skrub as taking dataframes as input (e.g. from pandas) and returning arrays (for scikit-learn) for machine learning. This is what the TableVectorizer does.

I would then use Joiners as a step before TableVectorizer, which returns numerical arrays for scikit-learn models; this is how it's currently done in the examples. In this case, there is no need for dataframes as output.

@jovan-stojanovic I see many use cases (including examples in the AggJoiner) where running TableVectorizer first is a must.

@GaelVaroquaux, I agree with you on the consistency with scikit-learn. I still think set_output is confusing for newcomers, who won't catch it in the docs, and an awful design pattern from a user perspective, IMHO.

Let's close this issue, then.

Thanks for the clarification. I didn't have all these elements in mind.

I have a huge bias toward the user side and will always promote things that make life easier for regular data scientists. Ultimately, we do this for them. "User first and maintainers adapt", in some ways.