FEAT enable pandas output in `TableVectorizer` as a parameter
Vincent-Maladiere opened this issue · comments
Problem Description
Although `TableVectorizer` was initially designed to ingest dataframes and output arrays, users may sometimes prefer to output dataframes directly, e.g. to perform downstream `Joiner`/`AggJoiner` operations or even for debugging purposes.
This option is currently available with `set_output(transform="pandas")` via the `SetOutputMixin` inherited from `TransformerMixin`. However, most users won't even know this option exists.
Feature Description
Instead, I suggest adding a `return_dataframe` parameter to `TableVectorizer`'s `__init__` method.

- It would be set to `False` by default.
- Setting it to `True` would run this snippet in `fit_transform`:

```python
if self.return_dataframe:
    self.set_output(transform="pandas")
```
Alternative Solutions
Document the `set_output` method extensively in skrub. But having to use it is the kind of extra complexity that people, above all beginners, complain about.
Additional Context
No response
Interesting idea!
I do see that there is an increasing dichotomy in skrub between relying on dataframes or arrays, so this is an important discussion.
This is due to the fact that we sit somewhere in between projects like pandas and scikit-learn. Pipeline-wise, I usually see skrub as taking dataframes as input (e.g. from pandas) and returning arrays (for scikit-learn) for machine learning. This is what `TableVectorizer` does.
I would then use Joiners as a step before `TableVectorizer`, as is currently done in the examples; `TableVectorizer` returns numerical arrays for scikit-learn models, so in this case there is no need for dataframes as output.
@jovan-stojanovic I see many use cases (including examples in the `AggJoiner`) where running `TableVectorizer` first is a must.
@GaelVaroquaux, I agree with you on consistency with scikit-learn. I still think `set_output` is confusing for newcomers, who won't catch it in the docs, and an awful design pattern from a user's perspective, IMHO.
Let's close this issue, then.
Thanks for the clarification; I didn't have all these elements in mind.
I have a huge bias toward the user side and will always promote things that ease the life of regular data scientists. Ultimately, we do this for them. "Users first, and maintainers adapt," in a way.