Support for sklearn Pipelines

Question

MyNameIsFu opened this issue 8 months ago

MCA currently cannot be used in a sklearn Pipeline that contains any preceding steps.
In my case, I need an imputer to fill in NaN values.

Working example:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from prince.mca import MCA

test_data = pd.DataFrame(data=np.random.random((10, 5)))
test = Pipeline(steps=[
    ("mca", MCA()),
])
test.fit_transform(test_data)

But adding a SimpleImputer means that a NumPy array, rather than a DataFrame, is forwarded to MCA:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from prince.mca import MCA

test_data = pd.DataFrame(data=np.random.random((10, 5)))
test = Pipeline(steps=[
    ("impute", SimpleImputer()),  # This breaks the pipeline since it returns an ndarray
    ("mca", MCA()),
])
test.fit_transform(test_data)

I've tried adding a dummy transformer step between the imputer and MCA that wraps the array in a DataFrame with generic index and column labels, but that results in a KeyError: MCA looks up labels that are not in the column list:

KeyError: "None of [Index(['Col_0_0.0', 'Col_0_1.0', 'Col_0_2.0', 'Col_0_3.0', 'Col_0_4.0',\n       'Col_0_5.0', 'Col_1_0.0', 'Col_1_1.0', 'Col_1_2.0', 'Col_2_0.0',\n       'Col_2_1.0', 'Col_3_0.0', 'Col_3_1.0'],\n      dtype='object')] are in the [columns]"
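
For illustration, a dummy step of the kind described above might look like the sketch below. This is not the original code: the class name ToDataFrame is hypothetical, and the Col_<i> labels are only guessed from the error message. It shows the wrapping idea and is not a fix for the KeyError.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ToDataFrame(BaseEstimator, TransformerMixin):
    """Hypothetical helper: wrap an ndarray in a DataFrame with generic labels."""

    def fit(self, X, y=None):
        # Fix the labels at fit time so transform always yields the same columns.
        self.columns_ = [f"Col_{i}" for i in range(np.asarray(X).shape[1])]
        return self

    def transform(self, X):
        # Rebuild a DataFrame using the labels remembered during fit.
        return pd.DataFrame(np.asarray(X), columns=self.columns_)


wrapped = ToDataFrame().fit_transform(np.zeros((3, 2)))
print(list(wrapped.columns))
```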

Any suggestions?

Max Halford · Answer 1 · Sun Feb 11 2024 20:44:32 GMT+0800 (China Standard Time)

Hey there @MyNameIsFu!

I believe you can make this work using sklearn's set_output API:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from prince.mca import MCA

test_data = pd.DataFrame(data=np.random.random((10, 5)))
test = Pipeline(steps=[
    ("impute", SimpleImputer()),  # returns an ndarray by default
    ("mca", MCA()),
])
# Make the imputer return a DataFrame so that MCA receives labelled columns
test[0].set_output(transform="pandas")
test.fit_transform(test_data)

I hope this works for you!

Max Halford · Answer 2 · Sat Sep 07 2024 21:50:00 GMT+0800 (China Standard Time)

I added a note to the FAQ on the documentation website. I'll close this :)