MaxHalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA

Home Page: https://maxhalford.github.io/prince

Support for sklearn Pipelines

MyNameIsFu opened this issue

MCA currently cannot be used in a sklearn Pipeline that has any preceding steps.
In my case I need an imputer to fill NaN values before MCA runs.

Working Example:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from prince.mca import MCA

test_data = pd.DataFrame(data=np.random.random((10, 5)))
test = Pipeline(steps=[
    ("mca", MCA()),
])
test.fit_transform(test_data)

But including a SimpleImputer results in a numpy array being forwarded to the MCA step:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from prince.mca import MCA

test_data = pd.DataFrame(data=np.random.random((10, 5)))
test = Pipeline(steps=[
    ("impute", SimpleImputer()), # This Breaks the Pipeline since it returns an ndarray
    ("mca", MCA()),
])
test.fit_transform(test_data)

I've tried including a dummy transformer step between the imputer and MCA that forwards an arbitrary DataFrame with generic index and column labels, but it results in a KeyError where unknown index labels are looked up in the column list:

KeyError: "None of [Index(['Col_0_0.0', 'Col_0_1.0', 'Col_0_2.0', 'Col_0_3.0', 'Col_0_4.0',\n       'Col_0_5.0', 'Col_1_0.0', 'Col_1_1.0', 'Col_1_2.0', 'Col_2_0.0',\n       'Col_2_1.0', 'Col_3_0.0', 'Col_3_1.0'],\n      dtype='object')] are in the [columns]"

Any suggestions?

Hey there @MyNameIsFu!

I believe you can make this work using sklearn's set_output API:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from prince.mca import MCA

test_data = pd.DataFrame(data=np.random.random((10, 5)))
test = Pipeline(steps=[
    ("impute", SimpleImputer()), # This Breaks the Pipeline since it returns an ndarray
    ("mca", MCA()),
])
test[0].set_output(transform="pandas")
test.fit_transform(test_data)
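
If you're on scikit-learn 1.2 or later (which introduced the set_output API), you can also configure the whole pipeline in one go instead of a single step, something like:

# Ask every step in the pipeline to output DataFrames rather than ndarrays.
test.set_output(transform="pandas")
test.fit_transform(test_data)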

I hope this works for you!

I added a note to the FAQ on the documentation website. I'll close this :)