Support for sklearn Pipelines
MyNameIsFu opened this issue
MCA currently cannot be used in a sklearn Pipeline that contains any preceding steps.
In my case, I need an imputer to fill NaN values.
Working Example:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from prince.mca import MCA
test_data = pd.DataFrame(data=np.random.random((10, 5)))
# MCA as the only step receives the DataFrame directly
test = Pipeline(steps=[
    ("mca", MCA()),
])
test.fit_transform(test_data)
But including a SimpleImputer results in a numpy array that is being forwarded to the MCA:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from prince.mca import MCA
test_data = pd.DataFrame(data=np.random.random((10, 5)))
test = Pipeline(steps=[
    ("impute", SimpleImputer()),  # This breaks the pipeline because it returns an ndarray
    ("mca", MCA()),
])
test.fit_transform(test_data)
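The type change can be seen in isolation, without MCA involved. This is only a minimal sketch with made-up data (the column names `a` and `b` are illustrative), assuming sklearn's default output configuration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small illustrative frame with named columns and one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# With the default output configuration, transformers return a NumPy array.
imputed = SimpleImputer().fit_transform(df)

print(type(imputed).__name__)       # ndarray
print(hasattr(imputed, "columns"))  # False: the column labels are gone
```

Any later step that relies on DataFrame column labels therefore no longer sees them.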
I've tried including a dummy transformer step between the imputer and MCA that forwards an arbitrary DataFrame with generic index and column labels, but it results in a KeyError because unknown index labels are being looked up in the column list:
KeyError: "None of [Index(['Col_0_0.0', 'Col_0_1.0', 'Col_0_2.0', 'Col_0_3.0', 'Col_0_4.0',\n 'Col_0_5.0', 'Col_1_0.0', 'Col_1_1.0', 'Col_1_2.0', 'Col_2_0.0',\n 'Col_2_1.0', 'Col_3_0.0', 'Col_3_1.0'],\n dtype='object')] are in the [columns]"
Any suggestions?
Hey there @MyNameIsFu!
I believe you can make this work using sklearn's set_output API:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from prince.mca import MCA
import numpy as np
import pandas as pd
test_data = pd.DataFrame(data=np.random.random((10, 5)))
test = Pipeline(steps=[
    ("impute", SimpleImputer()),  # Returns an ndarray by default
    ("mca", MCA()),
])
# Make the imputer step return a DataFrame instead of an ndarray
test[0].set_output(transform="pandas")
test.fit_transform(test_data)
I hope this works for you!
I added a note to the FAQ on the documentation website. I'll close this :)