MaxHalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA

Home Page: https://maxhalford.github.io/prince

Handle_unknown in OneHotEncoder (FAMD) method

A-acuto opened this issue

Hi, at the moment, when applying the FAMD method, the categorical columns are preprocessed using OneHotEncoder from the sklearn library without any specification of how to handle unknown categories (so an error is raised if new data contains a category that was not seen during fitting).

I was wondering if there is interest in implementing a change that lets the user choose how unknowns are handled (inherited from sklearn): either "raise errors", "ignore" or "substitute".

Thanks for the really handy package!

Hey! So you would essentially like to control the handle_unknown parameter of the OneHotEncoder? The parameters I see there are {‘error’, ‘ignore’, ‘infrequent_if_exist’}, but not substitute.

Yes, it would make things easier when dealing with a continuous stream of high-dimensional data (new values for certain variables arrive infrequently, so they raise errors and abort the computation).
I hope this is clear.
Thanks

OK, I've added a handle_unknown parameter to FAMD, which is passed on to OneHotEncoder. It's available in version 0.10.4 :)
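
For reference, the behaviour this parameter controls can be seen in scikit-learn directly. The sketch below is only an illustration with made-up data (the `color` column and its values are assumptions); it calls OneHotEncoder on its own and does not go through prince.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "green" never appears in the training set.
train = pd.DataFrame({"color": ["red", "blue", "red"]})
new = pd.DataFrame({"color": ["green"]})

# handle_unknown="error" (the sklearn default) raises on unseen categories.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True

# handle_unknown="ignore" encodes an unseen category as an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(new).toarray()

print(raised)   # whether the strict encoder rejected the new row
print(encoded)  # one row, one column per category seen during fit
```

With "ignore", the unseen value contributes nothing to any of the fitted category columns, which is what keeps a streaming pipeline from aborting.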

Hello! I would like to mention that it would be nice to include this parameter for MCA, just as you did for FAMD, since it encodes categorical features as well. It would have to be rewritten a little since, as far as I can see, it uses pd.get_dummies instead of OneHotEncoder. I wish I could fork it and rewrite it myself, but I don't have much time at the moment :(

Hey @TalkativeGuy. Actually, for MCA, it's valid to pass columns and column values which were not seen before. Do you have an example/usecase in mind?
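
To make the pd.get_dummies versus OneHotEncoder distinction concrete: pd.get_dummies keeps no fitted state, so the dummy columns it produces depend only on the data it is given. The sketch below uses made-up data and shows one generic way to align new data with the training columns using `reindex`. This is an assumption for illustration, not a description of how prince's MCA handles unseen values internally.

```python
import pandas as pd

# Hypothetical data: "green" is new, "blue" is absent from the new batch.
train = pd.DataFrame({"color": ["red", "blue"]})
new = pd.DataFrame({"color": ["green", "red"]})

train_dummies = pd.get_dummies(train, dtype=int)
new_dummies = pd.get_dummies(new, dtype=int)

# The two frames have different columns because get_dummies is stateless.
print(list(train_dummies.columns))
print(list(new_dummies.columns))

# Align the new batch to the training columns: unseen categories are dropped
# and categories missing from the batch are filled with zeros.
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned)
```

After alignment, the unseen "green" row is all zeros across the training columns, which mirrors what `handle_unknown="ignore"` does in OneHotEncoder.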