Handle_unknown in OneHotEncoder (FAMD) method

Question

A-acuto opened this issue a year ago · comments

Hi, at the moment, when applying the FAMD method, the categorical columns are preprocessed using OneHotEncoder from the sklearn library without any specification of how to handle unknown categories (so an error is raised if new data is encountered).

I was wondering whether there is interest in implementing a change that lets the user choose how unknowns are handled (inherited from sklearn): either "raising errors", "ignore" or "substitute".

Thanks for the really handy package!

Max Halford · Answer 1 · Sat Apr 29 2023 03:31:28 GMT+0800 (China Standard Time)

Hey! So you would essentially like to control the handle_unknown parameter of the OneHotEncoder? The parameters I see there are {‘error’, ‘ignore’, ‘infrequent_if_exist’}, but not substitute.
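
For readers unfamiliar with the parameter, here is a minimal sketch of how handle_unknown changes the behaviour of scikit-learn's OneHotEncoder on its own. The toy "color" data is made up for illustration and is not taken from this thread.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column.
train = np.array([["red"], ["green"], ["blue"]])
new = np.array([["red"], ["purple"]])  # "purple" was not seen during fit

# handle_unknown="error" (the default): unseen categories raise ValueError.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True

# handle_unknown="ignore": unseen categories become an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(new).toarray()

print(raised)
print(lenient.categories_)
print(encoded)
```

With "ignore", the row for the unseen value carries no information about that column, which may or may not be acceptable depending on the downstream analysis.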

Alberto Acuto · Answer 2 · Tue May 02 2023 16:39:58 GMT+0800 (China Standard Time)

Yes, it would make things easier when dealing with a continuous stream of high-dimensional data: new values for certain variables arrive infrequently, raise errors, and stop the computation.
I hope this is clear.
Thanks

Max Halford · Answer 3 · Tue May 02 2023 23:59:48 GMT+0800 (China Standard Time)

Ok I've added a handle_unknown parameter to FAMD, that will be fed to OneHotEncoder. It's available in version 0.10.4 :)

Vlad Byzov · Answer 4 · Sat Jun 17 2023 20:00:05 GMT+0800 (China Standard Time)

Hello! I would like to mention that it would be nice to include this parameter for MCA, just as you did for FAMD, since it also encodes categorical features. It would have to be rewritten a little because, as far as I can see, it uses pd.get_dummies instead of OneHotEncoder. I wish I could fork it and rewrite it myself, but I don't have much time at the moment :(
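
To illustrate the difference being described, here is a small sketch of what pd.get_dummies does with previously unseen values. It uses plain pandas on made-up data and is not meant to show how the package handles this internally. Unlike a fitted OneHotEncoder, get_dummies keeps no state, so new data simply yields a different set of columns. One possible workaround, shown here as an assumption rather than a recommendation, is to reindex against the training columns.

```python
import pandas as pd

# Hypothetical toy data: one categorical column.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
new = pd.DataFrame({"color": ["red", "purple"]})  # "purple" is new

# get_dummies is stateless: each call derives columns from the data it sees.
train_dummies = pd.get_dummies(train, dtype=int)
new_dummies = pd.get_dummies(new, dtype=int)

# Align the new data to the training columns: unseen categories are dropped,
# categories missing from the new data are filled with zeros.
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(new_dummies.columns))
print(aligned)
```

After alignment the "purple" row is all zeros, which mirrors what handle_unknown="ignore" does in OneHotEncoder.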

Max Halford · Answer 5 · Tue Jun 20 2023 23:24:38 GMT+0800 (China Standard Time)

Hey @TalkativeGuy. Actually, for MCA, it's valid to pass columns and column values which were not seen before. Do you have an example or use case in mind?