skrub-data / skrub

Prepping tables for machine learning

Home Page: https://skrub-data.org/


Handle numerical missing values in TableVectorizer

tomMoral opened this issue · comments

Problem Description

Missing values are not handled by default.

A reproducer:

from skrub import TableVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_openml


X_df, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_df, y)

model = make_pipeline(TableVectorizer(), RandomForestClassifier())
model.fit(X_train, y_train).score(X_test, y_test)

Gives

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 12
      9 X_train, X_test, y_train, y_test = train_test_split(X_df, y)
     11 model = make_pipeline(TableVectorizer(), RandomForestClassifier())
---> 12 model.fit(X_train, y_train).score(X_test, y_test)

File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/pipeline.py:405, in Pipeline.fit(self, X, y, **fit_params)
    403     if self._final_estimator != "passthrough":
    404         fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 405         self._final_estimator.fit(Xt, y, **fit_params_last_step)
    407 return self

File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/ensemble/_forest.py:345, in BaseForest.fit(self, X, y, sample_weight)
    343 if issparse(y):
    344     raise ValueError("sparse multilabel-indicator for y is not supported.")
--> 345 X, y = self._validate_data(
    346     X, y, multi_output=True, accept_sparse="csc", dtype=DTYPE
    347 )
    348 if sample_weight is not None:
    349     sample_weight = _check_sample_weight(sample_weight, X)

File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/base.py:584, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    582         y = check_array(y, input_name="y", **check_y_params)
    583     else:
--> 584         X, y = check_X_y(X, y, **check_params)
    585     out = X, y
    587 if not no_val_X and check_params.get("ensure_2d", True):

File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/utils/validation.py:1106, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1101         estimator_name = _check_estimator_name(estimator)
   1102     raise ValueError(
   1103         f"{estimator_name} requires y to be passed, but the target y is None"
   1104     )
-> 1106 X = check_array(
   1107     X,
   1108     accept_sparse=accept_sparse,
   1109     accept_large_sparse=accept_large_sparse,
   1110     dtype=dtype,
   1111     order=order,
   1112     copy=copy,
   1113     force_all_finite=force_all_finite,
   1114     ensure_2d=ensure_2d,
   1115     allow_nd=allow_nd,
   1116     ensure_min_samples=ensure_min_samples,
   1117     ensure_min_features=ensure_min_features,
   1118     estimator=estimator,
   1119     input_name="X",
   1120 )
   1122 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
   1124 check_consistent_length(X, y)

File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/utils/validation.py:921, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    915         raise ValueError(
    916             "Found array with dim %d. %s expected <= 2."
    917             % (array.ndim, estimator_name)
    918         )
    920     if force_all_finite:
--> 921         _assert_all_finite(
    922             array,
    923             input_name=input_name,
    924             estimator_name=estimator_name,
    925             allow_nan=force_all_finite == "allow-nan",
    926         )
    928 if ensure_min_samples > 0:
    929     n_samples = _num_samples(array)

File ~/.local/miniconda/lib/python3.10/site-packages/sklearn/utils/validation.py:161, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    144 if estimator_name and input_name == "X" and has_nan_error:
    145     # Improve the error message on how to handle missing values in
    146     # scikit-learn.
    147     msg_err += (
    148         f"\n{estimator_name} does not accept missing values"
    149         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    159         "#estimators-that-handle-nan-values"
    160     )
--> 161 raise ValueError(msg_err)

ValueError: Input X contains NaN.
RandomForestClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Feature Description

If a numeric column contains missing values, use SimpleImputer.

Alternative Solutions

No response

Additional Context

No response

This is not a bug in TableVectorizer: it's down to the learner to handle missing values (because the strategy to handle missing values must differ depending on the learner).

If the learner does not handle missing values, you should add an imputer (as you did).

In addition, RandomForests will handle missing values in the upcoming release of scikit-learn: scikit-learn/scikit-learn#5870
So your specific problem will disappear soon.

However, we recommend using HistGradientBoosting over RandomForest; it often works better.

Still, the goal of the TableVectorizer is to prepare a table so that the rest of the pipeline can work on it without problems. Quite a few estimators lack support for missing values, and missing values are ubiquitous, so it is worth trying to find ways to improve the user experience. I would suggest keeping the issue open for discussion.

But I agree that, at the least, the default should probably be to output NaNs where there are missing values, as is currently the case.

I agree that this depends on the downstream classifier, but I think having an option to "fill missing values" would be a nice feature, as the goal of TableVectorizer is to take a table and "vectorize" it. (That is why this is a feature request and not a bug. ;) )

I disagree with your desire to have an option to do it automatically: there is no good default and it tends to depend a lot on the downstream estimator.

If you really want good behavior by default, you should really use HistGradientBoosting, which is very robust to many things.

And, besides, it's not very hard to write:

make_pipeline(TableVectorizer(), SimpleImputer(), RandomForestClassifier())

Not much more difficult than:

make_pipeline(TableVectorizer(), RandomForestClassifier())
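For reference, here is what the extra SimpleImputer step does on its own: by default it replaces each NaN with the mean of its column (the small array below is just an illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer's default strategy fills NaNs with the column mean.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_filled = SimpleImputer().fit_transform(X)
print(X_filled)  # NaNs replaced by column means: 2.0 and 6.0
```

Placed between TableVectorizer and the final estimator, it guarantees the estimator never sees a NaN.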

Yes, it is easy to fix (I used numerical_transformer=SimpleImputer), but I found this behavior unexpected: I thought (without reading the docs) that I would get a vector out of this transformer.
I find the name confusing, as it does not fully vectorize the table (to me, a vector should have a consistent type for all its entries).
This class only acts on the categories, not on the numerical values, so maybe it would be better to call it CategoryVectorizer, to make it clear it does not touch the numerics.

my 2cts :)