TableVectorizer imputing logic is confusing
Vincent-Maladiere opened this issue · comments
While working on #819, I found that the imputing behavior we currently use in TableVectorizer.auto_cast for categorical columns is confusing.
1. Imputing in two directions
As stated by this comment, we currently:
- Replace the "almost missing" strings with np.nan for all column dtypes
- Then, for non-numeric dtypes (string, categorical, and object), we do the opposite: we replace np.nan with the string "missing"

Why do we need to replace np.nan with "missing"?
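The two replacement steps can be sketched in plain pandas. This is an illustration, not skrub's actual internals: the "missing" sentinel comes from the thread above, but the list of "almost missing" strings here is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical list of "almost missing" strings; skrub's real list may differ.
ALMOST_MISSING = ["", "N/A", "NA", "nan", "?"]

df = pd.DataFrame({"city": ["Paris", "N/A", np.nan, "Lyon"]})

# Step 1: replace "almost missing" strings with np.nan (all column dtypes).
df = df.replace(ALMOST_MISSING, np.nan)

# Step 2: for non-numeric columns only, go the opposite way and
# replace np.nan with the string "missing".
obj_cols = df.select_dtypes(include=["object", "category"]).columns
df[obj_cols] = df[obj_cols].fillna("missing")

print(df["city"].tolist())  # ['Paris', 'missing', 'missing', 'Lyon']
```

The round trip is what makes the logic surprising: values first become np.nan, then immediately become a non-missing category again.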
2. Replacing non-numeric values with np.nan when trained on numeric

When trained on a numerical column, TableVectorizer replaces the string/object values of this column with np.nan during predict:
import numpy as np
import pandas as pd
from skrub import TableVectorizer
df_train = pd.DataFrame(dict(a=[np.nan, 1]))
df_test = pd.DataFrame(dict(a=["a", "b"]))
tv = TableVectorizer().fit(df_train)
tv.transform(df_test)
# /Users/vincentmaladiere/INRIA/skrub/skrub/_table_vectorizer.py:881: UserWarning:
# Value 'a' could not be converted to inferred type float64 in column 'a'.
# Such values will be replaced by NaN.
# X = self._apply_cast(X)
# array([[nan],
# [nan]])
This is dangerous because it might silence critical errors for the user.
We either need to keep the data and let downstream estimators raise an error or raise it ourselves.
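The difference between the current behavior and the stricter alternative can be sketched with plain pandas (pd.to_numeric is a stand-in here, not the code path skrub actually uses):

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame(dict(a=["a", "b"]))

# Roughly the current behavior: unconvertible values are coerced to NaN.
coerced = pd.to_numeric(df_test["a"], errors="coerce")
print(coerced.tolist())  # [nan, nan]

# Stricter alternative: raise so the user sees the problem at predict time.
try:
    pd.to_numeric(df_test["a"], errors="raise")
except ValueError as exc:
    print("conversion failed:", exc)
```

An errors="coerce" vs errors="raise" switch is essentially the trade-off being discussed: silent NaNs versus a loud failure.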
For the GapEncoder, missing values should probably be all encoded as vectors of zeros.
Does this imply letting the transformers deal with the missing categorical values themselves?
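Encoding missing categories as all-zero vectors could look like the following sketch, where one-hot encoding stands in for GapEncoder's dense output (this is not GapEncoder's actual implementation):

```python
import numpy as np
import pandas as pd

X = pd.Series(["x", None, "y"], name="a")
mask = X.isna().to_numpy()

# One-hot encoding stands in for GapEncoder's dense output here.
codes = pd.get_dummies(X).to_numpy().astype(float)

# Force rows with missing input to all-zero vectors.
codes[mask] = 0.0
print(codes)
# [[1. 0.]
#  [0. 0.]
#  [0. 1.]]
```

The same masking trick works for any transformer: fit and transform only the non-missing rows, and leave the rows flagged by the mask as zeros.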
Right, but erroring at predict can mean crashing the production (or prediction :D) server
I like this thought a lot because it paves the way for a config file "local or staging vs production" where production would be more permissive to avoid crashing. This could be enabled across all skrub.
We cannot: the dtype does not allow this.
What do you mean? Since we trigger a warning, we could raise an error, couldn't we?
After an IRL meeting, we decided to:
- Remove the imputing logic entirely
- Leave the "Replacing non-numeric values with np.nan when trained on numeric" issue as it is, and determine in the medium term what the most robust configuration for a scikit-learn pipeline is. Our objective is to have an option where the pipeline never crashes.