TableVectorizer imputing logic is confusing
Vincent-Maladiere opened this issue · comments
While working on #819, I found that the imputing behavior we currently use in TableVectorizer.auto_cast for categorical columns is confusing.
1. Imputing in two directions
As stated by this comment, we currently:
- Replace the "almost missing" strings with np.nan for all column dtypes
- Then, for non-numeric dtypes (string, categorical, and object), we do the opposite: we replace np.nan with the string "missing"

Why do we need to replace np.nan with "missing"?
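The two replacement steps can be sketched in plain pandas. This is an illustration, not skrub's actual internals: the "missing" sentinel comes from the thread above, but the list of "almost missing" strings here is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical list of "almost missing" strings; skrub's real list may differ.
ALMOST_MISSING = ["", "N/A", "NA", "nan", "?"]

df = pd.DataFrame({"city": ["Paris", "N/A", np.nan, "Lyon"]})

# Step 1: replace "almost missing" strings with np.nan (all column dtypes).
df = df.replace(ALMOST_MISSING, np.nan)

# Step 2: for non-numeric columns only, go the opposite way and
# replace np.nan with the string "missing".
obj_cols = df.select_dtypes(include=["object", "category"]).columns
df[obj_cols] = df[obj_cols].fillna("missing")

print(df["city"].tolist())  # ['Paris', 'missing', 'missing', 'Lyon']
```

The round trip is what makes the logic surprising: values first become np.nan, then immediately become a non-missing category again.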
2. Replacing non-numeric values with np.nan when trained on numeric

When trained on a numerical column, TableVectorizer replaces the string/object values of this column with np.nan during predict:
import numpy as np
import pandas as pd
from skrub import TableVectorizer
df_train = pd.DataFrame(dict(a=[np.nan, 1]))
df_test = pd.DataFrame(dict(a=["a", "b"]))
tv = TableVectorizer().fit(df_train)
tv.transform(df_test)
# /Users/vincentmaladiere/INRIA/skrub/skrub/_table_vectorizer.py:881: UserWarning:
# Value 'a' could not be converted to inferred type float64 in column 'a'.
# Such values will be replaced by NaN.
# X = self._apply_cast(X)
# array([[nan],
# [nan]])
This is dangerous because it might silence critical errors for the user.
We either need to keep the data and let downstream estimators raise an error or raise it ourselves.
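The difference between the current behavior and the stricter alternative can be sketched with plain pandas (pd.to_numeric is a stand-in here, not the code path skrub actually uses):

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame(dict(a=["a", "b"]))

# Roughly the current behavior: unconvertible values are coerced to NaN.
coerced = pd.to_numeric(df_test["a"], errors="coerce")
print(coerced.tolist())  # [nan, nan]

# Stricter alternative: raise so the user sees the problem at predict time.
try:
    pd.to_numeric(df_test["a"], errors="raise")
except ValueError as exc:
    print("conversion failed:", exc)
```

An errors="coerce" vs errors="raise" switch is essentially the trade-off being discussed: silent NaNs versus a loud failure.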
For the GapEncoder, missing values should probably be all encoded as vectors of zeros.
Does this imply letting the transformers deal with the missing categorical values themselves?
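Encoding missing categories as all-zero vectors could look like the following sketch, where one-hot encoding stands in for GapEncoder's dense output (this is not GapEncoder's actual implementation):

```python
import numpy as np
import pandas as pd

X = pd.Series(["x", None, "y"], name="a")
mask = X.isna().to_numpy()

# One-hot encoding stands in for GapEncoder's dense output here.
codes = pd.get_dummies(X).to_numpy().astype(float)

# Force rows with missing input to all-zero vectors.
codes[mask] = 0.0
print(codes)
# [[1. 0.]
#  [0. 0.]
#  [0. 1.]]
```

The same masking trick works for any transformer: fit and transform only the non-missing rows, and leave the rows flagged by the mask as zeros.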
Right, but erroring at predict can mean crashing the production (or prediction :D) server
I like this thought a lot because it paves the way for a config file "local or staging vs production" where production would be more permissive to avoid crashing. This could be enabled across all skrub.
We cannot: the dtype does not allow this.
What do you mean? Since we trigger a warning, we could raise an error, couldn't we?
After an IRL meeting, we decided to:
- Remove the imputing logic entirely
- Leave the "Replacing non-numeric values with np.nan when trained on numeric" issue as it is, and determine in the medium term what the most robust configuration for a scikit-learn pipeline is. Our objective is to have an option where the pipeline never crashes.