`AttributeError` in `SimilarityEncoder` `inverse_transform`
jeromedockes opened this issue
Describe the bug
Calling inverse_transform on a fitted SimilarityEncoder raises an AttributeError.
Steps/Code to Reproduce
from skrub import SimilarityEncoder
encoder = SimilarityEncoder()
X = [["A"], ["B"]]
encoder.fit(X)
encoder.inverse_transform([[1.0, 1.0], [1.0, 1.0]])
Expected Results
No error is raised.
Actual Results
Traceback (most recent call last):
  File "/tmp/12-113221.py", line 7, in <module>
    encoder.inverse_transform([[1.0, 1.0], [1.0, 1.0]])
  File "/home/jerome/workspace/backedup_repositories/scikit-learn/sklearn/preprocessing/_encoders.py", line 1100, in inverse_transform
    n_features_out = np.sum(self._n_features_outs)
                            ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'SimilarityEncoder' object has no attribute '_n_features_outs'. Did you mean: 'n_features_in_'?
Versions
System:
python: 3.11.5 (main, Aug 25 2023, 13:19:50) [GCC 11.4.0]
executable: /home/jerome/.virtualenvs/df/bin/python
machine: Linux-6.2.0-33-generic-x86_64-with-glibc2.35
Python dependencies:
sklearn: 1.4.dev0
pip: 23.2.1
setuptools: 65.5.0
numpy: 1.26.0
scipy: 1.11.2
Cython: None
pandas: 2.1.0
matplotlib: 3.7.3
joblib: 1.3.2
threadpoolctl: 3.2.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 4
prefix: libopenblas
filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
version: 0.3.23.dev
threading_layer: pthreads
architecture: Prescott
user_api: blas
internal_api: openblas
num_threads: 4
prefix: libopenblas
filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
version: 0.3.21.dev
threading_layer: pthreads
architecture: Prescott
user_api: openmp
internal_api: openmp
num_threads: 4
prefix: libgomp
filepath: /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
version: None
0.0.1.dev0
I think something like
diff --git a/skrub/_similarity_encoder.py b/skrub/_similarity_encoder.py
index 6e9a67d0..9ab3eb15 100644
--- a/skrub/_similarity_encoder.py
+++ b/skrub/_similarity_encoder.py
@@ -417,6 +417,7 @@ class SimilarityEncoder(OneHotEncoder):
         else:
             self.drop_idx_ = self._compute_drop_idx()
 
+        self._n_features_outs = self._compute_n_features_outs()
         return self
 
     def transform(self, X: ArrayLike, fast: bool = True) -> NDArray:
or equivalently
diff --git a/skrub/_similarity_encoder.py b/skrub/_similarity_encoder.py
index 6e9a67d0..7eda8b40 100644
--- a/skrub/_similarity_encoder.py
+++ b/skrub/_similarity_encoder.py
@@ -417,6 +417,7 @@ class SimilarityEncoder(OneHotEncoder):
         else:
             self.drop_idx_ = self._compute_drop_idx()
 
+        self._n_features_outs = list(map(len, self.categories_))
         return self
 
     def transform(self, X: ArrayLike, fast: bool = True) -> NDArray:
would fix it. However, it seems the SimilarityEncoder may have the same problem the TableVectorizer had: it subclasses the OneHotEncoder, which does not look like it is meant to be subclassed. As a result, it has to understand, manipulate and use undocumented private attributes and functions of the OneHotEncoder. It also exposes a drop_idx_ attribute, which AFAICT is always None and does not really make sense for the SimilarityEncoder, since it has no "drop" parameter.
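
To make the failure mode concrete, here is a minimal pure-Python sketch of what seems to be going on. All three classes (ToyOneHotEncoder, BrokenSubclass, FixedSubclass) are hypothetical stand-ins written for this illustration; they are not the scikit-learn or skrub implementations. The sketch only assumes that the parent's inverse_transform reads a private attribute that the parent's own fit is responsible for setting.

```python
class ToyOneHotEncoder:
    """Hypothetical stand-in for a parent encoder (not scikit-learn's code)."""

    def fit(self, X):
        # Public fitted attribute: sorted unique values of each column.
        self.categories_ = [sorted(set(col)) for col in zip(*X)]
        # Private bookkeeping that inverse_transform silently relies on.
        self._n_features_outs = [len(cats) for cats in self.categories_]
        return self

    def inverse_transform(self, X):
        # Raises AttributeError if a subclass's fit never set the attribute.
        n_features_out = sum(self._n_features_outs)
        if any(len(row) != n_features_out for row in X):
            raise ValueError(f"expected {n_features_out} columns")
        decoded_rows = []
        for row in X:
            decoded, start = [], 0
            for cats, width in zip(self.categories_, self._n_features_outs):
                block = row[start:start + width]
                # Pick the category with the largest value in this block.
                decoded.append(cats[block.index(max(block))])
                start += width
            decoded_rows.append(decoded)
        return decoded_rows


class BrokenSubclass(ToyOneHotEncoder):
    def fit(self, X):
        # Re-implements fit without calling the parent's fit, so the
        # private attribute is never created.
        self.categories_ = [sorted(set(col)) for col in zip(*X)]
        return self


class FixedSubclass(BrokenSubclass):
    def fit(self, X):
        super().fit(X)
        # Same idea as the second diff: one output width per input column.
        self._n_features_outs = list(map(len, self.categories_))
        return self


X = [["A"], ["B"]]

broken_error = None
try:
    BrokenSubclass().fit(X).inverse_transform([[1.0, 1.0], [1.0, 1.0]])
except AttributeError as exc:
    broken_error = exc

decoded = FixedSubclass().fit(X).inverse_transform([[1.0, 0.0], [0.0, 1.0]])
```

In this toy version the subclass only works once it knows about, and maintains, a private attribute of its parent, which is the coupling described above. A design based on composition rather than inheritance would avoid depending on such private state, though whether that is the right choice for SimilarityEncoder is a separate question.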