skrub-data / skrub

Prepping tables for machine learning

Home Page: https://skrub-data.org/

`AttributeError` in `SimilarityEncoder` `inverse_transform`

jeromedockes opened this issue

Describe the bug

Calling `inverse_transform` on a fitted `SimilarityEncoder` raises an `AttributeError`.

Steps/Code to Reproduce

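from skrub import SimilarityEncoder

encoder = SimilarityEncoder()
X = [["A"], ["B"]]  # one column with two categories
encoder.fit(X)
# Two categories give two output columns, so each row to invert has two values.
encoder.inverse_transform([[1.0, 1.0], [1.0, 1.0]])  # raises AttributeError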

Expected Results

no error

Actual Results

Traceback (most recent call last):
  File "/tmp/12-113221.py", line 7, in <module>
    encoder.inverse_transform([[1.0, 1.0], [1.0, 1.0]])
  File "/home/jerome/workspace/backedup_repositories/scikit-learn/sklearn/preprocessing/_encoders.py", line 1100, in inverse_transform
    n_features_out = np.sum(self._n_features_outs)
                            ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'SimilarityEncoder' object has no attribute '_n_features_outs'. Did you mean: 'n_features_in_'?
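
The traceback points at a private attribute that the inherited `inverse_transform` reads but that was never set on the instance. A minimal sketch of how that can happen, using hypothetical stand-in classes (`ParentEncoder`, `ChildEncoder` and `try_inverse` are illustrative names, not scikit-learn or skrub code): the parent's `fit` sets the private attribute, and a subclass that reimplements `fit` without doing the same bookkeeping breaks the inherited method.

```python
class ParentEncoder:
    """Hypothetical stand-in for a base encoder; not scikit-learn's actual code."""

    def fit(self, X):
        # One sorted list of categories per input column.
        self.categories_ = [sorted(set(col)) for col in zip(*X)]
        # Private bookkeeping that inverse_transform relies on later.
        self._n_features_outs = [len(cats) for cats in self.categories_]
        return self

    def inverse_transform(self, X):
        # Raises AttributeError if fit() never set the private attribute.
        n_features_out = sum(self._n_features_outs)
        if any(len(row) != n_features_out for row in X):
            raise ValueError("unexpected number of columns")
        return n_features_out


class ChildEncoder(ParentEncoder):
    """Hypothetical subclass that reimplements fit() without calling super()."""

    def fit(self, X):
        self.categories_ = [sorted(set(col)) for col in zip(*X)]
        return self  # _n_features_outs is never set here


def try_inverse(encoder):
    """Return "ok" on success, or the name of the AttributeError raised."""
    try:
        encoder.inverse_transform([[1.0, 1.0], [1.0, 1.0]])
    except AttributeError as exc:
        return type(exc).__name__
    return "ok"


X = [["A"], ["B"]]
parent_result = try_inverse(ParentEncoder().fit(X))
child_result = try_inverse(ChildEncoder().fit(X))
print(parent_result, child_result)
```

In this toy version the parent succeeds while the child fails in the inherited method, which is the same shape as the error reported here.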

Versions

System:
    python: 3.11.5 (main, Aug 25 2023, 13:19:50) [GCC 11.4.0]
executable: /home/jerome/.virtualenvs/df/bin/python
   machine: Linux-6.2.0-33-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.4.dev0
          pip: 23.2.1
   setuptools: 65.5.0
        numpy: 1.26.0
        scipy: 1.11.2
       Cython: None
       pandas: 2.1.0
   matplotlib: 3.7.3
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Prescott

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Prescott

       user_api: openmp
   internal_api: openmp
    num_threads: 4
         prefix: libgomp
       filepath: /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
        version: None
0.0.1.dev0

I think something like

diff --git a/skrub/_similarity_encoder.py b/skrub/_similarity_encoder.py
index 6e9a67d0..9ab3eb15 100644
--- a/skrub/_similarity_encoder.py
+++ b/skrub/_similarity_encoder.py
@@ -417,6 +417,7 @@ class SimilarityEncoder(OneHotEncoder):
         else:
             self.drop_idx_ = self._compute_drop_idx()
 
+        self._n_features_outs = self._compute_n_features_outs()
         return self
 
     def transform(self, X: ArrayLike, fast: bool = True) -> NDArray:

or equivalently

diff --git a/skrub/_similarity_encoder.py b/skrub/_similarity_encoder.py
index 6e9a67d0..7eda8b40 100644
--- a/skrub/_similarity_encoder.py
+++ b/skrub/_similarity_encoder.py
@@ -417,6 +417,7 @@ class SimilarityEncoder(OneHotEncoder):
         else:
             self.drop_idx_ = self._compute_drop_idx()
 
+        self._n_features_outs = list(map(len, self.categories_))
         return self
 
     def transform(self, X: ArrayLike, fast: bool = True) -> NDArray:

would fix it. However, the SimilarityEncoder seems to have the same problem the TableVectorizer had: it subclasses the OneHotEncoder, which does not look like it is meant to be subclassed. As a result, it has to understand, manipulate and use undocumented private attributes and functions of the OneHotEncoder. It also has to expose a drop_idx_ attribute which, AFAICT, is always None and does not really make sense for the SimilarityEncoder, which has no "drop" parameter.
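
To illustrate why the two one-line fixes should agree, here is a rough sketch under stated assumptions. It assumes the number of output columns per feature is the number of categories, minus one for each feature with a dropped category, and it ignores any infrequent-category grouping. Both helper functions are hypothetical simplifications and are not the private scikit-learn implementation.

```python
def n_features_outs_from_categories(categories):
    """Second proposed fix: one output column per category."""
    return list(map(len, categories))


def n_features_outs_with_drop(categories, drop_idx=None):
    """Assumed simplified model of the first fix.

    Starts from the number of categories per feature and subtracts one
    for each feature whose entry in drop_idx is not None.
    """
    counts = [len(cats) for cats in categories]
    if drop_idx is not None:
        for i, idx in enumerate(drop_idx):
            if idx is not None:
                counts[i] -= 1
    return counts


# Categories learned from X = [["A"], ["B"]]: one feature, two categories.
categories = [["A", "B"]]

simple = n_features_outs_from_categories(categories)
general = n_features_outs_with_drop(categories, drop_idx=None)
total_columns = sum(simple)
print(simple, general, total_columns)
```

With nothing dropped, both helpers return the same list, and the total matches the two columns passed to `inverse_transform` in the reproduction above. If a drop index were ever set, the two would differ, which is one reason the always-None drop_idx_ is worth questioning.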