MaxHalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA

Home page: https://maxhalford.github.io/prince


Unable to transform test data after MCA fitting training data

anishafluffy opened this issue:

Hi, I'm having an issue that I've seen other people report before: I'm unable to transform a test dataset after MCA has been fit on a training dataset. It seems to be a shape issue. Here is the code to reproduce the error.

1. Create an example dataframe:

```python
import pandas as pd

X = pd.DataFrame(
    data=[
        ['A', 'A', 'A', 'm'],
        ['A', 'A', 'A', 'f'],
        ['B', 'A', 'B', 'm'],
        ['B', 'A', 'B', 'm'],
        ['B', 'B', 'B', 'f'],
        ['B', 'B', 'A', 'f'],
    ],
    columns=['feature1', 'feature2', 'feature3', 'feature4'],
)
```

2. Fit on the training data:

```python
from prince import MCA

mca = MCA()
mca.fit(X[:4])
```

3. Transform the test data:

```python
mca.transform(X[4:])
```

Error message:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-8524ce01518f> in <module>
      3 mca.fit(X[:4])
      4 
----> 5 mca.transform(X[4:])

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/prince/mca.py in transform(self, X)
     48         if self.check_input:
     49             utils.check_array(X, dtype=[str, np.number])
---> 50         return self.row_coordinates(X)
     51 
     52     def plot_coordinates(self, X, ax=None, figsize=(6, 6), x_component=0, y_component=1,

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/prince/mca.py in row_coordinates(self, X)
     36         if not isinstance(X, pd.DataFrame):
     37             X = pd.DataFrame(X)
---> 38         return super().row_coordinates(pd.get_dummies(X))
     39 
     40     def column_coordinates(self, X):

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/prince/ca.py in row_coordinates(self, X)
    132 
    133         return pd.DataFrame(
--> 134             data=X @ sparse.diags(self.col_masses_.to_numpy() ** -0.5) @ self.V_.T,
    135             index=row_names
    136         )

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/scipy/sparse/base.py in __rmatmul__(self, other)
    568             raise ValueError("Scalar operands are not allowed, "
    569                              "use '*' instead")
--> 570         return self.__rmul__(other)
    571 
    572     ####################

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/scipy/sparse/base.py in __rmul__(self, other)
    552             except AttributeError:
    553                 tr = np.asarray(other).transpose()
--> 554             return (self.transpose() * tr).transpose()
    555 
    556     #####################################

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/scipy/sparse/base.py in __mul__(self, other)
    518 
    519             if other.shape[0] != self.shape[1]:
--> 520                 raise ValueError('dimension mismatch')
    521 
    522             result = self._mul_multivector(np.asarray(other))

ValueError: dimension mismatch
```
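
A quick check makes the mismatch visible: as the traceback shows, MCA one-hot encodes internally with pd.get_dummies, and the two slices of X produce differently shaped dummy matrices:

```python
# The train slice yields 7 dummy columns but the test slice only 5, so the
# projection built at fit time no longer lines up with the transformed matrix.
print(pd.get_dummies(X[:4]).shape)  # (4, 7)
print(pd.get_dummies(X[4:]).shape)  # (2, 5)
```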

If the train and test datasets have the same shape, as below, I get no error.

```python
mca = MCA()
mca.fit(X[:3])
mca.transform(X[3:])
```

My guess is that with prince you can apply the transform function only to the same data set that you used to fit the model.
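
Note that this second case only passes by accident: both halves happen to encode to seven dummy columns, but the columns themselves differ, so the shape check succeeds while the coordinates are computed against mismatched categories. A quick check:

```python
print(pd.get_dummies(X[:3]).columns.tolist())
# ['feature1_A', 'feature1_B', 'feature2_A', 'feature3_A', 'feature3_B', 'feature4_f', 'feature4_m']
print(pd.get_dummies(X[3:]).columns.tolist())
# ['feature1_B', 'feature2_A', 'feature2_B', 'feature3_A', 'feature3_B', 'feature4_f', 'feature4_m']
```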

See if this helps:

To apply clustering to a test set, you might want to perform an additional logistic regression as a workaround:

1. Same as you did

2. Same as you did

3. Transform the train set:

```python
X_train_transformed = mca.transform(X_train)
```

4. Use your favorite technique to determine the number of clusters you want to create; plotting a dendrogram is one such technique (see the sketch after the next snippet).

5. Perform clustering on the transformed data set:

```python
from scipy.cluster.hierarchy import fcluster, linkage

cluster_nums = fcluster(
    linkage(
        y=X_train_transformed,
        method='ward',
        metric='euclidean'
    ),
    t=5,  # the max number of clusters you want to create
    criterion='maxclust'
)

y_train = cluster_nums.astype(str)
```
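
For step 4, a dendrogram of the same ward linkage can be plotted like this (a minimal sketch; it assumes matplotlib is available):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the same ward linkage as above and visualise the merge hierarchy;
# long vertical gaps suggest natural cut points for the number of clusters.
Z = linkage(y=X_train_transformed, method='ward', metric='euclidean')
dendrogram(Z)
plt.xlabel('sample index')
plt.ylabel('merge distance')
plt.show()
```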
6. Perform one-hot label encoding (dummy encoding) on both the X_train and X_test data sets:

```python
X_train_dummies = pd.get_dummies(X_train)
X_test_dummies = pd.get_dummies(X_test)
```
7. Perform logistic regression on X_train_dummies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_iter': [10000],
    'multi_class': [
        'auto',
        # 'multinomial',
        # 'ovr'
    ],
    'penalty': [
        'l1',
        # 'l2',
        # 'elasticnet'
    ],
    'solver': [
        # 'newton-cg',
        # 'sag',
        'saga',  # 'saga' is the only solver that supports all the penalties listed above
        # 'lbfgs',
        # 'liblinear'
    ],
    'C': np.logspace(-5, 5, 25)
}
clf_log_reg = GridSearchCV(
    LogisticRegression(),
    param_grid=param_grid,
    cv=5,
    verbose=True,
    n_jobs=-1
).fit(X_train_dummies, y_train)
```
8. Apply the results to the test set:

```python
X_test['cluster'] = clf_log_reg.best_estimator_.predict(X_test_dummies).astype(str)
```

I hope this helps

I'm also facing this issue and haven't been able to overcome it yet.
The library's usefulness decreases if we can't apply it to another test set.

@kirisakow your answer is a good idea, but it's not as simple as calling transform(), which is what one would expect.

@anishafluffy
I have been able to overcome this issue.

After inspecting mca.py in the repo, I noticed that one of the first things it does is one-hot encoding with pd.get_dummies(X) (line 24).
The dimension mismatch error occurs when the test set has unseen labels, or is missing categories that were present in the training set.

The solution was:

1. One-hot encode the dataset with pd.get_dummies before using mca.fit()
2. Save the pd.get_dummies metadata (the resulting column names)
3. One-hot encode the test set with pd.get_dummies
4. Drop unseen labels based on the saved metadata
5. Create the encoded columns not present in the test set, filling them with 0
6. Use the mca.transform() method. Now it works!
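
Here is a minimal sketch of those steps on the example data from this thread, using DataFrame.reindex to drop unseen dummy columns and add the missing ones in a single call (the one_hot=False flag comes from newer prince versions, as in the example further down):

```python
import pandas as pd
from prince import MCA

# Steps 1-2: one-hot encode the training set and keep its columns as the "metadata"
X_train_dummies = pd.get_dummies(X[:4])
train_columns = X_train_dummies.columns

# Steps 3-5: encode the test set, then align it with the training columns;
# reindex drops unseen labels and adds missing categories filled with 0
X_test_dummies = pd.get_dummies(X[4:]).reindex(columns=train_columns, fill_value=0)

# Step 6: fit and transform on the aligned dummy matrices
mca = MCA(one_hot=False)  # the data is already dummy-encoded (newer prince versions)
mca.fit(X_train_dummies)
mca.transform(X_test_dummies)
```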

Hello there 👋

I apologise for not answering earlier; I was no longer maintaining Prince. However, I have just refactored the entire codebase, and this refactoring should have fixed many bugs.

I don't have the time and energy to check whether this fixes your issue, but there is a good chance it does. Feel free to reopen this issue if the problem persists after installing the new version, that is, version 0.8.0 and onwards.

> @anishafluffy I have been able to overcome this issue. […]

Good shout. Another option is just to use the model object's MCA.active_cols attribute, then append the columns that are not present in the test dataset and set them to 0; see the sketch below.
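
A minimal sketch of that idea; it assumes active_cols holds the dummy column names seen during fit, as the comment above describes (the attribute name is taken from that comment, not verified against a specific prince version):

```python
import pandas as pd

# Assumption (from the comment above): mca.active_cols lists the one-hot
# columns the model was fitted on.
X_test_dummies = pd.get_dummies(X_test)

# Append the fit-time columns missing from the test set, filled with 0,
# and drop any columns the model never saw, in one step.
X_test_aligned = X_test_dummies.reindex(columns=mca.active_cols, fill_value=0)

# Depending on the prince version, the model may need one_hot=False so that
# transform does not re-encode the already-dummied frame.
coords = mca.transform(X_test_aligned)
```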


This is my version of the initial example after following cnmoro's approach:

```python
import pandas as pd
import prince  # 0.13.0
from prince import MCA

X = pd.DataFrame(
    data=[
        ['A', 'A', 'A', 'm'],
        ['A', 'A', 'A', 'f'],
        ['B', 'A', 'B', 'm'],
        ['B', 'A', 'B', 'm'],
        ['B', 'B', 'B', 'f'],
        ['B', 'B', 'A', 'f'],
    ],
    columns=['feature1', 'feature2', 'feature3', 'feature4'],
)

# training data
X0 = X[:4]
Xd = pd.get_dummies(X0)

# test data
X_test = X[4:]
Xtd = pd.get_dummies(X_test)

# drop labels unseen at fit time, add the training columns missing from the
# test set (filled with 0), and match the training column order
unseen_labels = [col for col in Xtd.columns if col not in Xd.columns]
Xtd = Xtd.drop(columns=unseen_labels)
for name in [col for col in Xd.columns if col not in Xtd.columns]:
    Xtd[name] = 0
Xtd = Xtd[Xd.columns]

mca = MCA(one_hot=False)
mca = mca.fit(Xd)
mca.transform(Xtd)
```
