MaxHalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA

Home page: https://maxhalford.github.io/prince


Unable to transform test data after MCA fitting training data

anishafluffy opened this issue:

Hi, I'm having an issue that I've seen other people report before: I'm unable to transform a test dataset after MCA has been fit on a training dataset. It seems to be a shape issue. Here is the code to reproduce the error.

1. Create an example dataframe:

```python
import pandas as pd

X = pd.DataFrame(
    data=[
        ['A', 'A', 'A', 'm'],
        ['A', 'A', 'A', 'f'],
        ['B', 'A', 'B', 'm'],
        ['B', 'A', 'B', 'm'],
        ['B', 'B', 'B', 'f'],
        ['B', 'B', 'A', 'f'],
    ],
    columns=['feature1', 'feature2', 'feature3', 'feature4'],
)
```

2. Fit on the training data:

```python
from prince import MCA

mca = MCA()
mca.fit(X[:4])
```

3. Transform the test data:

```python
mca.transform(X[4:])
```

Error message:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-8524ce01518f> in <module>
      3 mca.fit(X[:4])
      4 
----> 5 mca.transform(X[4:])

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/prince/mca.py in transform(self, X)
     48         if self.check_input:
     49             utils.check_array(X, dtype=[str, np.number])
---> 50         return self.row_coordinates(X)
     51 
     52     def plot_coordinates(self, X, ax=None, figsize=(6, 6), x_component=0, y_component=1,

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/prince/mca.py in row_coordinates(self, X)
     36         if not isinstance(X, pd.DataFrame):
     37             X = pd.DataFrame(X)
---> 38         return super().row_coordinates(pd.get_dummies(X))
     39 
     40     def column_coordinates(self, X):

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/prince/ca.py in row_coordinates(self, X)
    132 
    133         return pd.DataFrame(
--> 134             data=X @ sparse.diags(self.col_masses_.to_numpy() ** -0.5) @ self.V_.T,
    135             index=row_names
    136         )

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/scipy/sparse/base.py in __rmatmul__(self, other)
    568             raise ValueError("Scalar operands are not allowed, "
    569                              "use '*' instead")
--> 570         return self.__rmul__(other)
    571 
    572     ####################

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/scipy/sparse/base.py in __rmul__(self, other)
    552             except AttributeError:
    553                 tr = np.asarray(other).transpose()
--> 554             return (self.transpose() * tr).transpose()
    555 
    556     #####################################

~/.conda/envs/nonrootenv/lib/python3.7/site-packages/scipy/sparse/base.py in __mul__(self, other)
    518 
    519             if other.shape[0] != self.shape[1]:
--> 520                 raise ValueError('dimension mismatch')
    521 
    522             result = self._mul_multivector(np.asarray(other))

ValueError: dimension mismatch
```
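
A quick check makes the mismatch visible: as the traceback shows, MCA one-hot encodes internally with pd.get_dummies, and the two slices of X produce differently shaped dummy matrices:

```python
# The train slice yields 7 dummy columns but the test slice only 5, so the
# projection built at fit time no longer lines up with the transformed matrix.
print(pd.get_dummies(X[:4]).shape)  # (4, 7)
print(pd.get_dummies(X[4:]).shape)  # (2, 5)
```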

If the train and test datasets have the same shape, as below, I get no error.

```python
mca = MCA()
mca.fit(X[:3])
mca.transform(X[3:])
```

My guess is that with prince you can apply the transform function only to the same data set that you used to fit the model.
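
Note that this second case only passes by accident: both halves happen to encode to seven dummy columns, but the columns themselves differ, so the shape check succeeds while the coordinates are computed against mismatched categories. A quick check:

```python
print(pd.get_dummies(X[:3]).columns.tolist())
# ['feature1_A', 'feature1_B', 'feature2_A', 'feature3_A', 'feature3_B', 'feature4_f', 'feature4_m']
print(pd.get_dummies(X[3:]).columns.tolist())
# ['feature1_B', 'feature2_A', 'feature2_B', 'feature3_A', 'feature3_B', 'feature4_f', 'feature4_m']
```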

See if this helps:

To apply clustering to a test set, you might want to perform an additional logistic regression as a workaround:

1. Same as you did

2. Same as you did

3. Transform the train set:

```python
X_train_transformed = mca.transform(X_train)
```

4. Use your favorite technique to determine the number of clusters you want to create; plotting a dendrogram is one such technique (see the sketch after the next snippet).

5. Perform clustering on the transformed data set:

```python
from scipy.cluster.hierarchy import fcluster, linkage

cluster_nums = fcluster(
    linkage(
        y=X_train_transformed,
        method='ward',
        metric='euclidean'
    ),
    t=5,  # the max number of clusters you want to create
    criterion='maxclust'
)

y_train = cluster_nums.astype(str)
```
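
For step 4, a dendrogram of the same ward linkage can be plotted like this (a minimal sketch; it assumes matplotlib is available):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the same ward linkage as above and visualise the merge hierarchy;
# long vertical gaps suggest natural cut points for the number of clusters.
Z = linkage(y=X_train_transformed, method='ward', metric='euclidean')
dendrogram(Z)
plt.xlabel('sample index')
plt.ylabel('merge distance')
plt.show()
```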
6. Perform one-hot label encoding (dummy encoding) on both the X_train and X_test data sets:

```python
X_train_dummies = pd.get_dummies(X_train)
X_test_dummies = pd.get_dummies(X_test)
```
7. Perform logistic regression on X_train_dummies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_iter': [10000],
    'multi_class': [
        'auto',
        # 'multinomial',
        # 'ovr'
    ],
    'penalty': [
        'l1',
        # 'l2',
        # 'elasticnet'
    ],
    'solver': [
        # 'newton-cg',
        # 'sag',
        'saga',  # 'saga' is the only solver that supports all the penalties listed above
        # 'lbfgs',
        # 'liblinear'
    ],
    'C': np.logspace(-5, 5, 25)
}
clf_log_reg = GridSearchCV(
    LogisticRegression(),
    param_grid=param_grid,
    cv=5,
    verbose=True,
    n_jobs=-1
).fit(X_train_dummies, y_train)
```
8. Apply the results to the test set:

```python
X_test['cluster'] = clf_log_reg.best_estimator_.predict(X_test_dummies).astype(str)
```

I hope this helps

I'm also facing this issue and haven't been able to overcome it yet.
The library's usefulness decreases if we can't apply it to another test set.

@kirisakow your answer is a good idea, but it's not as simple as calling transform(), which is what one would expect.

@anishafluffy
I have been able to overcome this issue.

After inspecting mca.py in the repo, I noticed that one of the first things it does is one-hot encoding with pd.get_dummies(X) (line 24).
The dimension mismatch error occurs when the test set has unseen labels, or is missing categories that were present in the training set.

The solution was:

1. One-hot encode the dataset with pd.get_dummies before using mca.fit()
2. Save the pd.get_dummies metadata (the resulting column names)
3. One-hot encode the test set with pd.get_dummies
4. Drop unseen labels based on the saved metadata
5. Create the encoded columns not present in the test set, filling them with 0
6. Use the mca.transform() method. Now it works!
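
Here is a minimal sketch of those steps on the example data from this thread, using DataFrame.reindex to drop unseen dummy columns and add the missing ones in a single call (the one_hot=False flag comes from newer prince versions, as in the example further down):

```python
import pandas as pd
from prince import MCA

# Steps 1-2: one-hot encode the training set and keep its columns as the "metadata"
X_train_dummies = pd.get_dummies(X[:4])
train_columns = X_train_dummies.columns

# Steps 3-5: encode the test set, then align it with the training columns;
# reindex drops unseen labels and adds missing categories filled with 0
X_test_dummies = pd.get_dummies(X[4:]).reindex(columns=train_columns, fill_value=0)

# Step 6: fit and transform on the aligned dummy matrices
mca = MCA(one_hot=False)  # the data is already dummy-encoded (newer prince versions)
mca.fit(X_train_dummies)
mca.transform(X_test_dummies)
```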

Hello there 👋

I apologise for not answering earlier; I was no longer maintaining Prince. However, I have just refactored the entire codebase, and this refactoring should have fixed many bugs.

I don't have the time and energy to check whether this fixes your issue, but there is a good chance it does. Feel free to reopen this issue if the problem persists after installing the new version, that is, version 0.8.0 and onwards.

> @anishafluffy I have been able to overcome this issue. […]

Good shout. Another option is just to use the model object's MCA.active_cols attribute, then append the columns that are not present in the test dataset and set them to 0; see the sketch below.
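
A minimal sketch of that idea; it assumes active_cols holds the dummy column names seen during fit, as the comment above describes (the attribute name is taken from that comment, not verified against a specific prince version):

```python
import pandas as pd

# Assumption (from the comment above): mca.active_cols lists the one-hot
# columns the model was fitted on.
X_test_dummies = pd.get_dummies(X_test)

# Append the fit-time columns missing from the test set, filled with 0,
# and drop any columns the model never saw, in one step.
X_test_aligned = X_test_dummies.reindex(columns=mca.active_cols, fill_value=0)

# Depending on the prince version, the model may need one_hot=False so that
# transform does not re-encode the already-dummied frame.
coords = mca.transform(X_test_aligned)
```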


This is my version of the initial example after following cnmoro's approach:

```python
import pandas as pd
import prince  # 0.13.0
from prince import MCA

X = pd.DataFrame(
    data=[
        ['A', 'A', 'A', 'm'],
        ['A', 'A', 'A', 'f'],
        ['B', 'A', 'B', 'm'],
        ['B', 'A', 'B', 'm'],
        ['B', 'B', 'B', 'f'],
        ['B', 'B', 'A', 'f'],
    ],
    columns=['feature1', 'feature2', 'feature3', 'feature4'],
)

# training data
X0 = X[:4]
Xd = pd.get_dummies(X0)

# test data
X_test = X[4:]
Xtd = pd.get_dummies(X_test)

# drop labels unseen at fit time, add the training columns missing from the
# test set (filled with 0), and match the training column order
unseen_labels = [col for col in Xtd.columns if col not in Xd.columns]
Xtd = Xtd.drop(columns=unseen_labels)
for name in [col for col in Xd.columns if col not in Xtd.columns]:
    Xtd[name] = 0
Xtd = Xtd[Xd.columns]

mca = MCA(one_hot=False)
mca = mca.fit(Xd)
mca.transform(Xtd)
```
