Unable to transform test data after MCA fitting training data
anishafluffy opened this issue · comments
Hi, I'm having an issue that I've seen other people post issues about before. I'm unable to transform a test dataset after MCA is fit on a training dataset. It seems to be a shape issue. Here is the code to recreate the error.
- Create example dataframe
import pandas as pd
X = pd.DataFrame(
data=[
['A', 'A', 'A', 'm'],
['A', 'A', 'A', 'f'],
['B', 'A', 'B', 'm'],
['B', 'A', 'B', 'm'],
['B', 'B', 'B', 'f'],
['B', 'B', 'A', 'f']
],
columns=['feature1', 'feature2', 'feature3', 'feature4'])
- Fit on training data
from prince import MCA
mca = MCA()
mca.fit(X[:4])
- Transform on test data
mca.transform(X[4:])
Error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-8524ce01518f> in <module>
3 mca.fit(X[:4])
4
----> 5 mca.transform(X[4:])
~/.conda/envs/nonrootenv/lib/python3.7/site-packages/prince/mca.py in transform(self, X)
48 if self.check_input:
49 utils.check_array(X, dtype=[str, np.number])
---> 50 return self.row_coordinates(X)
51
52 def plot_coordinates(self, X, ax=None, figsize=(6, 6), x_component=0, y_component=1,
~/.conda/envs/nonrootenv/lib/python3.7/site-packages/prince/mca.py in row_coordinates(self, X)
36 if not isinstance(X, pd.DataFrame):
37 X = pd.DataFrame(X)
---> 38 return super().row_coordinates(pd.get_dummies(X))
39
40 def column_coordinates(self, X):
~/.conda/envs/nonrootenv/lib/python3.7/site-packages/prince/ca.py in row_coordinates(self, X)
132
133 return pd.DataFrame(
--> 134 data=X @ sparse.diags(self.col_masses_.to_numpy() ** -0.5) @ self.V_.T,
135 index=row_names
136 )
~/.conda/envs/nonrootenv/lib/python3.7/site-packages/scipy/sparse/base.py in __rmatmul__(self, other)
568 raise ValueError("Scalar operands are not allowed, "
569 "use '*' instead")
--> 570 return self.__rmul__(other)
571
572 ####################
~/.conda/envs/nonrootenv/lib/python3.7/site-packages/scipy/sparse/base.py in __rmul__(self, other)
552 except AttributeError:
553 tr = np.asarray(other).transpose()
--> 554 return (self.transpose() * tr).transpose()
555
556 #####################################
~/.conda/envs/nonrootenv/lib/python3.7/site-packages/scipy/sparse/base.py in __mul__(self, other)
518
519 if other.shape[0] != self.shape[1]:
--> 520 raise ValueError('dimension mismatch')
521
522 result = self._mul_multivector(np.asarray(other))
ValueError: dimension mismatch
If the train and test datasets are the same shape like below, I get no error.
mca = MCA()
mca.fit(X[:3])
mca.transform(X[3:])
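One way to see what differs between the two splits is to compare the one-hot columns that pd.get_dummies produces for each, since the traceback shows row_coordinates calling pd.get_dummies(X) before the matrix product. This is only a diagnostic sketch on the example dataframe; dummy_columns is a helper name made up here and is not part of prince.

```python
import pandas as pd

X = pd.DataFrame(
    data=[
        ['A', 'A', 'A', 'm'],
        ['A', 'A', 'A', 'f'],
        ['B', 'A', 'B', 'm'],
        ['B', 'A', 'B', 'm'],
        ['B', 'B', 'B', 'f'],
        ['B', 'B', 'A', 'f']
    ],
    columns=['feature1', 'feature2', 'feature3', 'feature4'])


def dummy_columns(df):
    # Hypothetical helper: names of the one-hot columns pandas creates for df
    return list(pd.get_dummies(df).columns)


# Failing split: fit on the first 4 rows, transform the last 2
fail_train, fail_test = dummy_columns(X[:4]), dummy_columns(X[4:])
print(len(fail_train), len(fail_test))  # 7 5

# Split that raises no error: 3 rows each
ok_train, ok_test = dummy_columns(X[:3]), dummy_columns(X[3:])
print(len(ok_train), len(ok_test))  # 7 7

# Same number of columns, but not the same columns
print(sorted(set(ok_train) ^ set(ok_test)))  # ['feature1_A', 'feature2_B']
```

So the 3/3 split seems to avoid the error only because both halves happen to produce seven dummy columns. The columns themselves still differ, so the coordinates returned in that case may not be meaningful either.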
My guess is that with prince you can apply the transform function only on the same data set that you used to fit the model.
See if this helps:
To apply clustering on a test set, you might want to perform an additional logistic regression as a workaround:
- Same as you did
- Same as you did
- Transform the train set
X_train_transformed = mca.transform(X_train)
- Use your favorite technique to determine the number of clusters you want to create; plotting a dendrogram is one such technique.
- Perform clustering on the transformed data set:
from scipy.cluster.hierarchy import fcluster, linkage
cluster_nums = fcluster(
linkage(
y=X_train_transformed,
method='ward',
metric='euclidean'
),
t=5, # here goes the max number of clusters you want to create
criterion='maxclust'
)
y_train = cluster_nums.astype(str)
- Perform one-hot label encoding (dummy label encoding) on both X_train and X_test data sets:
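X_train_dummies = pd.get_dummies(X_train)
# Align the test columns to the training columns so predict() sees the same features
X_test_dummies = pd.get_dummies(X_test).reindex(columns=X_train_dummies.columns, fill_value=0)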
- Perform logistic regression on X_train_dummies:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param_grid = {
'max_iter': [10000],
'multi_class': [
'auto',
# 'multinomial',
# 'ovr'
],
'penalty': [
'l1',
# 'l2',
# 'elasticnet'
],
'solver': [
# 'newton-cg',
# 'sag',
'saga',
# 'lbfgs',
# 'liblinear'
],
'C': np.logspace(-5, 5, 25)
}
clf_log_reg = GridSearchCV(
LogisticRegression(),
param_grid=param_grid,
cv=5,
verbose=True,
n_jobs=-1
).fit(X_train_dummies, y_train)
- Apply the results on the test set
X_test['cluster'] = clf_log_reg.best_estimator_.predict(X_test_dummies).astype(str)
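For the step about choosing the number of clusters, here is a minimal sketch of one possible approach. It runs on synthetic points standing in for X_train_transformed, and the "largest gap between merge heights" rule is an assumption added here for illustration, not something prescribed above.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in for X_train_transformed: two well-separated blobs
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(10, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(10, 2)),
])

Z = linkage(points, method='ward', metric='euclidean')

# Compute the dendrogram layout without drawing it;
# drop no_plot=True to draw it with matplotlib instead
tree = dendrogram(Z, no_plot=True)

# Gap heuristic (an assumption, not from the thread): cut the tree where the
# jump between consecutive merge heights is largest
heights = Z[:, 2]
i = int(np.argmax(np.diff(heights)))
n_clusters = len(points) - (i + 1)

labels = fcluster(Z, t=n_clusters, criterion='maxclust')
print(n_clusters, len(set(labels.tolist())))
```

The resulting n_clusters could then be used as the t argument in the fcluster call above, in place of the hard-coded 5.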
I hope this helps
I'm also facing this issue and have not been able to overcome it yet
Its usefulness decreases if we can't apply it on another test set
@kirisakow your answer is a good idea, but it is not as simple as calling transform(), which is what one would expect.
@anishafluffy
I have been able to overcome this issue.
After inspecting mca.py in the repo, I noticed that one of the first things it does is one-hot encode the input with pd.get_dummies(X) (line 24).
The dimension mismatch error occurs when the test set contains labels that were not seen during fitting, or is missing categories that were present in the training set.
The solution was:
1 - One-hot encode my dataset with pd.get_dummies before calling mca.fit()
2 - Save the pd.get_dummies metadata
3 - One-hot encode the test set with pd.get_dummies
4 - Drop unseen labels based on the saved metadata
5 - Create the encoded columns that are not present in the test set, filling them with 0
6 - Call mca.transform(). Now it works!
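Steps 1 to 5 can be sketched with pandas alone, as below. This assumes the "metadata" in step 2 is simply the list of encoded column names, and it uses DataFrame.reindex to do steps 4 and 5 in one call; the variable names are illustrative.

```python
import pandas as pd

X = pd.DataFrame(
    data=[
        ['A', 'A', 'A', 'm'],
        ['A', 'A', 'A', 'f'],
        ['B', 'A', 'B', 'm'],
        ['B', 'A', 'B', 'm'],
        ['B', 'B', 'B', 'f'],
        ['B', 'B', 'A', 'f']
    ],
    columns=['feature1', 'feature2', 'feature3', 'feature4'])
X_train, X_test = X[:4], X[4:]

# Steps 1-2: encode the training set and keep its column names
train_dummies = pd.get_dummies(X_train, dtype=int)
train_columns = train_dummies.columns

# Step 3: encode the test set on its own
test_dummies = pd.get_dummies(X_test, dtype=int)

# Steps 4-5: reindex drops columns unseen in training (here feature2_B)
# and adds the training columns missing from the test set, filled with 0
test_aligned = test_dummies.reindex(columns=train_columns, fill_value=0)

print(list(test_aligned.columns) == list(train_columns))  # True
print(test_aligned.shape)  # (2, 7)
```

The aligned frame is what would then go to mca.transform() in step 6. The fit and transform calls are left out because how the already-encoded input must be passed depends on the installed prince version.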
Hello there 👋
I apologise for not answering earlier. I was not maintaining Prince anymore. However, I have just refactored the entire codebase. This refactoring should have fixed many bugs.
I don’t have time and energy to check if this fixes your issue, but there is a good chance it does. Feel free to reopen this issue if the problem persists after installing the new version — that is, version 0.8.0 and onwards.
@cnmoro good shout. Another option is to use the fitted model object's MCA.active_cols, then append the columns that are not present in the test dataset and set them to 0.
This is my version of the initial example after following cnmoro's approach:
import pandas as pd
import prince # 0.13.0
X = pd.DataFrame(
data=[
['A', 'A', 'A', 'm'],
['A', 'A', 'A', 'f'],
['B', 'A', 'B', 'm'],
['B', 'A', 'B', 'm'],
['B', 'B', 'B', 'f'],
['B', 'B', 'A', 'f']
],
columns=['feature1', 'feature2', 'feature3', 'feature4'])
# training data
X0 = X[:4]
Xd = pd.get_dummies(X0)
# test data
X_test = X[4:]
Xtd = pd.get_dummies(X_test)
# Align the test columns with the training columns: drop labels unseen in
# training and add the training columns missing from the test set, filled with 0
Xtd = Xtd.reindex(columns=Xd.columns, fill_value=0).astype(int)
from prince import MCA
mca = MCA(one_hot=False)
mca = mca.fit(Xd)
mca.transform(Xtd)
@vcuziol Please annotate your code block properly in order to activate syntax highlighting.