MaxHalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA

Home Page:https://maxhalford.github.io/prince

Error while applying .transform()

nico695 opened this issue

This is the same error as the one documented in #56.

I tried downgrading to version 0.7.0 using the repository linked in that thread, but I still get the same dimensionality error.

Here is the code:

import numpy as np
import pandas as pd
X_n = pd.DataFrame(data=np.random.rand(10000, 2), columns=list('AB'))
X_c = pd.DataFrame(np.random.choice(list('abcde'), size=(10000, 4), replace=True), columns=list('CDEF'))
X = pd.concat([X_n, X_c], axis=1)

from prince import FAMD

famd = FAMD(n_components = 6, n_iter = 100)
famd.fit(X)

famd.transform(X.iloc[1:10,:])

I got the same error in versions 0.7.0 and 0.7.1:

ValueError: shapes (9,20) and (22,6) not aligned: 20 (dim 1) != 22 (dim 0)

I've run into this issue a few times, and it looks like it comes from how dummies are generated in _build_X_global. When the dataset you are transforming does not contain every category of each categorical variable seen in the larger original dataset, the resulting dummified dataset has fewer columns (in this case, 20 rather than 22).
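To make the column mismatch concrete, here is a small sketch using plain pandas. It assumes the dummies are built with something equivalent to `pd.get_dummies`, which this thread does not confirm; the frames `X_fit` and `X_new` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): the "fit" frame sees categories a, b, c in column C.
X_fit = pd.DataFrame({"C": ["a", "b", "c", "a"], "D": ["x", "y", "x", "y"]})

# The frame passed to transform only contains a and b in column C.
X_new = X_fit.iloc[:2]

fit_dummies = pd.get_dummies(X_fit)
new_dummies = pd.get_dummies(X_new)

print(list(fit_dummies.columns))  # ['C_a', 'C_b', 'C_c', 'D_x', 'D_y']
print(list(new_dummies.columns))  # ['C_a', 'C_b', 'D_x', 'D_y']

# One column is missing, so a matrix product with weights learned on
# fit_dummies would fail with a "shapes not aligned" error.
print(fit_dummies.shape[1], new_dummies.shape[1])  # 5 4
```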

The suggested fix for this (and for #56 and #116) is to store the dummified columns in the famd and mfa models. If a new dataset being transformed only has a subset of the categorical values, its dummified dataset should still have the right number of columns, with one or more columns being all zeroes. If a new dataset being transformed has new categorical values, transform should probably raise an error.
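A minimal sketch of that idea, outside of prince, could look like the following. `DummyEncoder` is a hypothetical helper written for this example and is not part of prince's API; it only illustrates remembering the dummy columns at fit time and reindexing to them at transform time.

```python
import pandas as pd


class DummyEncoder:
    """Hypothetical helper (not part of prince): remembers the dummy
    columns seen at fit time and reuses them at transform time."""

    def fit(self, X):
        # Store the full set of dummy columns produced by the training data.
        self.columns_ = pd.get_dummies(X, dtype=float).columns
        return self

    def transform(self, X):
        dummies = pd.get_dummies(X, dtype=float)
        # Categories never seen during fit cannot be mapped to a known column.
        unseen = dummies.columns.difference(self.columns_)
        if len(unseen) > 0:
            raise ValueError(f"Unseen categories: {list(unseen)}")
        # Missing categories become all-zero columns, in the original order.
        return dummies.reindex(columns=self.columns_, fill_value=0.0)


X_fit = pd.DataFrame({"C": ["a", "b", "c", "a"], "D": ["x", "y", "x", "y"]})
encoder = DummyEncoder().fit(X_fit)

# Only categories a and b appear here, but the output still has 5 columns.
subset = encoder.transform(X_fit.iloc[:2])
print(subset.shape)  # (2, 5)
```

With this approach the transformed subset always has the same width as the training dummies, so the downstream matrix product lines up.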

I had the same issue, so I had to make sure my train, validation, and test sets all contain the same categories for each categorical variable before fitting MCA, and drop the columns where they don't:

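# Flag each categorical column whose set of categories is identical in every split
keep = []
for clmn in X_train_cat.columns:
    train_cats = set(X_val_train_cat[clmn].unique())
    val_cats = set(X_val_test_cat[clmn].unique())
    test_cats = set(X_test_cat[clmn].unique())
    # True only when all three splits share exactly the same categories
    keep.append(train_cats == val_cats == test_cats)

# The boolean mask selects the columns that are safe to keep
keep_columns = X_train_cat.columns[keep]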

But that's obviously an awkward temp solution, just to make it work. The dummy matrix workaround @christophe-williams mentioned would be nice to have.
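A less lossy user-side alternative might be to pin the categories with a pandas categorical dtype, so that every split produces the same dummy columns without dropping any variable. This is only a sketch of the pandas behaviour: whether prince's internal encoding honours categorical dtypes is an assumption that this thread does not confirm, and the small frames below are made up.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy splits (hypothetical): the test split lacks several categories.
X_train_cat = pd.DataFrame({"C": ["a", "b", "c"], "D": ["x", "y", "x"]})
X_test_cat = pd.DataFrame({"C": ["a", "a"], "D": ["y", "y"]})

# Fix the categories of each column from the training split.
cat_dtypes = {
    col: CategoricalDtype(sorted(X_train_cat[col].unique()))
    for col in X_train_cat.columns
}
train_fixed = X_train_cat.astype(cat_dtypes)
test_fixed = X_test_cat.astype(cat_dtypes)

# get_dummies emits one column per declared category, observed or not.
train_cols = list(pd.get_dummies(train_fixed).columns)
test_cols = list(pd.get_dummies(test_fixed).columns)
print(train_cols == test_cols)  # True
```

Note that with this approach a value absent from the training categories becomes missing (NaN) after `astype`, so unseen categories would still need separate handling.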

Hello there 👋

I apologise for not answering earlier. I was not maintaining Prince anymore. However, I have just refactored the entire codebase. This refactoring should have fixed many bugs.

I don’t have time and energy to check if this fixes your issue, but there is a good chance it does. Feel free to reopen this issue if the problem persists after installing the new version — that is, version 0.8.0 and onwards.