MaxHalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA

Home Page:https://maxhalford.github.io/prince

Can not handle Categorical in FAMD

kabirmdasraful opened this issue

I am using the latest versions of pandas and Prince.
When I run the following example, it fails:

import pandas as pd
import prince

df = pd.DataFrame({
    'variable_1': [4, 5, 6, 7, 11, 2, 52],
    'variable_2': [10, 20, 30, 40, 10, 74, 10],
    'variable_3': [100, 50, 30, 50, 19, 29, 20],
    'color': ['red', 'blue', 'green', 'blue', 'red', 'red', 'blue']
})

# Casting to category dtype is what triggers the error below
df['color'] = df['color'].astype('category')
model = prince.FAMD(
    n_components=2,
    copy=True,
    check_input=True,
    engine='auto',
    random_state=1
)
model.fit(df)

ValueError: Not all columns in "Categorical" group are of the same type

I have also looked into why this happens.
When fit in mfa.py is called, it checks whether each group is numerical or categorical with the following code:

    for name, cols in sorted(self.groups.items()):
        all_num = all(pd.api.types.is_numeric_dtype(X[c]) for c in cols)
        all_cat = all(pd.api.types.is_string_dtype(X[c]) for c in cols)
        if not (all_num or all_cat):
            raise ValueError('Not all columns in "{}" group are of the same type'.format(name))

This worked with earlier versions of pandas. Now, however, the check all_cat = all(pd.api.types.is_string_dtype(X[c]) for c in cols) no longer succeeds for columns of category dtype, only for columns of object dtype.
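A quick way to see which check a column passes is to run the dtype predicates side by side. The sketch below is only an illustration and not code from Prince; the helper name `describe_dtype` is hypothetical. The result of `is_string_dtype` on a categorical column is deliberately not stated, since it depends on the installed pandas version.

```python
import pandas as pd
from pandas.api import types as ptypes

df = pd.DataFrame({
    'variable_1': [4, 5, 6, 7, 11, 2, 52],
    'color': ['red', 'blue', 'green', 'blue', 'red', 'red', 'blue'],
})
df['color_cat'] = df['color'].astype('category')


def describe_dtype(series):
    """Hypothetical helper: report which dtype predicates a column satisfies."""
    return {
        'numeric': ptypes.is_numeric_dtype(series),
        # Version-dependent for category columns, so inspect it locally.
        'string': ptypes.is_string_dtype(series),
        'categorical': isinstance(series.dtype, pd.CategoricalDtype),
    }


for col in df.columns:
    print(col, df[col].dtype, describe_dtype(df[col]))
```

Running this on the pandas version in question shows directly whether the `'string'` entry is false for `color_cat`, which is the condition that makes the group check raise.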

So this part probably needs to be changed as follows:

    for name, cols in sorted(self.groups.items()):
        all_num = all(pd.api.types.is_numeric_dtype(X[c]) for c in cols)
        all_obj = all(pd.api.types.is_string_dtype(X[c]) for c in cols)
        all_cat = all(pd.api.types.is_categorical_dtype(X[c]) for c in cols)
        if not (all_num or all_obj or all_cat):
            raise ValueError('Not all columns in "{}" group are of the same type'.format(name))

Am I right?
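A variant of this check, sketched below under some assumptions, treats object, string, and category columns as one kind, so a group that mixes `object` and `category` columns is still accepted. It also uses `isinstance(dtype, pd.CategoricalDtype)` in place of `is_categorical_dtype`, which newer pandas releases deprecate. The function names `is_categorical_like` and `check_group_types` are hypothetical and not part of Prince.

```python
import pandas as pd
from pandas.api import types as ptypes


def is_categorical_like(series):
    """Hypothetical helper: True for object, string, or category columns."""
    return (
        isinstance(series.dtype, pd.CategoricalDtype)
        or ptypes.is_object_dtype(series)
        or ptypes.is_string_dtype(series)
    )


def check_group_types(X, groups):
    """Return {group name: all columns numeric?}; raise on mixed groups."""
    all_nums = {}
    for name, cols in sorted(groups.items()):
        all_num = all(ptypes.is_numeric_dtype(X[c]) for c in cols)
        all_cat = all(is_categorical_like(X[c]) for c in cols)
        if not (all_num or all_cat):
            raise ValueError(
                'Not all columns in "{}" group are of the same type'.format(name)
            )
        # Stored inside the loop so that every group gets an entry.
        all_nums[name] = all_num
    return all_nums


df = pd.DataFrame({
    'variable_1': [4, 5, 6, 7, 11, 2, 52],
    'variable_2': [10, 20, 30, 40, 10, 74, 10],
    'color': ['red', 'blue', 'green', 'blue', 'red', 'red', 'blue'],
    'shape': ['a', 'b', 'a', 'b', 'a', 'b', 'a'],
})
df['color'] = df['color'].astype('category')  # 'shape' is left as strings

groups = {
    'Numerical': ['variable_1', 'variable_2'],
    'Categorical': ['color', 'shape'],
}
print(check_group_types(df, groups))
# {'Categorical': False, 'Numerical': True}
```

A group that mixes a numeric column with a categorical one, such as `{'Mixed': ['variable_1', 'color']}`, still raises the `ValueError`.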

Nice solution. After implementing it, I got an exception due to the for-loop directly after this in mfa.py.

You referenced this in your issue:

prince/prince/mfa.py

Lines 45 to 52 in 988f7fe

# Check group types are consistent
self.all_nums_ = {}
for name, cols in sorted(self.groups.items()):
    all_num = all(pd.api.types.is_numeric_dtype(X[c]) for c in cols)
    all_cat = all(pd.api.types.is_string_dtype(X[c]) for c in cols)
    if not (all_num or all_cat):
        raise ValueError('Not all columns in "{}" group are of the same type'.format(name))
    self.all_nums_[name] = all_num

I implemented your fix (important to keep self.all_nums_[name] = all_num), and got a new exception referencing the for-loop after this, here:

prince/prince/mfa.py

Lines 54 to 75 in 988f7fe

# Run a factor analysis in each group
self.partial_factor_analysis_ = {}
for name, cols in sorted(self.groups.items()):
    if self.all_nums_[name]:
        fa = pca.PCA(
            rescale_with_mean=False,
            rescale_with_std=False,
            n_components=self.n_components,
            n_iter=self.n_iter,
            copy=True,
            random_state=self.random_state,
            engine=self.engine
        )
    else:
        fa = mca.MCA(
            n_components=self.n_components,
            n_iter=self.n_iter,
            copy=self.copy,
            random_state=self.random_state,
            engine=self.engine
        )
    self.partial_factor_analysis_[name] = fa.fit(X.loc[:, cols])

The problem, if I understand the code correctly, is that when running FAMD, self.all_nums_ is a dictionary with 'Numerical' as its only key. However, self.groups, which is created in famd.py, is a dictionary with both 'Numerical' and 'Categorical' as keys. So when the loop evaluates if self.all_nums_[name] with name set to 'Categorical', it raises a KeyError.
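One way a dictionary like `self.all_nums_` can end up with a single key is if the assignment runs once after the loop instead of once per group. This is only a guess at the cause and is not confirmed anywhere in this thread. Because the groups are iterated in sorted order, 'Numerical' comes last and would be the only key stored. A minimal sketch with plain dictionaries, where the flag values are made up for illustration:

```python
# Made-up stand-in for the per-group "all columns numeric" results.
group_is_numeric = {'Categorical': False, 'Numerical': True}

# Assignment outside the loop: only the last group in sorted order is stored.
outside = {}
for name, all_num in sorted(group_is_numeric.items()):
    pass
outside[name] = all_num

# Assignment inside the loop: every group is stored.
inside = {}
for name, all_num in sorted(group_is_numeric.items()):
    inside[name] = all_num

print(outside)  # {'Numerical': True}
print(inside)   # {'Categorical': False, 'Numerical': True}

try:
    outside['Categorical']
except KeyError as exc:
    print('KeyError:', exc)  # KeyError: 'Categorical'
```

If that is what happened, restoring the indentation of the assignment would remove the need for a workaround based on the group name.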

I changed
if self.all_nums_[name]:
to
if name == 'Numerical':

This, however, is only a quick fix and probably not a robust one.

I am still getting this error even after making these code changes. I tried PCA and MCA from this package separately on my data and both worked normally. I have no idea how to solve it.

I would like to inquire about the status of this issue.

I tried converting my Categorical variables to string type in order to fit the old pandas API. However, the code then breaks at the one-hot encoding step, because get_dummies() does not support strings in newer pandas versions (it does nothing).
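The encoding step can be checked in isolation. The sketch below assumes the one-hot encoding is done with `pd.get_dummies`, as described above, and reuses the example data from this issue. Which dtypes `get_dummies` picks up by default has varied between pandas versions, so the call names the column explicitly through the `columns` argument instead of relying on dtype detection.

```python
import pandas as pd

df = pd.DataFrame({
    'variable_1': [4, 5, 6, 7, 11, 2, 52],
    'color': ['red', 'blue', 'green', 'blue', 'red', 'red', 'blue'],
})

# Naming the column explicitly avoids relying on dtype-based auto-detection.
encoded = pd.get_dummies(df, columns=['color'])
print(list(encoded.columns))
# ['variable_1', 'color_blue', 'color_green', 'color_red']
```

If the printed columns still contain a plain `color` column after calling `get_dummies` without `columns`, the column's dtype was not recognised by that pandas version.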

Maybe the README could be updated to warn that newer pandas versions can cause issues. The last supported version could also be pinned in the requirements until this issue is solved.

Hello,

I am also having this issue, and it is not solved when changing
if self.all_nums_[name]:
to
if name == 'Numerical':

Any idea of what I should change to make the code work correctly?
Thank you,

Same issue here with pandas 1.2.4 and prince 0.7.1

If someone is wondering, a temporary fix is to switch from dtype category to dtype object

df['color'] = df['color'].astype('object')
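For a frame with several categorical columns, the same workaround can be applied to all of them at once. This is a minimal sketch using the example data from this issue; it only prepares the frame and does not call Prince.

```python
import pandas as pd

df = pd.DataFrame({
    'variable_1': [4, 5, 6, 7, 11, 2, 52],
    'variable_2': [10, 20, 30, 40, 10, 74, 10],
    'color': ['red', 'blue', 'green', 'blue', 'red', 'red', 'blue'],
})
df['color'] = df['color'].astype('category')

# Find every column with category dtype and cast it to object.
cat_cols = df.select_dtypes(include='category').columns
df = df.astype({col: 'object' for col in cat_cols})

print(df.dtypes)
# The frame can then be passed to model.fit(df) as in the original example.
```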

Hello there 👋

I apologise for not answering earlier. I was not maintaining Prince anymore. However, I have just refactored the entire codebase. This refactoring should have fixed many bugs.

I don’t have time and energy to check if this fixes your issue, but there is a good chance it does. Feel free to reopen this issue if the problem persists after installing the new version — that is, version 0.8.0 and onwards.