csinva / imodels

Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).

Home Page: https://csinva.io/imodels


`.predict` returns class indices (integers), not class labels; `.score` doesn't work

Gabriel-Kissin opened this issue · comments

The `RuleFitClassifier.predict` method returns class indices (integers). For consistency with sklearn, it should return class labels.

This is not only a question of consistency. It has the knock-on effect that `.score` fails whenever `y` is anything other than a vector of 0/1 or False/True: `.score` compares `y_true`, which holds the original class labels, against `y_pred` from `.predict`, which is a vector of 0/1 indices, and the type mismatch raises a `TypeError`.

Example:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import imodels
import sklearn.model_selection

iris = load_iris(as_frame=True)
X = iris['data']
y = iris['target']

# RuleFitClassifier only supports binary classification, so restrict to 2 classes
X = X[y.isin([0,2])]
y = y[y.isin([0,2])]

y = y.map(dict(enumerate(iris['target_names'])))
# y = y.astype(bool)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=.05)

lrc = LogisticRegression(random_state=0, max_iter=1000)
lrc.fit(X_train, y_train)
lrc_preds = lrc.predict(X_test)
print(lrc_preds)
print(y_test.values)
print(lrc.score(X_test, y_test))
print()

rfc = imodels.RuleFitClassifier()
rfc.fit(X_train, y_train)
rfc_preds = rfc.predict(X_test)

print(rfc_preds)
print(y_test.values)
print(rfc.score(X_test, y_test))

Output:

['setosa' 'virginica' 'setosa' 'setosa' 'setosa']
['setosa' 'virginica' 'setosa' 'setosa' 'setosa']
1.0

[0 1 0 0 0]
['setosa' 'virginica' 'setosa' 'setosa' 'setosa']

Error:

TypeError: Labels in y_true and y_pred should be of the same type. Got y_true=['setosa' 'virginica'] and y_pred=[0 1]. Make sure that the predictions provided by the classifier coincides with the true labels.

So whereas sklearn's LogisticRegression works perfectly, the RuleFitClassifier errors out as a result of this inconsistency.

This appears to be an issue across imodels, not just with RuleFitClassifier. Replacing `rfc = imodels.RuleFitClassifier()` with `rfc = imodels.FIGSClassifier()` in the example above raises the identical error, as does any of the other binary classification models listed here.

Looking at the source code, the issue appears to be here:

return np.argmax(self.predict_proba(X), axis=1)

which returns the column-wise argmax of the probabilities: class indices, rather than class labels.

This can easily be fixed by changing that line to

return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
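A minimal sketch of why this works (standalone NumPy, not the imodels internals): `classes_` holds the labels in sorted order as seen during `fit`, so indexing it with the argmax indices recovers predictions in the original label dtype.

```python
import numpy as np

classes_ = np.array(["setosa", "virginica"])            # labels stored at fit time
proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])  # predict_proba output

idx = np.argmax(proba, axis=1)  # array([0, 1, 0]) -- what predict returns now
labels = classes_[idx]          # what predict should return
print(labels)                   # -> ['setosa' 'virginica' 'setosa']
```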

Temporary fix:

class CorrectedRuleFitClassifier(imodels.RuleFitClassifier):
    def predict(self, X):
        # map the integer class indices back to the original labels
        pred_idxs = super().predict(X)
        return self.classes_[pred_idxs]

Using that:

rfc = CorrectedRuleFitClassifier()
rfc.fit(X_train, y_train)
rfc_preds = rfc.predict(X_test)

print(rfc_preds)
print(y_test.values)
print(rfc.score(X_test, y_test))

Gives us the expected output:

['virginica' 'virginica' 'setosa' 'virginica' 'virginica']
['virginica' 'virginica' 'setosa' 'virginica' 'virginica']
1.0

With the `.predict` method fixed, the `.score` method automatically works again.

Note that this wrapper is for RuleFitClassifier, but the same approach should work for any of the other classifiers that share this issue.
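That approach can be generalized with a small factory (a sketch; the `with_label_predict` name is mine, not part of imodels) that wraps any affected classifier class so that `predict` maps indices back through `classes_`:

```python
import numpy as np

def with_label_predict(cls):
    """Return a subclass of `cls` whose predict returns labels, not indices."""
    class Fixed(cls):
        def predict(self, X):
            # the broken predict returns integer indices into classes_
            pred_idxs = np.asarray(super().predict(X)).astype(int)
            return self.classes_[pred_idxs]
    Fixed.__name__ = f"Corrected{cls.__name__}"
    return Fixed

# usage, assuming imodels is installed:
#   CorrectedFIGSClassifier = with_label_predict(imodels.FIGSClassifier)
```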

Thanks for your interest in the package and for pointing this out! I've just pushed a fix for this in 50431c8, and bumped the imodels version, so `pip install --upgrade imodels` should install the version with the fix.