`.predict` returns class indices (integers), not class labels; `.score` doesn't work
Gabriel-Kissin opened this issue
The RuleFitClassifier .predict method returns class indices (integers). To be consistent with sklearn, it should return class labels.
This is not only a question of consistency. It has a knock-on effect: the .score method doesn't work when y isn't a vector of 0/1 or False/True. This is because y_pred is supplied by the .predict method and is a vector of 0s and 1s, whereas y_true holds the class labels, raising a TypeError.
Example:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import imodels
import sklearn.model_selection
iris = load_iris(as_frame=True)
X = iris['data']
y = iris['target']
# RuleFitClassifier only supports binary classification, so restrict to 2 classes
X = X[y.isin([0, 2])]
y = y[y.isin([0, 2])]
# map the integer targets to their string names ('setosa' / 'virginica')
y = y.map(dict(enumerate(iris['target_names'])))
# y = y.astype(bool)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=.05)
lrc = LogisticRegression(random_state=0, max_iter=1000)
lrc.fit(X_train, y_train)
lrc_preds = lrc.predict(X_test)
print(lrc_preds)
print(y_test.values)
print(lrc.score(X_test, y_test))
print()
# the same data with imodels' RuleFitClassifier fails at .score
rfc = imodels.RuleFitClassifier()
rfc.fit(X_train, y_train)
rfc_preds = rfc.predict(X_test)
print(rfc_preds)
print(y_test.values)
print(rfc.score(X_test, y_test))
Output:
['setosa' 'virginica' 'setosa' 'setosa' 'setosa']
['setosa' 'virginica' 'setosa' 'setosa' 'setosa']
1.0
[0 1 0 0 0]
['setosa' 'virginica' 'setosa' 'setosa' 'setosa']
Error:
TypeError: Labels in y_true and y_pred should be of the same type. Got y_true=['setosa' 'virginica'] and y_pred=[0 1]. Make sure that the predictions provided by the classifier coincides with the true labels.
So whereas the sklearn LogisticRegression works perfectly, the RuleFitClassifier errors as a result of this inconsistency.
This appears to be an issue across imodels, not just with RuleFitClassifier. Replacing rfc = imodels.RuleFitClassifier() with rfc = imodels.FIGSClassifier() in the above example produces the identical error, and the same goes for any of the other binary classification models listed here.
Looking at the source code, the issue appears to be this line:
return np.argmax(self.predict_proba(X), axis=1)
which returns the column-wise argmax of the probabilities, i.e. class indices rather than labels.
This can easily be fixed by changing said line to
return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
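To illustrate why indexing into self.classes_ restores the labels, here is a small standalone sketch (the classes_ array and probabilities below are made up for illustration, not taken from imodels):

```python
import numpy as np

# hypothetical fitted attribute: the labels the classifier learned
classes_ = np.array(['setosa', 'virginica'])

# hypothetical output of predict_proba: one probability per class, per sample
proba = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])

idxs = np.argmax(proba, axis=1)  # column-wise argmax -> indices [0, 1, 0]
labels = classes_[idxs]          # fancy indexing maps indices back to labels
print(labels)                    # ['setosa' 'virginica' 'setosa']
```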
Temporary fix:
class CorrectedRuleFitClassifier(imodels.RuleFitClassifier):
    def predict(self, X):
        # map the predicted class indices back to the fitted class labels
        pred_idxs = super().predict(X)
        preds = self.classes_[pred_idxs]
        return preds
Using that:
rfc = CorrectedRuleFitClassifier()
rfc.fit(X_train, y_train)
rfc_preds = rfc.predict(X_test)
print(rfc_preds)
print(y_test.values)
print(rfc.score(X_test, y_test))
Gives us the expected output:
['virginica' 'virginica' 'setosa' 'virginica' 'virginica']
['virginica' 'virginica' 'setosa' 'virginica' 'virginica']
1.0
Because the predict method is fixed, the score method works again automatically.
Note that this is for RuleFitClassifier, but the same approach should work for any of the other classifiers which share this issue.
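As a sketch, the same workaround can be written as a generic wrapper factory so it applies to any affected classifier. The helper with_label_predict below is hypothetical (not part of imodels), and the stand-in FakeClassifier just mimics the index-returning bug for demonstration:

```python
import numpy as np

def with_label_predict(cls):
    """Return a subclass of `cls` whose predict maps class indices
    back to the labels stored in `classes_` (hypothetical helper)."""
    class Corrected(cls):
        def predict(self, X):
            pred_idxs = super().predict(X)
            return np.asarray(self.classes_)[pred_idxs]
    Corrected.__name__ = f"Corrected{cls.__name__}"
    return Corrected

# usage with imodels would look like:
# CorrectedFIGSClassifier = with_label_predict(imodels.FIGSClassifier)

# quick check against a stand-in classifier that mimics the bug
class FakeClassifier:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        return self
    def predict(self, X):
        # always predict index 0, mimicking the index-not-label bug
        return np.zeros(len(X), dtype=int)

clf = with_label_predict(FakeClassifier)().fit(None, ['a', 'b'])
print(clf.predict([1, 2, 3]))  # ['a' 'a' 'a']
```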
Thanks for your interest in the package and for pointing this out! I've just pushed a fix in 50431c8 that should resolve this (and bumped the imodels version, so pip install --upgrade imodels should install the version with the fix).