marcotcr / anchor

Code for "High-Precision Model-Agnostic Explanations" paper

AnchorTabularExplainer without categorical features

asstergi opened this issue · comments

Hi @marcotcr ,

Firstly, the paper is great and I'm really looking forward to using the package.

I tried to use it on my own data, where the AnchorTabularExplainer() object has no categorical_names (i.e. no categorical features). When explain_instance() is called, the code reaches https://github.com/marcotcr/anchor/blob/master/anchor/anchor_tabular.py#L215 and, since there are no categorical features, the mapping dict remains empty, so the method does not work.

Am I missing something? Or, is there something I can do to overcome this?

Hello,
I'm glad you found the paper interesting.
You are not missing anything; this is a bug in the code.
The anchor method needs categorical data, so I used to have a discretizer in the __init__ method for when the model uses numerical features. To be clear: the black box model can use continuous data, but the resulting anchor will be in discretized bins, such as "If Salary > 5000, predict X".

I must have removed that at some point and forgotten to put it back in.
I'll try to add it back soon, thanks for letting me know.

In the meantime, you can discretize your data first, similar to what I do here
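
As an illustration of this kind of pre-processing, here is a minimal sketch of quartile binning with NumPy. It is not the code linked above: the helper names (`fit_quartile_edges`, `discretize`, `bin_names`) and the toy data are assumptions made for the example. Bins are right-closed, so each bin can be read as a threshold rule in the style of "Salary > 5000".

```python
import numpy as np

def fit_quartile_edges(X):
    """Return per-column bin edges at the 25th, 50th and 75th percentiles."""
    return np.percentile(X, [25, 50, 75], axis=0).T  # shape: (n_features, 3)

def discretize(X, edges):
    """Map every value to a bin index 0..3; bins are right-closed: (low, high]."""
    X = np.asarray(X, dtype=float)
    out = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        # right=True puts a value equal to an edge into the lower bin
        out[:, j] = np.digitize(X[:, j], edges[j], right=True)
    return out

def bin_names(name, e):
    """Human-readable names for the bins of one feature (hypothetical helper)."""
    names = ['%s <= %.2f' % (name, e[0])]
    names += ['%.2f < %s <= %.2f' % (lo, name, hi) for lo, hi in zip(e[:-1], e[1:])]
    names.append('%s > %.2f' % (name, e[-1]))
    return names

# Toy data (assumption): 5 rows, 2 numerical features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])
edges = fit_quartile_edges(X_train)   # fit on training data only
X_disc = discretize(X_train, edges)   # reuse the same edges for validation/test data
```

Judging from how `categorical_names` is used later in this thread (feature index mapped to a list of value names), the lists returned by `bin_names` would presumably fill that dict, and the classifier would then be trained on `X_disc`.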

Hi @marcotcr,

I discretized the data and got anchor working, thank you!

However, I'm seeing some inconsistencies in the reported coverage and precision when I try to use the anchor explanation on the original dataset (i.e. before the discretization).

Not sure if you can help just by looking at this code, but here's what I'm doing:
```python
# Anchor as reported by the explainer
print('Anchor: %s' % (' AND '.join(exp.names())))

# Coverage and precision on the discretized test set, using model predictions
fit_anchor = np.where(np.all(X_trans_test_disc[:, exp.features()] == X_trans_test_disc[idx][exp.features()], axis=1))[0]
print('Anchor test coverage: %.4f' % (fit_anchor.shape[0] / float(X_trans_test_disc.shape[0])))
print('Anchor test precision: %.4f' % (np.mean(predict_fn(X_trans_test_disc[fit_anchor]) == predict_fn(X_trans_test_disc[idx].reshape(1, -1)))))

# The same anchor applied by hand to the original (non-discretized) data, using y_trans
anch = y_trans[(X_trans['this_race_last_year_result'] > 1.50) &
               (X_trans['grid'] > -9.50) &
               (X_trans['grid'] <= -5.50)]
print('Anchor test coverage (orig): %.4f' % (1.0 * anch.shape[0] / y_trans.shape[0]))
print('Anchor test precision (orig): %.4f' % (1.0 * anch.sum() / anch.shape[0]))
```

And here's the output:

Anchor: -9.50 < grid <= -5.50 AND this_race_last_year_result > 1.50

Anchor test coverage: 0.0316
Anchor test precision: 1.0000

Anchor test coverage (orig): 0.0486
Anchor test precision (orig): 0.8527

I would expect the figures to match. Any idea on this?

If the validation and test distributions are similar, the numbers should match. I would have to look at it in more detail to tell whether your discretization is affecting the results or whether there's a bug in the code. I can take a look if you can share a notebook.

The newest version I uploaded has discretization built in; you may want to give it a try.
It may be buggy since I didn't test it thoroughly, so it may be safer to train a classifier on discretized data, as you're doing.
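
For comparing such figures, a minimal sketch of how the coverage and precision of a threshold anchor could be estimated directly on raw data is given below. It assumes the usual definitions (coverage: fraction of rows satisfying the anchor; precision: fraction of those rows for which the model predicts the same class as for the explained instance). The function names, the toy `predict_fn`, and the toy data are assumptions for illustration, not part of the package. Both numbers are computed on the same data split and against model predictions rather than labels; whether that accounts for the discrepancy reported above is not established in this thread.

```python
import numpy as np

def anchor_mask(X, conditions):
    """conditions: list of (column_index, low, high), each meaning low < x <= high."""
    mask = np.ones(X.shape[0], dtype=bool)
    for col, low, high in conditions:
        mask &= (X[:, col] > low) & (X[:, col] <= high)
    return mask

def coverage_and_precision(X, conditions, predict_fn, instance):
    """Empirical coverage and precision of an anchor on the rows of X."""
    mask = anchor_mask(X, conditions)
    coverage = mask.mean()
    if not mask.any():
        return coverage, float('nan')  # precision is undefined with no covered rows
    target = predict_fn(instance.reshape(1, -1))[0]
    precision = np.mean(predict_fn(X[mask]) == target)
    return coverage, precision

# Toy model and data (assumptions): predicts class 1 when column 0 is positive
predict_fn = lambda X: (X[:, 0] > 0).astype(int)
X_test = np.array([[1.0, -7.0], [2.0, -6.0], [-1.0, -7.0], [3.0, 0.0]])

# Anchor: column 0 > 0.5 AND -9.5 < column 1 <= -5.5
conditions = [(0, 0.5, np.inf), (1, -9.5, -5.5)]
coverage, precision = coverage_and_precision(X_test, conditions, predict_fn, X_test[0])
print('coverage: %.4f, precision: %.4f' % (coverage, precision))
```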

Hello @marcotcr,
I am also trying to use numerical features.
You suggested discretizing the data before giving it to AnchorTabularExplainer, right?
How will the AnchorTabularExplainer know how to invert the discretization to get predictions on the perturbed samples?

If you discretize the data before you give it to AnchorTabularExplainer, you have to train the model on the discretized features. If you want the black box model to use numerical features, you have to use the newest version with built-in discretization.
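
To make the second option concrete, here is one way a mapping from bins back to raw values could work, so that a model trained on numerical features can still be queried on perturbed, discretized samples. This is only a sketch under stated assumptions: `undiscretize` and `predict_on_bins` are hypothetical names, the sampling rule (draw a training value from the same bin) is a choice made for the example, and the package's built-in discretizer may do this differently.

```python
import numpy as np

def undiscretize(X_disc, edges, X_train, rng):
    """Replace each bin index by a raw training value drawn from that bin.

    Assumes every bin index present in X_disc contains at least one training value.
    """
    X_raw = np.empty(X_disc.shape, dtype=float)
    for j in range(X_disc.shape[1]):
        # Bin index of every training value for feature j (right-closed bins)
        train_bins = np.digitize(X_train[:, j], edges[j], right=True)
        for b in np.unique(X_disc[:, j]):
            rows = X_disc[:, j] == b
            pool = X_train[train_bins == b, j]
            X_raw[rows, j] = rng.choice(pool, size=rows.sum())
    return X_raw

# Toy setup (assumptions): one feature, bin edges at 2, 3 and 4
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
edges = np.array([[2.0, 3.0, 4.0]])
rng = np.random.default_rng(0)

# Stand-in for a black box model that expects raw numerical features
model_predict = lambda X: (X[:, 0] > 2.5).astype(int)

def predict_on_bins(X_disc):
    """Wrapper that accepts bin indices and queries the model on raw values."""
    return model_predict(undiscretize(X_disc, edges, X_train, rng))

X_disc = np.array([[0], [3], [1]])
X_raw = undiscretize(X_disc, edges, X_train, rng)
```

With a wrapper like `predict_on_bins`, the perturbation happens in bin space while the model itself never sees bin indices.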

Hi there.
I found the same problem and used the following workaround, which works fine for me.
In the file anchor_tabular.py, add an else clause to the __init__ method of class AnchorTabularExplainer:

class AnchorTabularExplainer(object):

    ... original code ...

    def __init__(self, class_names, feature_names, data=None,

        ... original code ...

        if categorical_names:
            # TODO: Check if this n_values is correct!!
            cat_names = sorted(categorical_names.keys())
            n_values = [len(categorical_names[i]) for i in cat_names]
            self.encoder = sklearn.preprocessing.OneHotEncoder(
                categorical_features=cat_names,
                n_values=n_values)
            self.encoder.fit(data)
            self.categorical_features = self.encoder.categorical_features
        else:  ## Allow for datasets without categorical names
            categorical_names = {}

        ... original code ...

This prevents the update from failing and allows your numerical variables to be discretized within the explainer.

Has this been fixed in the code? Or do we still have to use the workaround?
Never mind, I figured it out. I had to fit the classifier too, not only the explainer.

Thanks,
Amr

@eindzl Thanks, I had the same problem and it now works correctly after your update.

Will this workaround be implemented at some point?