SHAP-FOLD-PY

A Python implementation of SHAP-FOLD algorithms.
The author of SHAP-FOLD also has an original implementation:
https://github.com/fxs130430/SHAP_FOLD
You can find the research paper and a description in that repository.

Install

Prerequisites

SHAP-FOLD-PY requires only Python 3. Here are the dependencies:

  SHAP: python3 -m pip install shap
  XGBoost: python3 -m pip install xgboost
  sklearn: python3 -m pip install scikit-learn

If other libraries are still missing, you can install them with the same command.
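
To verify that everything is in place, you can try importing all three packages:

  python3 -c "import shap, xgboost, sklearn"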

Instructions

Data preparation

The SHAP-FOLD algorithm takes binary tabular data as input; each column should be an independent binary feature.
Numeric features should be discretized into as few intervals as possible.
We have prepared two different encoders in this repo for tabular data: a one-hot encoder and a decision tree encoder.

  • The one-hot encoder can be used for simple / small datasets.
  • The decision tree encoder can be used for larger datasets; only the features with high information gain are selected.

For example, the UCI acute dataset is encoded into 55 features with one-hot encoding but only 6 with decision tree encoding.
Here is an example:

from encoder import TreeEncoder, OneHotEncoder, save_data_to_file

# attrs: all the features to encode; nums: the numeric features among them
attrs = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6']
nums = ['a1']

# One-hot encoding: one binary column per (feature, value) pair
encoder = OneHotEncoder(attrs=attrs, numerics=nums, label='label', pos='yes')
data = encoder.encode('data/acute/acute.csv')
save_data_to_file(data, 'data/acute/file1.csv')

# Decision tree encoding: keeps only features with high information gain
encoder = TreeEncoder(attrs=attrs, numerics=nums, label='label', pos='yes')
data = encoder.encode('data/acute/acute.csv')
save_data_to_file(data, 'data/acute/file2.csv')

attrs lists all the required features, numerics lists the numeric features among them, label gives the column name of the label, and pos indicates which label value is treated as positive.
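
For intuition, here is what binarization produces on a tiny table. This sketch uses plain pandas rather than the repo's encoders, and the 38.0 threshold for the numeric column is hand-picked for illustration:

import pandas as pd

# Illustrative only: generic binarization with pandas, not the repo's encoder.
df = pd.DataFrame({'a1': [37.5, 40.1],     # numeric feature
                   'a2': ['yes', 'no'],    # categorical feature
                   'label': ['yes', 'no']})

# A numeric feature is discretized into binary interval columns;
# the threshold 38.0 is hand-picked here for illustration.
df['a1=<38.0'] = (df['a1'] <= 38.0).astype(int)

# A categorical feature becomes one binary column per value.
onehot = pd.get_dummies(df['a2'], prefix='a2').astype(int)

print(pd.concat([df[['a1=<38.0']], onehot, df['label']], axis=1))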

Training

The SHAP-FOLD algorithm generates an explainable model, represented as an answer set program (Prolog augmented with negation as failure), that explains an existing classification model. Here is an example:

import xgboost

# load_data, split_xy, and split_data are helper functions provided in this
# repo; check the code files for the exact module to import them from.
data, attrs = load_data('data/acute/file2.csv')
X, Y = split_xy(data)
X_train, Y_train, X_test, Y_test = split_data(X, Y, ratio=0.8)

model = xgboost.XGBClassifier(objective='binary:logistic',
                              max_depth=3,
                              n_estimators=10,
                              use_label_encoder=False).fit(X_train, Y_train,
                                                           eval_metric=["logloss"])

The code above trains an XGBoost model. Next, prepare the training input for SHAP-FOLD:

import shap

Y_train_hat = model.predict(X_train)
X_pos, X_neg = split_X_by_Y(X_train, Y_train_hat)  # split by predicted label

explainer = shap.Explainer(model)
SHAP_pos = explainer(X_pos).values
SHAP_neg = explainer(X_neg).values

SHAP_pos and SHAP_neg are the Shapley value matrices for the positive and negative data; each has one row per instance and one column per feature.

# Classifier is the SHAP-FOLD rule learner provided by this repo
model = Classifier(attrs=attrs)
model.fit(X_pos, SHAP_pos, X_neg, SHAP_neg)

The model first needs to be initialized with the specified attributes, then trained with the prepared input data. After training, the model can make predictions for the test data:

Y_test_hat = model.predict(X_test)
model.print_asp()  # flatten the learned rules and print them as an answer set program
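
Since scikit-learn is already a dependency, the fidelity of the learned rules on the test split can be checked in the usual way (continuing the example above):

from sklearn.metrics import accuracy_score

# Y_test and Y_test_hat come from the training example above.
print('accuracy:', accuracy_score(Y_test, Y_test_hat))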

The rules generated by SHAP_FOLD are stored in the model object, organized in a nested intermediate representation. Calling the print_asp function automatically flattens the nested rules and decodes them into the syntax of answer set programs.
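
To illustrate the flattening idea, the sketch below turns one nested rule (a head, a body, and a list of nested exception rules) into flat clauses by introducing abnormality predicates, which mirrors the ab... predicates in the output further below. This is a minimal sketch of the concept only; the repo's actual intermediate representation may differ:

import itertools

_counter = itertools.count(1)

def flatten(head, body, exceptions):
    # Each nested exception becomes an ab<N>(X) predicate that is negated
    # in the parent rule and defined by its own flat clause.
    rules, literals = [], list(body)
    for exc_body, exc_exceptions in exceptions:
        ab = 'ab{}(X)'.format(next(_counter))
        literals.append('not ' + ab)
        rules += flatten(ab, exc_body, exc_exceptions)
    rules.append('{} :- {}.'.format(head, ', '.join(literals)))
    return rules

# A rule with one exception: class 3 implies perished, unless age is null.
nested = ("survived(X,'0')", ["class(X,'3')"], [(["age(X,'null')"], [])])
print('\n'.join(flatten(*nested)))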

There are many UCI datasets in this repo as examples; you can find the details in the code files.

Also, the process of data preparation is different for a custom classification model. An example for a custom model can be found in example.py.

Limits

The recommended number of feature columns is less than 1,500; execution time is far more sensitive to the number of columns than to the number of rows.
For both the Shapley value calculation and the high utility item-set mining (HUIM), the time complexity is roughly polynomial in the number of instances and exponential in the number of features.
A tabular dataset with 200 rows and 1,500 columns takes about 50 minutes on a desktop with a 6-core Intel i5 CPU and 32 GB of memory.
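
To see where a given dataset falls within these limits, the fit call can be timed directly (a sketch reusing the variables from the training example above, and assuming the data splits expose numpy-style shapes):

import time

# Rough wall-clock measurement of rule extraction.
start = time.time()
model.fit(X_pos, SHAP_pos, X_neg, SHAP_neg)
print('fit took {:.1f} s on {} feature columns'.format(
    time.time() - start, X_pos.shape[1]))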

Justification using s(CASP)

Installing the s(CASP) system is necessary for this part; the examples above do not need it.

Classification and its justification can be performed with the s(CASP) system. However, each data sample needs to be converted into the predicate format that the s(CASP) system expects. The load_data_pred function can be used for this conversion; it returns a list of predicate strings, one per sample. Its numerics parameter lists all the numerical features.

nums = ['Age', 'Number_of_Siblings_Spouses', 'Number_Of_Parents_Children', 'Fare']
X_pred = load_data_pred('data/titanic/test.csv', numerics=nums)
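
Each element of X_pred is a single test sample rendered as s(CASP) facts, so printing the first element shows what the conversion produces:

# X_pred is a list of predicate strings, one entry per test sample.
print(X_pred[0])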

Here is part of the answer set program generated by SHAP_FOLD for the titanic dataset:

  ab10(X):-fare(X,N80),N80>10.1708,N80=<10.5,not ab7(X).
  ab11(X):-number_of_siblings_spouses(X,N50),N50>0.0,N50=<1.0,fare(X,N91),N91>15.5,N91=<17.4.
  ab12(X):-age(X,'null'),not ab11(X).
  ... ...

  survived(X,'0'):-age(X,'null'),not ab38(X).
  survived(X,'0'):-class(X,'3'),embarked(X,'s'),not ab32(X),not ab33(X).
  survived(X,'0'):-class(X,'3'),not ab35(X),not ab37(X).
  survived(X,'0'):-embarked(X,'s'),not ab38(X),not ab39(X),not ab40(X),not ab41(X).
  ... ...
  % # of rules:  50
  % acc 0.9187 p 0.8946 r 0.9887 f1 0.9393

An easier way to get a justification from the s(CASP) system is to call the scasp_query function. It sends the generated ASP rules, the converted data, and a query to the s(CASP) system. A natural language translation template, specified in advance, can make the justification easier to understand, but it is not required. The template gives the English string corresponding to each predicate that models a feature. Here is a (self-explanatory) example of a translation template:

#pred sex(X,Y) :: 'person @(X) is @(Y)'.
#pred age(X,Y) :: 'person @(X) is of age @(Y)'.
#pred number_of_siblings_spouses(X,Y) :: 'person @(X) had @(Y) siblings or spouses'.
... ...
#pred ab2(X) :: 'abnormal case 2 holds for @(X)'.
#pred ab3(X) :: 'abnormal case 3 holds for @(X)'.
... ...

The template file can be loaded into the model object with the load_translation function. Then the justification is generated by calling scasp_query. If the input data is in predicate format, the pred parameter needs to be set to True.

load_translation(model, 'data/titanic/template.txt')
print(scasp_query(model, x, pred=False))  # x: a single raw (non-predicate) sample
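
For a sample that was already converted with load_data_pred, the same query sets the pred flag instead (following the description above):

# X_pred comes from the load_data_pred example earlier.
print(scasp_query(model, X_pred[0], pred=True))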

Here is the justification for a passenger in the titanic example above (note that survived(2,0) means that the passenger with id 2 perished, denoted by 0):

% QUERY:I would like to know if
'goal' holds (for 2).

	ANSWER:	1 (in 2.368 ms)

JUSTIFICATION_TREE:
'goal' holds (for 2), because
    'survived' holds (for 2, and 0), because
	person 2 is male, and
	person 2 is of age 62.0, and
	person 2 is of age 62.0, justified above.
The global constraints hold.

MODEL:
{ goal(2),  survived(2,0),  sex(2,male),  not ab39(2),  not age(2,Var0 | {Var0 \= 62.0}),  age(2,62.0),  not ab43(2),  not age(2,Var1 | {Var1 \= 62.0}),  not ab41(2),  not age(2,Var2 | {Var2 \= 62.0}) }

The code of the above example can be found in example.py.

s(CASP)

All the resources of s(CASP) can be found at https://gitlab.software.imdea.org/ciao-lang/sCASP.
