
Shparkley: Scaling Shapley Values with Spark

Shparkley is a PySpark implementation of Shapley values that uses a Monte Carlo approximation algorithm. The implementation is based on the algorithm described in "An Efficient Explanation of Individual Classifications using Game Theory" (Štrumbelj and Kononenko).

Given a dataset and a machine learning model, Shparkley computes a Shapley value for every feature of a given feature vector. It also supports training weights and is model-agnostic.
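For intuition, the Monte Carlo approximation repeatedly draws a random feature ordering and a random background row, then attributes to each feature the change in the model's prediction when that feature's value is switched from the background value to the instance's value. Below is a minimal single-machine sketch of that idea; it is illustrative only (Shparkley distributes this work across a Spark cluster), and all names in it are hypothetical:

import random

def estimate_shapley(predict, instance, background, num_iterations=1000):
    """
    Monte Carlo estimate of Shapley values for a single instance.
    predict: function mapping one feature dictionary to a model score.
    instance: the feature dictionary being explained.
    background: a sample of training rows (feature dictionaries).
    """
    features = list(instance)
    shapley = {name: 0.0 for name in features}
    for _ in range(num_iterations):
        random.shuffle(features)                   # random feature ordering
        hybrid = dict(random.choice(background))   # start from a random background row
        for name in features:
            before = predict(hybrid)               # prediction without the instance's value
            hybrid[name] = instance[name]
            after = predict(hybrid)                # prediction with the instance's value
            # Average the marginal contribution of this feature over all orderings.
            shapley[name] += (after - before) / num_iterations
    return shapley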

Installation

pip install shparkley

You must have Apache Spark installed on your machine or cluster. (For local experimentation, pip install pyspark is sufficient.)

Example Usage

from typing import Any, Dict, List, Set

import pandas as pd

from affirm.model_interpretation.shparkley.spark_shapley import (
    compute_shapley_for_sample,
    ShparkleyModel,
)

class MyShparkleyModel(ShparkleyModel):
    """
    You need to wrap your model with the ShparkleyModel interface.
    """
    def get_required_features(self):
        # type: () -> Set[str]
        """
        Returns the set of feature names required by the model.
        """
        return {'feature-1', 'feature-2', 'feature-3'}

    def predict(self, feature_matrix):
        # type: (List[Dict[str, Any]]) -> List[float]
        """
        Wrapper function to convert the feature matrix into an acceptable format for your model.
        This function should return the predicted probabilities.
        The feature_matrix is a list of feature dictionaries,
        each mapping feature names to values.
        :return: Model predictions for all feature vectors
        """
        # Convert the feature matrix into an appropriate form for your model object.
        # self._model is the underlying model passed to the wrapper's constructor.
        pd_df = pd.DataFrame.from_dict(feature_matrix)
        preds = self._model.my_predict(pd_df)
        return preds

row = dataset.filter(dataset.row_id == 'xxxx').rdd.first()
shparkley_wrapped_model = MyShparkleyModel(my_model)

# You need to sample your dataset based on convergence criteria.
# More samples result in more accurate Shapley values.
# Repartitioning and caching the sampled dataframe will speed up computation.
sampled_df = training_df.sample(withReplacement=True, fraction=0.1).repartition(75).cache()

shapley_scores_by_feature = compute_shapley_for_sample(
    df=sampled_df,
    model=shparkley_wrapped_model,
    row_to_investigate=row,
    weight_col_name='training_weight_column_name'
)
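The returned scores can then be ranked to see which features contributed most to the prediction. A minimal sketch, assuming the result behaves as a mapping from feature name to Shapley value:

# Rank features by the magnitude of their Shapley value.
ranked = sorted(shapley_scores_by_feature.items(), key=lambda kv: abs(kv[1]), reverse=True)
for feature, score in ranked:
    print('{}: {:+.4f}'.format(feature, score))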


License

BSD 3-Clause "New" or "Revised" License

