SHAP Model Explanation in R

Question

SHAP Model Explanation in R

ixxmu opened this issue 3 months ago · comments

ixxmu commented 3 months ago

https://mp.weixin.qq.com/s/vibNxgcs3MyduoNacuW7Bw

ixxmu · Answer 1 · Wed Feb 28 2024 12:13:11 GMT+0800 (China Standard Time)

SHAP Model Explanation in R, by 医学和生信笔记

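Break-down explanations depend heavily on the order of the predictors. There are two remedies: put the most important variables first, or identify interactions between variables and use a method designed for them.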
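The order dependence can be seen on a toy example. The sketch below is illustrative only and is not how DALEX computes break-down explanations internally: `toy_model`, `background`, and `break_down_path` are hypothetical names, and the model is a pure interaction chosen to make the effect obvious.

```r
# Toy background data: every combination of two binary predictors
background <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))

# Hypothetical model with a pure interaction, for illustration only
toy_model <- function(d) d$x1 * d$x2

# Observation to explain
obs <- data.frame(x1 = 1, x2 = 1)

# Fix variables to the observation's values one at a time, in the given
# order, and record how much the mean prediction changes at each step
break_down_path <- function(order) {
  d <- background
  previous <- mean(toy_model(d))
  contrib <- numeric(0)
  for (v in order) {
    d[[v]] <- obs[[v]]
    current <- mean(toy_model(d))
    contrib[v] <- current - previous
    previous <- current
  }
  contrib
}

path_a <- break_down_path(c("x1", "x2"))  # x1 first
path_b <- break_down_path(c("x2", "x1"))  # x2 first
print(path_a)
print(path_b)
```

Both paths sum to the same total, but the share credited to each variable changes with the order: whichever variable is fixed first receives the smaller contribution here.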

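Neither remedy for order dependence works particularly well, which is why SHAP (SHapley Additive exPlanations) was developed. SHAP is based on the Shapley values that Lloyd Shapley introduced in game theory, and the name is an acronym for a method designed specifically for predictive models.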

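Put simply, SHAP considers every possible ordering (permutation) of the variables and averages each variable's contribution (also called its attribution) over those orderings. This approach is known as permutation SHAP.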
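The averaging over orderings can be sketched in base R on a small example. This is a minimal illustration under stated assumptions, not the DALEX implementation: `toy_model`, `background`, `all_orderings`, and `path_contributions` are hypothetical names, and enumerating every ordering is only feasible for a handful of variables.

```r
# Hypothetical toy model with a main effect and an interaction
toy_model <- function(d) 2 * d$x1 + d$x2 * d$x3

# Background data: all combinations of three binary predictors
background <- expand.grid(x1 = c(0, 1), x2 = c(0, 1), x3 = c(0, 1))
obs <- data.frame(x1 = 1, x2 = 1, x3 = 1)

# Enumerate all orderings of a vector (fine for a few variables only)
all_orderings <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (rest in all_orderings(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], rest)
    }
  }
  out
}

# Contributions along one ordering: fix variables one at a time and
# record the change in the mean prediction at each step
path_contributions <- function(order) {
  d <- background
  previous <- mean(toy_model(d))
  contrib <- setNames(numeric(length(order)), order)
  for (v in order) {
    d[[v]] <- obs[[v]]
    current <- mean(toy_model(d))
    contrib[v] <- current - previous
    previous <- current
  }
  contrib[names(background)]  # return in a fixed column order
}

# One row per ordering, one column per variable
paths <- t(sapply(all_orderings(names(background)), path_contributions))

# Averaging over all orderings gives each variable's Shapley value
shapley <- colMeans(paths)
baseline <- mean(toy_model(background))
print(shapley)
```

The averaged contributions plus the baseline (the mean prediction over the background data) add up to the model's prediction for the observation, which is the additive property the name refers to.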

Because it is a model-agnostic explanation, this method applies to any model; this post uses a random forest as the example.

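If you already understand how break-down explanations work, permutation SHAP is easy to follow. The detailed formula is not shown here; interested readers can look it up themselves.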

Today's post covers instance-level SHAP in R, again using DALEX; three lines of code are enough! There is much more to say about SHAP, which later posts will cover.

library(DALEX)
data("titanic_imputed")

# Convert the outcome variable to a factor
titanic_imputed$survived <- factor(titanic_imputed$survived)

dim(titanic_imputed)

[1] 2207    8

str(titanic_imputed)

'data.frame':   2207 obs. of  8 variables:
 $ gender  : Factor w/ 2 levels "female","male": 2 2 2 1 1 2 2 1 2 2 ...
 $ age     : num  42 13 16 39 16 25 30 28 27 20 ...
 $ class   : Factor w/ 7 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 2 2 3 3 ...
 $ embarked: Factor w/ 4 levels "Belfast","Cherbourg",..: 4 4 4 4 4 4 2 2 2 4 ...
 $ fare    : num  7.11 20.05 20.05 20.05 7.13 ...
 $ sibsp   : num  0 0 1 1 0 0 1 1 0 0 ...
 $ parch   : num  0 2 1 1 0 0 0 0 0 0 ...
 $ survived: Factor w/ 2 levels "0","1": 1 1 1 2 2 2 1 2 2 2 ...

Fit a random forest model:

library(randomForest)

set.seed(123)
titanic_rf <- randomForest(survived ~ ., data = titanic_imputed)

Create the explainer:

explain_rf <- DALEX::explain(model = titanic_rf,
                             data = titanic_imputed[,-8],
                             y = titanic_imputed$survived == 1,
                             label = "randomforest"
                             )

Preparation of a new explainer is initiated
  -> model label       :  randomforest 
  -> data              :  2207  rows  7  cols 
  -> target variable   :  2207  values 
  -> predict function  :  yhat.randomForest  will be used (  default  )
  -> predicted values  :  No value for predict function target column. (  default  )
  -> model_info        :  package randomForest , ver. 4.7.1.1 , task classification (  default  ) 
  -> model_info        :  Model info detected classification task but 'y' is a logical . Converted to numeric.  (  NOTE  )
  -> predicted values  :  numerical, min =  0 , mean =  0.2350131 , max =  1  
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  numerical, min =  -0.886 , mean =  0.08714363 , max =  1  
  A new explainer has been created!

Explain a prediction with predict_parts, choosing SHAP as the method:

shap_rf <- predict_parts(explainer = explain_rf,
                         new_observation = titanic_imputed[15,-8],
                         type = "shap",
                         B = 25 # number of random orderings (permutations) to sample
                         )

shap_rf

                                              min           q1      median
randomforest: age = 18               -0.010423199  0.006507476  0.02422882
randomforest: class = 3rd            -0.201079293 -0.126367830 -0.06920344
randomforest: embarked = Southampton -0.022489352 -0.010681242 -0.01012868
randomforest: fare = 9.07            -0.154593566 -0.058991844 -0.02455460
randomforest: gender = female         0.293671047  0.384545537  0.43246217
randomforest: parch = 1              -0.031936565  0.080251817  0.10775804
randomforest: sibsp = 0               0.008140462  0.014347757  0.02413484
                                             mean            q3         max
randomforest: age = 18                0.067138668  0.1240188038  0.19714907
randomforest: class = 3rd            -0.090971092 -0.0672904395 -0.01977254
randomforest: embarked = Southampton -0.006165292 -0.0006504304  0.01238423
randomforest: fare = 9.07            -0.037531346 -0.0193303126  0.04265791
randomforest: gender = female         0.436079928  0.4822868147  0.54142003
randomforest: parch = 1               0.092327612  0.1308228364  0.17770367
randomforest: sibsp = 0               0.028108382  0.0478994110  0.05099230
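Because SHAP is additive, the mean contributions should sum to the gap between this passenger's prediction and the average prediction over the explainer's data. The following is a minimal sketch of that check and is not part of the original post. It rebuilds the same objects so that it runs on its own. It assumes that the object returned by predict_parts converts to a data frame with `variable` and `contribution` columns (as the plotting code later in the post also assumes), and that the baseline is the mean prediction over all rows passed to explain().

```r
library(DALEX)
library(randomForest)

data("titanic_imputed")
titanic_imputed$survived <- factor(titanic_imputed$survived)

set.seed(123)
titanic_rf <- randomForest(survived ~ ., data = titanic_imputed)

explain_rf <- DALEX::explain(model = titanic_rf,
                             data = titanic_imputed[, -8],
                             y = titanic_imputed$survived == 1,
                             label = "randomforest",
                             verbose = FALSE)

new_obs <- titanic_imputed[15, -8]
shap_rf <- predict_parts(explain_rf, new_observation = new_obs,
                         type = "shap", B = 25)

# Mean contribution per variable (the SHAP estimate)
shap_df <- as.data.frame(shap_rf)
mean_contrib <- tapply(shap_df$contribution, shap_df$variable, mean)

# Assumed baseline: average prediction over the explainer's data
baseline <- mean(predict(explain_rf, titanic_imputed[, -8]))
prediction <- as.numeric(predict(explain_rf, new_obs))

# Should be zero up to floating-point error if the assumptions hold
additivity_gap <- baseline + sum(mean_contrib, na.rm = TRUE) - prediction
print(additivity_gap)
```

If the gap is not close to zero, one of the assumptions above (most likely the baseline) does not match the installed DALEX version.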

Plot the result:

plot(shap_rf)

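In this plot, the boxplots show the distribution of each predictor's contribution across all the sampled orderings, and the bars show the mean contribution, which is the Shapley value.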

The boxplots can also be hidden:

plot(shap_rf, show_boxplots = FALSE)

The plot function in DALEX is a wrapper around ggplot2, so ggplot2 syntax can be chained directly onto it.

Beyond that, we can also extract the data and draw the plot ourselves.

library(tidyverse)
library(ggsci)

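shap_rf %>%
  as.data.frame() %>%
  # mean contribution per variable (`.by` requires dplyr >= 1.1.0)
  mutate(mean_con = mean(contribution), .by = variable) %>%
  # order variables by absolute mean contribution
  mutate(variable = fct_reorder(variable, abs(mean_con))) %>%
  ggplot() +
  # bars: mean contribution, one row per variable
  geom_bar(data = \(x) distinct(x, variable, mean_con),
           aes(mean_con, variable, fill = mean_con > 0), alpha = 0.5,
           stat = "identity") +
  # boxplots: distribution of contributions across orderings
  geom_boxplot(aes(contribution, variable, fill = mean_con > 0), width = 0.4) +
  scale_fill_lancet() +
  labs(y = NULL) +
  theme(legend.position = "none")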

That's all!

SHAP is very widely used, and many R packages implement it. I will write several more posts covering the commonly used ones.
