SHAP Model Explanation in R
SHAP Model Explanation in R, by 医学和生信笔记
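Break-down explanations depend heavily on the order of the predictors. There are two ways to address this: put the most important variables first, or identify interactions between variables and use a dedicated method.
Neither approach is fully satisfactory, which led to SHAP (SHapley Additive exPlanations). SHAP is based on the "Shapley values" that Lloyd Shapley introduced in game theory, and the name is the acronym of a method designed specifically for predictive models.
Put simply, SHAP considers the possible orderings of the variables and computes each variable's average contribution (also called its average attribution) across them. In practice a random sample of orderings is used rather than every one. This approach is called permutation SHAP.
As a model-agnostic explanation, the method applies to any model; this post uses a random forest for the demonstration.
If you already understand how break-down explanations work, permutation SHAP is easy to follow. The detailed formulas are not shown here; interested readers can look them up.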
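To make the averaging idea concrete, here is a minimal base R sketch. It is not DALEX's implementation: the toy prediction function, the simulated background data, and the helper names (toy_predict, path_contributions, permutation_shap) are all hypothetical and exist only for illustration. For each random ordering, variables are fixed one at a time to the values of the observation being explained, the change in the mean prediction is recorded, and the changes are then averaged over orderings.

```r
# Hypothetical toy "model" with an interaction term, so that the order in
# which variables are fixed changes the contributions of x2 and x3.
toy_predict <- function(newdata) {
  with(newdata, 2 * x1 + x2 * x3)
}

# Hypothetical background data and the observation to explain.
set.seed(1)
background <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
new_obs <- data.frame(x1 = 1, x2 = 2, x3 = -1)

# Contributions along one ordering: fix variables one by one to the values
# of `obs` and record the change in the mean prediction at each step.
path_contributions <- function(ordering, predict_fun, data, obs) {
  current <- data
  previous_mean <- mean(predict_fun(current))
  contrib <- setNames(numeric(length(ordering)), ordering)
  for (v in ordering) {
    current[[v]] <- obs[[v]]
    new_mean <- mean(predict_fun(current))
    contrib[v] <- new_mean - previous_mean
    previous_mean <- new_mean
  }
  contrib
}

# Average the per-ordering contributions over B random orderings.
permutation_shap <- function(predict_fun, data, obs, B = 25) {
  vars <- names(data)
  paths <- replicate(
    B,
    path_contributions(sample(vars), predict_fun, data, obs)[vars]
  )
  rowMeans(paths)
}

shap_values <- permutation_shap(toy_predict, background, new_obs, B = 25)
baseline <- mean(toy_predict(background))
print(shap_values)

# The contributions add up to the prediction minus the average prediction.
print(all.equal(sum(shap_values), toy_predict(new_obs) - baseline))
```

Because x1 enters the toy model additively, its contribution is the same in every ordering; only x2 and x3, which interact, vary from one ordering to the next. That variation is what the boxplots in the DALEX plot summarize.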
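Today we start with instance-level SHAP in R, again using DALEX: three lines of code are enough! There is a lot more to say about SHAP, which later posts will cover step by step.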
library(DALEX)
data("titanic_imputed")
# Convert the outcome variable to a factor
titanic_imputed$survived <- factor(titanic_imputed$survived)
dim(titanic_imputed)
[1] 2207 8
str(titanic_imputed)
'data.frame': 2207 obs. of 8 variables:
$ gender : Factor w/ 2 levels "female","male": 2 2 2 1 1 2 2 1 2 2 ...
$ age : num 42 13 16 39 16 25 30 28 27 20 ...
$ class : Factor w/ 7 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 2 2 3 3 ...
$ embarked: Factor w/ 4 levels "Belfast","Cherbourg",..: 4 4 4 4 4 4 2 2 2 4 ...
$ fare : num 7.11 20.05 20.05 20.05 7.13 ...
$ sibsp : num 0 0 1 1 0 0 1 1 0 0 ...
$ parch : num 0 2 1 1 0 0 0 0 0 0 ...
$ survived: Factor w/ 2 levels "0","1": 1 1 1 2 2 2 1 2 2 2 ...
Fit a random forest model:
library(randomForest)
set.seed(123)
titanic_rf <- randomForest(survived ~ ., data = titanic_imputed)
Create the explainer:
explain_rf <- DALEX::explain(model = titanic_rf,
data = titanic_imputed[,-8],
y = titanic_imputed$survived == 1,
label = "randomforest"
)
Preparation of a new explainer is initiated
-> model label : randomforest
-> data : 2207 rows 7 cols
-> target variable : 2207 values
-> predict function : yhat.randomForest will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package randomForest , ver. 4.7.1.1 , task classification ( default )
-> model_info : Model info detected classification task but 'y' is a logical . Converted to numeric. ( NOTE )
-> predicted values : numerical, min = 0 , mean = 0.2350131 , max = 1
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -0.886 , mean = 0.08714363 , max = 1
A new explainer has been created!
Use predict_parts to compute the explanation, with the method set to SHAP:
shap_rf <- predict_parts(explainer = explain_rf,
new_observation = titanic_imputed[15,-8],
type = "shap",
B = 25 # number of random orderings (permutations) to sample
)
shap_rf
min q1 median
randomforest: age = 18 -0.010423199 0.006507476 0.02422882
randomforest: class = 3rd -0.201079293 -0.126367830 -0.06920344
randomforest: embarked = Southampton -0.022489352 -0.010681242 -0.01012868
randomforest: fare = 9.07 -0.154593566 -0.058991844 -0.02455460
randomforest: gender = female 0.293671047 0.384545537 0.43246217
randomforest: parch = 1 -0.031936565 0.080251817 0.10775804
randomforest: sibsp = 0 0.008140462 0.014347757 0.02413484
mean q3 max
randomforest: age = 18 0.067138668 0.1240188038 0.19714907
randomforest: class = 3rd -0.090971092 -0.0672904395 -0.01977254
randomforest: embarked = Southampton -0.006165292 -0.0006504304 0.01238423
randomforest: fare = 9.07 -0.037531346 -0.0193303126 0.04265791
randomforest: gender = female 0.436079928 0.4822868147 0.54142003
randomforest: parch = 1 0.092327612 0.1308228364 0.17770367
randomforest: sibsp = 0 0.028108382 0.0478994110 0.05099230
Plot the result:
plot(shap_rf)
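In this plot, the boxplots show how each predictor's contribution is distributed across the sampled orderings, and the bars show the mean contribution, which is the Shapley value.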
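The link between the bars and the boxplots can be checked on the data behind the plot. The sketch below assumes DALEX and randomForest are installed and refits the model from this post so that it runs on its own. It relies only on the contribution and variable columns, the same ones used in this post's custom plot; the name mean_contrib is introduced here for illustration. Averaging contribution within each variable should give the values drawn as bars.

```r
library(DALEX)
library(randomForest)

# Rebuild the objects from this post so the snippet runs on its own
data("titanic_imputed")
titanic_imputed$survived <- factor(titanic_imputed$survived)
set.seed(123)
titanic_rf <- randomForest(survived ~ ., data = titanic_imputed)
explain_rf <- DALEX::explain(model = titanic_rf,
                             data = titanic_imputed[, -8],
                             y = titanic_imputed$survived == 1,
                             label = "randomforest",
                             verbose = FALSE)
shap_rf <- predict_parts(explainer = explain_rf,
                         new_observation = titanic_imputed[15, -8],
                         type = "shap",
                         B = 25)

# One mean contribution per variable: the quantity shown as a bar
mean_contrib <- aggregate(contribution ~ variable,
                          data = as.data.frame(shap_rf),
                          FUN = mean)

# Sort by absolute size so the most influential variables come first
mean_contrib[order(-abs(mean_contrib$contribution)), ]
```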
The boxplots can also be hidden:
plot(shap_rf, show_boxplots = FALSE)
The plot function in DALEX is a wrapper around ggplot2, so ggplot2 syntax can be chained onto it directly.
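As a small illustration of chaining ggplot2 syntax onto the result, the sketch below adds a title and a theme tweak. It assumes DALEX, randomForest and ggplot2 are installed and rebuilds the objects from this post so that it is self-contained; the title text is only a placeholder.

```r
library(DALEX)
library(randomForest)
library(ggplot2)

# Rebuild the objects from this post so the snippet runs on its own
data("titanic_imputed")
titanic_imputed$survived <- factor(titanic_imputed$survived)
set.seed(123)
titanic_rf <- randomForest(survived ~ ., data = titanic_imputed)
explain_rf <- DALEX::explain(model = titanic_rf,
                             data = titanic_imputed[, -8],
                             y = titanic_imputed$survived == 1,
                             label = "randomforest",
                             verbose = FALSE)
shap_rf <- predict_parts(explainer = explain_rf,
                         new_observation = titanic_imputed[15, -8],
                         type = "shap",
                         B = 25)

# plot() returns a ggplot object, so ordinary ggplot2 layers can be added
p <- plot(shap_rf) +
  labs(title = "SHAP contributions, row 15") +  # placeholder title
  theme(plot.title = element_text(face = "bold"))
p
```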
Alternatively, we can extract the data and draw the plot ourselves.
library(tidyverse)
library(ggsci)
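shap_rf %>%
  as.data.frame() %>%
  # mean contribution per variable (.by requires dplyr >= 1.1.0)
  mutate(mean_con = mean(contribution), .by = variable) %>%
  # order variables by the absolute value of their mean contribution
  mutate(variable = fct_reorder(variable, abs(mean_con))) %>%
  ggplot() +
  # bars: one row per variable, showing the mean contribution
  geom_bar(data = \(x) distinct(x, variable, mean_con),
           aes(mean_con, variable, fill = mean_con > 0), alpha = 0.5,
           stat = "identity") +
  # boxplots: spread of the contributions across orderings
  geom_boxplot(aes(contribution, variable, fill = mean_con > 0), width = 0.4) +
  scale_fill_lancet() +
  labs(y = NULL) +
  theme(legend.position = "none")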
OVER!
SHAP is very widely used, and many R packages implement it. I will write several more posts covering the commonly used ones.