[New_Feature]: Target Encoding - Binary - Ratio + WoE
coforfe opened this issue
Hi Emmanuel-Lin,
These are the two enhancements I mentioned to you.
The idea is to create two possible transformations for a categorical variable when the target variable is binary (1/0):
- A ratio encoding, P(1)/P(0).
- The WoE (Weight of Evidence), defined here as ln(P(1)/P(0)).
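As a quick illustration of the two quantities for a single category level, here is a minimal sketch on a made-up target vector (the vector and variable names are assumptions for illustration only):

```r
# Hypothetical target values observed for one category level: three 1s, one 0
target <- c(1, 1, 1, 0)

p1 <- mean(target)   # P(1) = 0.75
p0 <- 1 - p1         # P(0) = 0.25

ratio <- p1 / p0     # ratio encoding P(1)/P(0)
woe   <- log(ratio)  # WoE as defined in this issue: ln(P(1)/P(0))
```

With these values the ratio is 3 and the WoE is ln(3).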
Here is reference code for each case.
For the ratio:
library(data.table)
set.seed(77777)
dt_train <- data.table(
  x1 = sample(c('a', 'b', 'c'), 150, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
  target = sample(c(0, 1), 150, replace = TRUE)
)
dt_test <- data.table(
  x1 = sample(c('a', 'b', 'c'), 15, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)
# Target Encoding Train - Ratio P(1)/P(0)
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , re_x1 := ifelse(te_x1_0 == 0, 1, te_x1_1 / te_x1_0)]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , re_x2 := ifelse(te_x2_0 == 0, 1, te_x2_1 / te_x2_0)]
dt_train
# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, re_x1)])
x2_vals <- unique(dt_train[, .(x2, re_x2)])
dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test
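One thing to keep in mind with the test-set step: merge() defaults to an inner join, so test rows whose level never appeared in training would be dropped. A minimal sketch of a left join with a fallback value, assuming a neutral ratio of 1 is acceptable for unseen levels (the lookup values, the `id` column and the level "z" are made up for illustration):

```r
library(data.table)

# Hypothetical lookup learned on the training set
x1_vals <- data.table(x1 = c("a", "b"), re_x1 = c(0.8, 1.2))

# Test set containing a level ("z") never seen in training
dt_test <- data.table(id = 1:3, x1 = c("a", "z", "b"))

# all.x = TRUE keeps test rows without a match; their re_x1 is NA
encoded <- merge(dt_test, x1_vals, by = "x1", all.x = TRUE, sort = FALSE)

# Assumed fallback: ratio of 1, i.e. P(1) == P(0), for unseen levels
encoded[is.na(re_x1), re_x1 := 1]
```

Whether 1 is the right fallback depends on the use case; the overall training ratio would be another reasonable choice.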
And this for the WoE:
library(data.table)
set.seed(77777)
dt_train <- data.table(
  x1 = sample(c('a', 'b', 'c'), 150, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
  target = sample(c(0, 1), 150, replace = TRUE)
)
dt_test <- data.table(
  x1 = sample(c('a', 'b', 'c'), 15, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)
# Target Encoding Train - WoE ln(P(1)/P(0))
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , woe_x1 := ifelse(te_x1_0 == 0, 0, log(te_x1_1 / te_x1_0)) ]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , woe_x2 := ifelse(te_x2_0 == 0, 0, log(te_x2_1 / te_x2_0)) ]
dt_train
# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, woe_x1)])
x2_vals <- unique(dt_train[, .(x2, woe_x2)])
dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test
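The WoE code above falls back to 0 when P(0) is 0, but a level with no positive cases still gives log(0) = -Inf. One common way around both edge cases is additive smoothing of the counts. The helper below is only a sketch: the name `smoothed_woe` and the default `alpha = 0.5` are assumptions, not part of the proposal.

```r
# Hypothetical helper: WoE with additive smoothing, so that a level with
# only 1s or only 0s still yields a finite value
smoothed_woe <- function(target, alpha = 0.5) {
  n1 <- sum(target == 1)  # number of positive cases in the level
  n0 <- sum(target == 0)  # number of negative cases in the level
  log((n1 + alpha) / (n0 + alpha))
}

only_ones  <- smoothed_woe(c(1, 1, 1))  # finite, positive
only_zeros <- smoothed_woe(c(0, 0))     # finite, negative
balanced   <- smoothed_woe(c(0, 1))     # equal counts give 0
```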
Thanks!
Carlos.
Hey Carlos,
I looked into this issue.
Correct me if I'm wrong, but to implement this feature we can reuse target_encode
by designing a specific function to perform the aggregation:
weight_of_evidence <- function(x){
  return(log(sum(x == 1) / sum(x == 0)))
}
I can apply it with:
library(dataPreparation)
target_encoding = build_target_encoding(dataSet = dt_train,
                                        cols_to_encode = "x1",
                                        target_col = "target",
                                        functions = "weight_of_evidence")
target_encode(dt_train , target_encoding)
And the result is:
     x1 x2 target target_weight_of_evidence_by_x1
  1:  a aa      1                      -0.3715636
  2:  a bb      1                      -0.3715636
  3:  c bb      0                       0.0000000
  4:  b bb      0                      -0.1967103
  5:  b bb      0                      -0.1967103
 ---
146:  a dd      0                      -0.3715636
147:  a aa      1                      -0.3715636
148:  a cc      1                      -0.3715636
149:  c cc      0                       0.0000000
150:  b aa      0                      -0.1967103
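A value of 0 for a level means it has as many 1s as 0s, since log(1) = 0. As a sanity check, the same aggregation can be reproduced without the package using plain data.table grouping. This is a sketch on a tiny made-up table, not the data above:

```r
library(data.table)

# Same aggregation as in the comment above
weight_of_evidence <- function(x) log(sum(x == 1) / sum(x == 0))

# Hypothetical toy data: level "a" has one 1 and two 0s, level "c" is balanced
dt <- data.table(
  x1     = c("a", "a", "a", "c", "c"),
  target = c(1, 0, 0, 1, 0)
)

# Per-level WoE, broadcast back to every row of the level
dt[, woe_x1 := weight_of_evidence(target), by = x1]
```

Here level "a" gets log(1/2) and level "c" gets 0.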
So I wonder whether
- each user should write their own aggregation function, or
- dataPreparation should provide a set of aggregation functions. If so, I wonder how to generalize this enough, and which other functions we should add.
Cheers
Hi!
- I think that providing a set of predefined functions as a starting point would be more than enough.
- I am not aware of many other functions for encoding the target.
Perhaps another feature could be to encode with all the available options when none is specified, in the same way you have a function that applies all possible enhancements.
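To make the "predefined set, all applied by default" idea concrete, here is a rough sketch. Everything in it is hypothetical: `available_encodings`, `encode_target` and the generated column names are illustrative and are not the dataPreparation API.

```r
library(data.table)

# Hypothetical registry of predefined encodings for a binary (0/1) target
available_encodings <- list(
  ratio = function(x) sum(x == 1) / sum(x == 0),
  woe   = function(x) log(sum(x == 1) / sum(x == 0))
)

# Hypothetical helper: when `functions` is NULL, apply every registered encoding
encode_target <- function(dt, col, target_col, functions = NULL) {
  if (is.null(functions)) functions <- names(available_encodings)
  out <- copy(dt)  # avoid modifying the caller's table by reference
  for (fn in functions) {
    f <- available_encodings[[fn]]
    new_col <- paste(fn, col, sep = "_")
    out[, (new_col) := f(get(target_col)), by = c(col)]
  }
  out
}

dt <- data.table(
  x1     = c("a", "a", "a", "a", "b", "b"),
  target = c(1, 1, 1, 0, 1, 0)
)
encoded <- encode_target(dt, "x1", "target")  # adds ratio_x1 and woe_x1
```

Passing `functions = "woe"` would restrict the output to a single encoding.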
Thanks!
Carlos.
A few years later ... it has been implemented and merged in #66
Thanks for your inputs!
Thanks!
Carlos.