[New_Feature]: Target Encoding - Binary - Ratio + WoE

Question

coforfe opened this issue 5 years ago · comments

Hi Emmanuel-Lin,

These are the two enhancements I mentioned to you earlier.
The basic idea is to provide two possible transformations for a categorical variable when the target variable is binary (1/0):

A ratio encoding: P(1)/P(0).
A Weight of Evidence (WoE) encoding: ln(P(1)/P(0)).

Here is reference code for each case.

For the ratio:

library(data.table)
set.seed(77777)
dt_train <- data.table(
  x1 = sample(c('a','b', 'c'), 150, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
  target = sample( c(0,1), 150, replace = TRUE)
)

dt_test <- data.table(
  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)

# Target Encoding Train - Ratio P(1)/P(0)
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , re_x1   := ifelse(te_x1_0 == 0, 1, te_x1_1 / te_x1_0)]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , re_x2   := ifelse(te_x2_0 == 0, 1, te_x2_1 / te_x2_0)]
dt_train

# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, re_x1)])
x2_vals <- unique(dt_train[, .(x2, re_x2)])

dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test
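
One thing to keep in mind: merge() as used above performs an inner join by default, so any test row whose level was never seen in training would be dropped. A minimal sketch of an alternative using a data.table update join follows. The toy lookup values, the unseen level 'z', and the neutral fallback of 1 are all assumptions made for illustration, not part of the proposal.

```r
library(data.table)

# Toy lookup table, shaped like the one built from the training set
x1_vals <- data.table(x1 = c("a", "b", "c"), re_x1 = c(0.8, 1.2, 1.0))

# New data containing a level ('z') never seen in training
dt_new <- data.table(x1 = c("b", "z", "a"))

# Update join: adds re_x1 by reference, keeps every row and the row order
dt_new[x1_vals, on = "x1", re_x1 := i.re_x1]

# Assumption: unseen levels fall back to a neutral ratio of 1
dt_new[is.na(re_x1), re_x1 := 1]
dt_new
```

Unlike merge(), this keeps the original row order without needing sort = FALSE.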

And this for the WoE:

library(data.table)

set.seed(77777)
dt_train <- data.table(
  x1 = sample(c('a','b', 'c'), 150, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
  target = sample( c(0,1), 150, replace = TRUE)
)

dt_test <- data.table(
  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)

# Target Encoding Train - WoE ln(P(1)/P(0))
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , woe_x1  := ifelse(te_x1_0 == 0, 0, log(te_x1_1 / te_x1_0)) ]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , woe_x2  := ifelse(te_x2_0 == 0, 0, log(te_x2_1 / te_x2_0)) ]
dt_train

# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, woe_x1)])
x2_vals <- unique(dt_train[, .(x2, woe_x2)])

dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test

Thanks!
Carlos.

ELToulemonde · Answer 1 · Tue Jul 16 2019 21:24:34 GMT+0800 (China Standard Time)

Hey Carlos,

I looked into this issue.

Correct me if I'm wrong, but to implement this feature we can reuse target_encode.

We only need to design a specific function to perform the aggregation:

weight_of_evidence <- function(x){
  return(log(sum(x == 1) / sum(x == 0)))
}

I can apply it with:

library(dataPreparation)


target_encoding = build_target_encoding(dataSet = dt_train, 
                                        cols_to_encode = "x1", 
                                        target_col = "target",
                                        functions = "weight_of_evidence")

target_encode(dt_train , target_encoding)

And the result is:

     x1 x2 target target_weight_of_evidence_by_x1
  1:  a aa      1                      -0.3715636
  2:  a bb      1                      -0.3715636
  3:  c bb      0                       0.0000000
  4:  b bb      0                      -0.1967103
  5:  b bb      0                      -0.1967103
 ---                                             
146:  a dd      0                      -0.3715636
147:  a aa      1                      -0.3715636
148:  a cc      1                      -0.3715636
149:  c cc      0                       0.0000000
150:  b aa      0                      -0.1967103
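
One caveat with the weight_of_evidence function above: when a level contains only 1s or only 0s, the result is Inf or -Inf. A possible way around this is additive smoothing, sketched below. The function name and the eps parameter are hypothetical; they are not part of dataPreparation.

```r
# Hypothetical variant of weight_of_evidence with additive smoothing.
# 'eps' is an assumed smoothing constant, not a dataPreparation parameter.
weight_of_evidence_smoothed <- function(x, eps = 0.5) {
  n1 <- sum(x == 1)  # number of positives in the level
  n0 <- sum(x == 0)  # number of negatives in the level
  log((n1 + eps) / (n0 + eps))
}

weight_of_evidence_smoothed(c(1, 1, 1))     # finite, where the unsmoothed version gives Inf
weight_of_evidence_smoothed(c(1, 0, 1, 0))  # 0 for balanced classes
```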

So I wonder whether:

each user should write their own aggregation function, or
dataPreparation should provide a set of aggregation functions. If so, I wonder how to generalize this enough, and which other functions we should add.
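
To make the second option more concrete, here is a rough sketch in base R of what a set of predefined aggregation functions, selected by name, could look like. The names encoders and encode_by are hypothetical and are not claimed to exist in dataPreparation.

```r
# Hypothetical registry of aggregation functions for a binary (0/1) target.
# None of these names are claimed to exist in dataPreparation.
encoders <- list(
  mean  = function(x) mean(x),
  ratio = function(x) sum(x == 1) / sum(x == 0),
  woe   = function(x) log(sum(x == 1) / sum(x == 0))
)

# Apply one encoder, chosen by name, to the target grouped by a categorical column
encode_by <- function(target, groups, encoder = "mean") {
  tapply(target, groups, encoders[[encoder]])
}

x1     <- c("a", "a", "a", "b", "b", "b", "b")
target <- c(1, 1, 0, 1, 0, 0, 0)
encode_by(target, x1, "ratio")  # a: 2/1 = 2, b: 1/3
```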

Cheers

Carlos Ortega · Answer 2 · Wed Jul 17 2019 01:49:30 GMT+0800 (China Standard Time)

Hi!

I think that, as a starting point, providing a set of predefined functions would be more than enough.
I am not aware of many other functions for encoding against the target.

Perhaps another option would be to encode with all of the available functions when none is specified, in the same way you already have a function that applies all possible enhancements.
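
As a rough illustration of that default behaviour, a sketch in base R could look like the following. The names default_encoders and encode_all are made up for this example and do not refer to anything in dataPreparation.

```r
# Hypothetical: when 'encoders' is NULL, fall back to every predefined encoder.
default_encoders <- list(
  mean = function(x) mean(x),
  woe  = function(x) log(sum(x == 1) / sum(x == 0))
)

encode_all <- function(target, groups, encoders = NULL) {
  if (is.null(encoders)) encoders <- default_encoders
  # One column per encoder, one row per level of 'groups'
  sapply(encoders, function(f) tapply(target, groups, f))
}

x1     <- c("a", "a", "a", "b", "b", "b", "b")
target <- c(1, 1, 0, 1, 0, 0, 0)
encode_all(target, x1)  # matrix with rows a, b and columns mean, woe
```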

Thanks!
Carlos.

ELToulemonde · Answer 3 · Fri Jul 15 2022 19:15:03 GMT+0800 (China Standard Time)

A few years later ... it has been implemented and merged in #66

Thanks for your inputs!

Carlos Ortega · Answer 4 · Fri Jul 15 2022 20:06:10 GMT+0800 (China Standard Time)

Thanks!
Carlos.