ELToulemonde / dataPreparation

Data preparation for data science projects.

[New_Feature]: Target Encoding - Binary - Ratio + WoE

coforfe opened this issue · comments

Hi Emmanuel-Lin,

These are the two enhancements I mentioned to you earlier.
The idea is to offer two possible transformations for a categorical variable when the target variable is binary (1/0):

  1. A ratio encoding: P(1)/P(0).
  2. The WoE (Weight of Evidence), which is ln(P(1)/P(0)).
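
As a quick worked example of the two definitions above, take a single category level with made-up counts (6 rows with target 1 and 4 rows with target 0; these numbers are purely illustrative and do not come from the data below):

```r
# Hypothetical counts for one category level: 6 ones and 4 zeros.
target <- c(rep(1, 6), rep(0, 4))

p1 <- mean(target == 1)  # P(1) within this level: 0.6
p0 <- mean(target == 0)  # P(0) within this level: 0.4

ratio <- p1 / p0         # ratio encoding: 1.5
woe   <- log(ratio)      # WoE as defined above: ln(1.5), about 0.405

print(c(ratio = ratio, woe = woe))
```

A level with more ones than zeros gets a ratio above 1 and a positive WoE; a balanced level gets ratio 1 and WoE 0.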

Here is reference code for each case.

For the ratio:

library(data.table)
set.seed(77777)
dt_train <- data.table(
  x1 = sample(c('a','b', 'c'), 150, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
  target = sample( c(0,1), 150, replace = TRUE)
)

dt_test <- data.table(
  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)

# Target Encoding Train - Ratio P(1)/P(0)
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , re_x1   := ifelse(te_x1_0 == 0, 1, te_x1_1 / te_x1_0)]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , re_x2   := ifelse(te_x2_0 == 0, 1, te_x2_1 / te_x2_0)]
dt_train

# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, re_x1)])
x2_vals <- unique(dt_train[, .(x2, re_x2)])

dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test

And this for the WoE:

library(data.table)

set.seed(77777)
dt_train <- data.table(
  x1 = sample(c('a','b', 'c'), 150, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
  target = sample( c(0,1), 150, replace = TRUE)
)

dt_test <- data.table(
  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)

# Target Encoding Train - WoE ln(P(1)/P(0))
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , woe_x1  := ifelse(te_x1_0 == 0, 0, log(te_x1_1 / te_x1_0)) ]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , woe_x2  := ifelse(te_x2_0 == 0, 0, log(te_x2_1 / te_x2_0)) ]
dt_train

# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, woe_x1)])
x2_vals <- unique(dt_train[, .(x2, woe_x2)])

dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test

Thanks!
Carlos.

Hey Carlos,

I looked into this issue.

Correct me if I'm wrong, but to implement this feature we can reuse target_encode.

Designing a specific function to perform the aggregation:

weight_of_evidence <- function(x){
  return(log(sum(x == 1) / sum(x == 0)))
}

I can apply it with:

library(dataPreparation)


target_encoding = build_target_encoding(dataSet = dt_train, 
                                        cols_to_encode = "x1", 
                                        target_col = "target",
                                        functions = "weight_of_evidence")

target_encode(dt_train , target_encoding)

And the result is:

     x1 x2 target target_weight_of_evidence_by_x1
  1:  a aa      1                      -0.3715636
  2:  a bb      1                      -0.3715636
  3:  c bb      0                       0.0000000
  4:  b bb      0                      -0.1967103
  5:  b bb      0                      -0.1967103
 ---                                             
146:  a dd      0                      -0.3715636
147:  a aa      1                      -0.3715636
148:  a cc      1                      -0.3715636
149:  c cc      0                       0.0000000
150:  b aa      0                      -0.1967103
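
One caveat with an aggregation function of this shape: when a category level contains only ones or only zeros, the log of the count ratio is infinite. The sketch below shows the problem and one possible workaround, additive smoothing of both counts. The helper name `smoothed_weight_of_evidence` and the `eps` parameter are hypothetical and are not part of dataPreparation.

```r
# Same aggregation as in the thread: log of (count of 1s / count of 0s).
weight_of_evidence <- function(x) {
  log(sum(x == 1) / sum(x == 0))
}

# Hypothetical variant: add a small constant to both counts so the
# result stays finite when a level has no 0s or no 1s.
smoothed_weight_of_evidence <- function(x, eps = 0.5) {
  n1 <- sum(x == 1)
  n0 <- sum(x == 0)
  log((n1 + eps) / (n0 + eps))
}

only_ones <- c(1, 1, 1)
weight_of_evidence(only_ones)           # Inf: division by a zero count
smoothed_weight_of_evidence(only_ones)  # log(3.5 / 0.5) = log(7), finite
```

The choice of `eps` is an assumption; it shifts every level's value slightly toward 0, so it trades a small bias for finite encodings.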

So I wonder whether

  • each user should write their own aggregation function, or
  • dataPreparation should provide a set of aggregation functions. If so, I wonder how to generalize this enough, and which other functions we should add.

Cheers

Hi!

  • I think that, as a starting point, providing a set of predefined functions would be more than enough.
  • I do not know of many other functions for encoding against the target.

Perhaps another feature could be to encode with all the available options when none is specified, in the same way you have a function that applies all possible enhancements.
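
A minimal base-R sketch of that idea, assuming a named set of predefined aggregation functions where all of them are applied unless specific ones are requested. The names `encoders` and `encode_levels` are hypothetical, for illustration only, and say nothing about how dataPreparation actually implements this.

```r
# Hypothetical registry of predefined aggregation functions.
encoders <- list(
  mean  = function(x) mean(x),
  ratio = function(x) sum(x == 1) / sum(x == 0),
  woe   = function(x) log(sum(x == 1) / sum(x == 0))
)

# Compute one value per category level for each requested function.
# When `functions` is not given, every registered function is applied.
encode_levels <- function(category, target, functions = names(encoders)) {
  sapply(encoders[functions], function(f) tapply(target, category, f))
}

# Made-up toy data: level "a" has 2 ones and 1 zero, "b" has 1 one and 3 zeros.
category <- c("a", "a", "a", "b", "b", "b", "b")
target   <- c(1, 1, 0, 1, 0, 0, 0)

res <- encode_levels(category, target)        # all encodings
print(res)
print(encode_levels(category, target, "woe")) # only the requested one
```

The result has one row per level and one column per function, so it can be joined back onto train or test data by level.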

Thanks!
Carlos.

A few years later ... it has been implemented and merged in #66

Thanks for your inputs!

Thanks!
Carlos.