ELToulemonde / dataPreparation

Data preparation for data science projects.

[New_Feature]: Target Encoding.

coforfe opened this issue

Hello Emmanuel-Lin,

How are you?
It has been a long time since I last proposed new ideas to enhance your very good package.
This time I would like to propose a way to encode a categorical variable as numeric based on the values of a target column. The target column could be the one used for binary classification, regression or multiclass problems.

The idea is reflected in this code:

library(data.table)

dt_train <- data.table(
                  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
                  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE),
                  target = sample( c(0,1), 15, replace = TRUE)
)

dt_test <- data.table(
                  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
                  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)

# Target Encoding Train
dt_train[ , te_x1 := mean(target), by=x1]
dt_train[ , te_x2 := mean(target), by=x2]

dt_train

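# Target Encoding on Test
# Build one lookup table per encoded column from the training set.
x1_vals <- unique(dt_train[, .(x1, te_x1)])
x2_vals <- unique(dt_train[, .(x2, te_x2)])

# Note: merge() does an inner join by default, so test rows whose level
# was not seen in dt_train are dropped.
dt_test <- merge(dt_test, x1_vals, by = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by = 'x2', sort = FALSE)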
dt_test

The main issue is what you see for the dt_test set. Typically it doesn't have the target variable, so the idea is to compute the encoding on the dt_train set and apply it to dt_test. To apply this transformation, the function would therefore need the test set to be available in the workspace.

After applying the encoding the dt_test would be something like this:

> dt_test
    x2 x1     te_x1     te_x2
 1: dd  b 0.2500000 0.5000000
 2: aa  b 0.2500000 0.3333333
 3: dd  b 0.2500000 0.5000000
 4: dd  b 0.2500000 0.5000000
 5: bb  b 0.2500000 0.3333333
 6: aa  a 0.6666667 0.3333333
 7: cc  a 0.6666667 0.4000000
 8: bb  c 0.3750000 0.3333333
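
One point the example above leaves open is what happens to test rows whose category never appears in the training set. A minimal sketch, assuming a fallback to the global training mean (that fallback is an assumption of this sketch, not something proposed in the thread):

```r
library(data.table)

# Small fixed training set so the result is reproducible.
dt_train <- data.table(
  x1     = c("a", "a", "b", "b", "b", "c"),
  target = c(1, 0, 1, 1, 0, 0)
)
# The test set has no target column and contains a level ("d") unseen in train.
dt_test <- data.table(x1 = c("a", "b", "d"))

# Learn the encoding on the training set only.
x1_vals     <- dt_train[, .(te_x1 = mean(target)), by = x1]
global_mean <- dt_train[, mean(target)]

# Join that keeps every test row, in test order; unseen levels come back as NA.
dt_test_enc <- x1_vals[dt_test, on = "x1"]

# Assumption: replace NA with the global training mean.
dt_test_enc[is.na(te_x1), te_x1 := global_mean]
dt_test_enc
```

Other fallbacks (keeping NA, or a smoothed estimate) would work just as well; the point is only that the choice has to be made explicitly.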

You can also encode with other aggregation functions; I used mean().
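
As a rough illustration of encoding with several functions, here is a sketch that loops over a named list of aggregations. The te_x1_ column prefix and the choice of functions are only assumptions for the example:

```r
library(data.table)

dt_train <- data.table(
  x1     = c("a", "a", "a", "b", "b", "b"),
  target = c(10, 20, 60, 1, 2, 3)
)

# Named list of aggregation functions; the names become column suffixes.
funs <- list(mean = mean, median = median, max = max)

for (fun_name in names(funs)) {
  new_col <- paste0("te_x1_", fun_name)
  # Apply the aggregation to the target within each level of x1.
  dt_train[, (new_col) := funs[[fun_name]](target), by = x1]
}
dt_train
```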

This technique is also used in many dataset transformations as a _feature engineering_ option. H2O very recently included it in their framework:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/target-encoding.html

Thank you very much for your work on your package,
Carlos.

Hello Carlos,

I'm good, and you?

Thanks for the idea and for the great details you gave.

The feature has been implemented on the branch build_target_encoding. Could you check it out to verify that it is what you expected? If you have any suggestions on the implementation, please don't hesitate.

Usage:

library(data.table)
library(dataPreparation)

dt_train <- data.table(
  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE),
  target = sample( c(0,1), 15, replace = TRUE)
)

dt_test <- data.table(
  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)

# Target Encoding Train
x1_target_encoding <- build_target_encoding(dt_train, col_to_encode="x1", target_col="target", functions=c("mean"))
dt_train <- target_encode(dt_train, col_to_encode="x1", target_encoding=x1_target_encoding)
dt_test <- target_encode(dt_test, col_to_encode="x1", target_encoding=x1_target_encoding)
dt_test

x2_target_encoding <- build_target_encoding(dt_train, col_to_encode="x2", target_col="target", functions=c("mean"))
dt_train <- target_encode(dt_train, col_to_encode="x2", target_encoding=x2_target_encoding)
dt_test <- target_encode(dt_test, col_to_encode="x2", target_encoding=x2_target_encoding)
dt_test

Hello Emmanuel-Lin,

Thanks!
I have just checked it and it works perfectly!

> 
> dt_train <- data.table(
+     x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
+     x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE),
+     target = sample( c(0,1), 15, replace = TRUE)
+ )
> 
> dt_test <- data.table(
+     x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
+     x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
+ )
> 
> # Target Encoding Train
> x1_target_encoding <- build_target_encoding(dt_train, col_to_encode="x1", target_col="target", functions=c("mean"))
> dt_train <- target_encode(dt_train, col_to_encode="x1", target_encoding=x1_target_encoding)
> dt_test <- target_encode(dt_test, col_to_encode="x1", target_encoding=x1_target_encoding)
> dt_test
    x1 x2 target_mean_by_x1
 1:  c dd               0.5
 2:  a bb               0.5
 3:  c cc               0.5
 4:  a dd               0.5
 5:  a dd               0.5
 6:  a bb               0.5
 7:  b dd               0.2
 8:  c aa               0.5
 9:  c bb               0.5
10:  b cc               0.2
11:  b bb               0.2
12:  a cc               0.5
13:  b bb               0.2
14:  c cc               0.5
15:  c dd               0.5
> 
> x2_target_encoding <- build_target_encoding(dt_train, col_to_encode="x2", target_col="target", functions=c("mean"))
> dt_train <- target_encode(dt_train, col_to_encode="x2", target_encoding=x2_target_encoding)
> dt_test <- target_encode(dt_test, col_to_encode="x2", target_encoding=x2_target_encoding)
> dt_test
    x2 x1 target_mean_by_x1 target_mean_by_x2
 1: dd  c               0.5         0.2500000
 2: bb  a               0.5         0.6666667
 3: cc  c               0.5         0.2000000
 4: dd  a               0.5         0.2500000
 5: dd  a               0.5         0.2500000
 6: bb  a               0.5         0.6666667
 7: dd  b               0.2         0.2500000
 8: aa  c               0.5         0.6666667
 9: bb  c               0.5         0.6666667
10: cc  b               0.2         0.2000000
11: bb  b               0.2         0.6666667
12: cc  a               0.5         0.2000000
13: bb  b               0.2         0.6666667
14: cc  c               0.5         0.2000000
15: dd  c               0.5         0.2500000

One enhancement could be to pass a list of variables to encode. I imagine that this is how you will include it within the prepareSet() function.

Also, there are some variations of this approach:

  • Calculating the probabilities of 0 and 1 by category and then the ratio P(1)/P(0).
  • Taking the log of this ratio, which gives the WoE (weight of evidence) up to a constant offset.
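
These two variations can be sketched as follows. Under the common definition, WoE compares the share of all 1s falling in a category with the share of all 0s, which equals the per-category log odds minus the log of the overall odds (sign conventions vary between sources). The column names are assumptions for the example, and categories containing only 0s or only 1s would give infinite values, so a real implementation would need some smoothing.

```r
library(data.table)

dt_train <- data.table(
  x1     = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c"),
  target = c(1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0)
)

# Per-category probabilities, their ratio P(1)/P(0), and the log of that ratio.
enc <- dt_train[, .(p1 = mean(target == 1), p0 = mean(target == 0)), by = x1]
enc[, odds := p1 / p0]
enc[, log_odds := log(odds)]

# WoE computed directly: share of all 1s in the category over share of all 0s.
n1 <- dt_train[, sum(target == 1)]
n0 <- dt_train[, sum(target == 0)]
woe_dt <- dt_train[
  , .(woe = log((sum(target == 1) / n1) / (sum(target == 0) / n0))),
  by = x1
]

# Put both encodings side by side; they differ by the constant log(n1 / n0).
enc <- enc[woe_dt, on = "x1"]
enc
```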

I will send you an example using this same data frame.

Thanks again!
Carlos.

Thanks,
Carlos.

About multiple cols, I'm not quite sure how to implement it.

  1. Are multiple cols encoded according to the same target col and the same functions?
    Params would be cols_to_encode, target_col, functions

  2. Is one col encoded by multiple target cols and functions?
    Params would be col_to_encode, target_cols, functions

  3. Multiple cols could be encoded by multiple target cols (specific to each col to encode) and functions (specific to each col to encode)
    Params would be
    list(col_to_encode_1 = list(target_cols = list("col1", "col2"), functions = list("mean", "sum")))

This last case seems really complicated to use.

OK. Sorry I did not explain it correctly.

  • I was just referring to the case where you indicate several categorical columns to encode against the same target.

  • In the example we encoded "x1" and "x2" sequentially. The idea is to provide a list of the columns to encode plus the target, and internally the function would loop over each variable to produce the "target_mean_by_x1" and "target_mean_by_x2" columns.
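
The loop described above could look roughly like this. Note that encode_cols_by_target is a hypothetical helper written for illustration in plain data.table; it is not the dataPreparation implementation, and only the output column naming follows the target_mean_by_x1 pattern seen earlier in the thread.

```r
library(data.table)

# Hypothetical helper: mean-encode several columns against one target.
encode_cols_by_target <- function(dt_train, dt_test, cols_to_encode, target_col) {
  # Work on copies so the caller's tables are not modified by reference.
  dt_train <- copy(dt_train)
  dt_test  <- copy(dt_test)
  for (col_name in cols_to_encode) {
    new_col <- paste0(target_col, "_mean_by_", col_name)
    # Learn the level -> mean(target) mapping on the training set only.
    mapping <- dt_train[, .(value = mean(get(target_col))), by = col_name]
    # Look up each row's level; levels unseen in train yield NA.
    set(dt_train, j = new_col,
        value = mapping$value[match(dt_train[[col_name]], mapping[[col_name]])])
    set(dt_test, j = new_col,
        value = mapping$value[match(dt_test[[col_name]], mapping[[col_name]])])
  }
  list(train = dt_train, test = dt_test)
}

dt_train <- data.table(
  x1     = c("a", "a", "b"),
  x2     = c("aa", "bb", "bb"),
  target = c(1, 0, 1)
)
dt_test <- data.table(x1 = c("b", "a"), x2 = c("bb", "aa"))

res <- encode_cols_by_target(dt_train, dt_test, c("x1", "x2"), "target")
res$test
```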

Thanks!

Ok, it's clearer for me.

In what kind of cases would one need that in real life?

Yes, well, these ways of encoding categorical variables are part of the feature engineering phase, and they can have a significant impact on the performance of the model.

You can consider this "mean target encoding" as an alternative to what we discussed, and which you have already implemented, for transforming categorical variables by their frequencies. But in this case, instead of doing a kind of blind encoding (you just count occurrences), you encode while taking the target variable into account.
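
To make the contrast concrete, here is a small sketch that puts the two encodings side by side on a toy table (the data and column names are assumptions for the example):

```r
library(data.table)

dt <- data.table(
  x1     = c("a", "a", "a", "b", "b", "c"),
  target = c(1, 1, 0, 0, 0, 1)
)

# Frequency encoding: ignores the target, only counts occurrences per category.
dt[, x1_freq := .N / nrow(dt), by = x1]

# Mean target encoding: summarizes the target within each category.
dt[, x1_target_mean := mean(target), by = x1]
dt
```

Here "a" is the most frequent level, while "c" has the highest target mean, so the two encodings order the categories differently.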

All these different transformations you are putting together in your package have this same goal of improving model performance.

Hey Carlos,

With a bit of delay, I have implemented both of those functionalities.

  • target_encode and build_target_encoding now accept multiple columns to encode. The way to do it is to provide a list to cols_to_encode in target_encode.
  • prepareSet calls those functions when the argument target_col is provided; target_encoding_functions can optionally be provided as well.

This will be in the next release.

I will look into your other issue before releasing on CRAN.

Emmanuel-Lin