tidymodels / themis

Extra recipes steps for dealing with unbalanced data

Home Page:https://themis.tidymodels.org/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

[New Feature]: "cube" method for balanced sampling.

coforfe opened this issue · comments

Hi Emil,

Thanks for your work on this package.

Feature

I recently discovered this balance sampling method cube (paper). I think it could be very nice to have it included here in themis.

It is already included in package BalancedSampling which offers a very fast (C++) implementation.

Thanks again,
Carlos.

Hello @coforfe 👋 Looks interesting! Could you say a little bit how this methods to be used to handle unbalanced data?

Hi Emil,

Yes, perhaps with this commented example, it is more clear.

#-------------------- Library Loading --------------
library(sampling)    # To use samplecube function - slow way
library(scorecard)   # To get germancredit data.
library(dplyr)       # Data wrangling.
library(janitor)     # To clean names.
library(recipes)     # To apply some transformations.

#------ Some checks.
#-- Target Distribution
table(germancredit$creditability)

# bad good 
# 300  700 

#------ Let's sample 300 rows from the "good" class.
#-- For demonstration purposes we will use just a subset of columns.
#-- Some numeric columns and some of the character columns.

german_good <- germancredit %>%
                clean_names() %>%
                filter(creditability == "good") %>%
                select( duration_in_month, credit_amount, age_in_years, 
                        status_of_existing_checking_account, credit_history, 
                        savings_account_and_bonds, foreign_worker, creditability) 

german_good_tr <- recipe( creditability ~ . , data = german_good) %>%
                          step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
                          prep() %>% bake( new_data = NULL) %>%
                          select(-creditability) %>%
                          clean_names()

#---- Preparation for samplecube
target_df <- as.data.frame(table(germancredit$creditability))
perc_minority <- min(target_df$Freq) / max(target_df$Freq)
prob_minority <- rep(perc_minority, nrow(german_good_tr)) 

german_good_matrix  <- as.matrix(german_good_tr)

#------------------------ SLOW APPROACH --------------
#---- Cube sampling gets back row indexes to sample.
s_slow <- samplecube( X = german_good_matrix, pik = prob_minority, order = 2, comment = FALSE, method = 1)

#---- Select subsample
german_slow <- german_good[ s_slow, ]


#--- Final sampled data.frame 
german_sampled_slow <- rbind(
                        german_slow,
                        germancredit %>% filter(creditability == "bad") %>%
                           clean_names() %>%
                           select( duration_in_month, credit_amount, age_in_years, 
                                   status_of_existing_checking_account, credit_history, 
                                   savings_account_and_bonds, foreign_worker, creditability) 
)




#--------------------- FAST APPROACH -------------
#--------------------- By using fast "BalancedSampling".
library(BalancedSampling)

#--- "cube" function gets back row indexes to sample.
s_fast <- cube( p = prob_minority, X = german_good_matrix)

#---- Select subsample
german_fast <- german_good[ s_fast, ]

#--- Final sampled data.frame 
german_sampled_fast <- rbind(
  german_fast,
  germancredit %>% filter(creditability == "bad") %>%
    clean_names() %>%
    select( duration_in_month, credit_amount, age_in_years, 
            status_of_existing_checking_account, credit_history, 
            savings_account_and_bonds, foreign_worker, creditability) 
)

Thanks,
Carlos.

That looks like it would work. I'll get back to it in a little bit, I'm working on some higher priority projects these few months.

Thanks Emil!.
Good to hear that the example clarified the case.