tidymodels / themis

Extra recipes steps for dealing with unbalanced data

Home Page: https://themis.tidymodels.org/

step_smote breaks with some over_ratios

rowanjh opened this issue

The problem

step_smote() throws an error when a class count falls marginally below the upsampling target. This happens only in the rare case where SMOTE is asked to synthesize less than one sample (n < 1).

The issue occurs in themis:::smote_impl. When samples_needed contains a number smaller than 1, themis:::smote_data returns an empty matrix, subsequently causing an error at the line out_df[var] <- data[[names(samples_needed)[i]]][[var]][1]
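
The failure mode described above can be reproduced in base R without themis. This is a minimal sketch, not the package's code: `empty_synth` is a hypothetical stand-in for the empty matrix that themis:::smote_data is reported to return, and `out_df` and `label` are illustrative names.

```r
# Hypothetical stand-in for the empty matrix returned when fewer than one
# synthetic sample is requested (zero rows, one predictor column).
empty_synth <- matrix(numeric(0), nrow = 0, ncol = 1,
                      dimnames = list(NULL, "X1"))
out_df <- as.data.frame(empty_synth)

# A single class label, analogous to taking the first element of the
# outcome column.
label <- factor("Z", levels = c("X", "Z"))

# Assigning a length-1 value to a column of a zero-row data frame fails.
result <- tryCatch(
  {
    out_df["outcome"] <- label
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(result)
```

The captured message should match the one in the reprex below ("replacement has 1 row, data has 0"), which suggests the error comes from the zero-row result rather than from the label itself.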

One quick solution might be to floor ratio_target when checking which classes need to be upsampled in themis:::smote_impl, as below. Classes that fall less than one sample short of the target would then be skipped rather than upsampled.
which_upsample <- which(table(df[[var]]) < ratio_target) # orig
which_upsample <- which(table(df[[var]]) < floor(ratio_target)) # possible solution
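
To see why only over_ratio = 0.5 triggers the problem for the class counts in the reprex, the arithmetic can be sketched in base R. This assumes that ratio_target is the majority class count multiplied by over_ratio and that samples_needed is the gap between ratio_target and the class count; `shortfall()` and its `use_floor` argument are hypothetical helpers for illustration, not part of themis.

```r
# Class counts from the reproducible example: 101 "X" and 50 "Z".
outcome <- c(rep("X", 101), rep("Z", 50))
counts <- table(outcome)

# Hypothetical helper: how many samples each under-target class is short by.
# Assumption: ratio_target = majority class count * over_ratio.
shortfall <- function(over_ratio, use_floor = FALSE) {
  ratio_target <- max(counts) * over_ratio
  # use_floor = TRUE applies the proposed fix to the comparison only.
  threshold <- if (use_floor) floor(ratio_target) else ratio_target
  which_upsample <- which(counts < threshold)
  ratio_target - counts[which_upsample]
}

shortfall(0.49)                   # no class is below the target of 49.49
shortfall(0.50)                   # "Z" is short by 0.5, less than one sample
shortfall(0.51)                   # "Z" is short by about 1.51 samples
shortfall(0.50, use_floor = TRUE) # with the floor, "Z" is no longer selected
```

Under these assumptions the fractional shortfall appears only when the target lands strictly between a class count and the next integer, which is why neighbouring ratios such as 0.49 and 0.51 work.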

Reproducible example

library(recipes)
library(themis)
set.seed(123)

dat <- data.frame(
    outcome = c(rep("X", 101), rep("Z", 50)),
    X1 = rnorm(151))

rec_49 <- recipe(outcome ~ ., data = head(dat)) |>
    step_smote(outcome, over_ratio = 0.49, seed = 231)
rec_50 <- recipe(outcome ~ ., data = head(dat)) |>
    step_smote(outcome, over_ratio = 0.5, seed = 231)
rec_51 <- recipe(outcome ~ ., data = head(dat)) |>
    step_smote(outcome, over_ratio = 0.51, seed = 231)

prep(rec_49, dat) # works
prep(rec_51, dat) # works
prep(rec_50, dat) # does not work
#> Error in `step_smote()`:
#> Caused by error in `[<-.data.frame`:
#> ! replacement has 1 row, data has 0

#> Backtrace:
#>      x
#>   1. +-recipes::prep(rec_50, dat)
#>   2. +-recipes:::prep.recipe(rec_50, dat)
#>   3. | +-recipes:::recipes_error_context(...)
#>   4. | | +-base::withCallingHandlers(...)
#>   5. | | \-base::force(expr)
#>   6. | +-recipes::bake(x$steps[[i]], new_data = training)
#>   7. | \-themis:::bake.step_smote(x$steps[[i]], new_data = training)
#>   8. |   +-withr::with_seed(...)
#>   9. |   | \-withr::with_preserve_seed(...)
#>  10. |   \-themis:::smote_impl(...)
#>  11. |     +-base::`[<-`(`*tmp*`, var, value = `<fct>`)
#>  12. |     \-base::`[<-.data.frame`(`*tmp*`, var, value = `<fct>`)
#>  13. |       \-base::stop(...)
#>  14. \-base::.handleSimpleError(...)
#>  15.   \-recipes (local) h(simpleError(msg, call))
#>  16.     \-recipes:::stop_recipes_step(call = call(step_name), parent = cnd)
#>  17.       \-recipes:::stop_recipes(...)
#>  18.         \-rlang::abort(...)

Created on 2023-02-08 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_Australia.1252
#>  ctype    English_Australia.1252
#>  tz       Europe/Berlin
#>  date     2023-02-08
#>  pandoc   2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package      * version    date (UTC) lib source
#>  class          7.3-19     2021-05-03 [2] CRAN (R 4.1.2)
#>  cli            3.6.0      2023-01-09 [1] CRAN (R 4.1.3)
#>  codetools      0.2-18     2020-11-04 [2] CRAN (R 4.1.2)
#>  crayon         1.5.1      2022-03-26 [1] CRAN (R 4.1.3)
#>  digest         0.6.29     2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr        * 1.0.8      2022-02-08 [1] CRAN (R 4.1.3)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.1.2)
#>  evaluate       0.15       2022-02-18 [1] CRAN (R 4.1.2)
#>  fansi          1.0.3      2022-03-24 [1] CRAN (R 4.1.3)
#>  fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.1.2)
#>  fs             1.5.2      2021-12-08 [1] CRAN (R 4.1.2)
#>  future         1.26.1     2022-05-27 [1] CRAN (R 4.1.3)
#>  future.apply   1.8.1      2021-08-10 [1] CRAN (R 4.1.2)
#>  generics       0.1.2      2022-01-31 [1] CRAN (R 4.1.2)
#>  globals        0.15.0     2022-05-09 [1] CRAN (R 4.1.3)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.1.3)
#>  gower          1.0.0      2022-02-03 [1] CRAN (R 4.1.2)
#>  hardhat        1.2.0      2022-06-30 [1] CRAN (R 4.1.3)
#>  highr          0.9        2021-04-16 [1] CRAN (R 4.1.2)
#>  htmltools      0.5.2      2021-08-25 [1] CRAN (R 4.1.2)
#>  ipred          0.9-12     2021-09-15 [1] CRAN (R 4.1.2)
#>  knitr          1.38       2022-03-25 [1] CRAN (R 4.1.3)
#>  lattice        0.20-45    2021-09-22 [2] CRAN (R 4.1.2)
#>  lava           1.6.10     2021-09-02 [1] CRAN (R 4.1.2)
#>  lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.1.3)
#>  listenv        0.8.0      2019-12-05 [1] CRAN (R 4.1.2)
#>  lubridate      1.8.0      2021-10-07 [1] CRAN (R 4.1.2)
#>  magrittr       2.0.2      2022-01-26 [1] CRAN (R 4.1.2)
#>  MASS           7.3-56     2022-03-23 [1] CRAN (R 4.1.3)
#>  Matrix         1.5-3      2022-11-11 [1] CRAN (R 4.1.3)
#>  nnet           7.3-16     2021-05-03 [2] CRAN (R 4.1.2)
#>  parallelly     1.32.0     2022-06-07 [1] CRAN (R 4.1.2)
#>  pillar         1.7.0      2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.1.2)
#>  prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.1.2)
#>  purrr          0.3.4      2020-04-17 [1] CRAN (R 4.1.2)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.1.3)
#>  R.methodsS3    1.8.1      2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo           1.24.0     2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils        2.11.0     2021-09-26 [1] CRAN (R 4.1.3)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.1.2)
#>  RANN           2.6.1      2019-01-08 [1] CRAN (R 4.1.3)
#>  Rcpp           1.0.8.3    2022-03-17 [1] CRAN (R 4.1.3)
#>  recipes      * 1.0.4      2023-01-11 [1] CRAN (R 4.1.3)
#>  reprex         2.0.2      2022-08-17 [1] CRAN (R 4.1.3)
#>  rlang          1.0.6      2022-09-24 [1] CRAN (R 4.1.3)
#>  rmarkdown      2.11       2021-09-14 [1] CRAN (R 4.1.2)
#>  ROSE           0.0-4      2021-06-14 [1] CRAN (R 4.1.3)
#>  rpart          4.1-15     2019-04-12 [2] CRAN (R 4.1.2)
#>  rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.1.2)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi        1.7.6      2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr        1.4.0      2019-02-10 [1] CRAN (R 4.1.2)
#>  styler         1.7.0      2022-03-13 [1] CRAN (R 4.1.3)
#>  survival       3.2-13     2021-08-24 [2] CRAN (R 4.1.2)
#>  themis       * 1.0.0      2022-07-02 [1] CRAN (R 4.1.3)
#>  tibble         3.1.7      2022-05-03 [1] CRAN (R 4.1.3)
#>  tidyr          1.2.0      2022-02-01 [1] CRAN (R 4.1.2)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.1.3)
#>  timeDate       3043.102   2018-02-21 [1] CRAN (R 4.1.2)
#>  utf8           1.2.2      2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs          0.5.2      2023-01-23 [1] CRAN (R 4.1.3)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun           0.29       2021-12-14 [1] CRAN (R 4.1.2)
#>  yaml           2.2.2      2022-01-25 [1] CRAN (R 4.1.2)
#> 
#>  [1] C:/Users/rowan/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.2/library
#> 
#> ------------------------------------------------------------------------------
