step_smote breaks with some over_ratios
rowanjh opened this issue · comments
The problem
`step_smote` throws an error when a class is marginally below the target upsampling threshold. This only occurs in rare cases, when SMOTE is asked to synthesize a fraction of a sample (n < 1).
The issue occurs in `themis:::smote_impl`. When `samples_needed` contains a number smaller than 1, `themis:::smote_data` returns an empty matrix, subsequently causing an error at the line `out_df[var] <- data[[names(samples_needed)[i]]][[var]][1]`.
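The arithmetic behind the failure can be sketched in base R with the class counts from the reproducible example. This is a rough illustration rather than the package source: it assumes `ratio_target` is the majority class count multiplied by `over_ratio`, and `shortfall()` is a hypothetical helper named only for this example.

```r
# Class counts from the reproducible example: 101 "X" and 50 "Z".
counts <- table(c(rep("X", 101), rep("Z", 50)))

# Hypothetical helper: how many samples each under-target class is short by.
# Assumption: the target is the majority count scaled by over_ratio.
shortfall <- function(over_ratio) {
  ratio_target <- max(counts) * over_ratio
  which_upsample <- which(counts < ratio_target)
  ratio_target - counts[which_upsample]
}

shortfall(0.49)  # target 49.49: no class is below it, nothing to synthesize
shortfall(0.50)  # target 50.5: "Z" is short by 0.5, a fraction of one sample
shortfall(0.51)  # target 51.51: "Z" is short by 1.51, at least one full sample
```

Only the 0.5 case leaves a class short by less than one sample, which matches the pattern of which recipes prep successfully.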
One quick solution might be to `floor` the `ratio_target` when checking which classes need to be upsampled in `themis:::smote_impl`, as below. This would ignore classes that are only fractionally below the target threshold.
which_upsample <- which(table(df[[var]]) < ratio_target) # orig
which_upsample <- which(table(df[[var]]) < floor(ratio_target)) # possible solution
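The effect of the two comparisons can be checked outside the package. The sketch below again assumes `ratio_target` is the majority count times `over_ratio`; `needs_upsampling()` and its `use_floor` argument are hypothetical names used only for illustration.

```r
# Class counts from the reproducible example: 101 "X" and 50 "Z".
counts <- table(c(rep("X", 101), rep("Z", 50)))

# Hypothetical helper: names of the classes selected for upsampling,
# with or without flooring the target first.
needs_upsampling <- function(over_ratio, use_floor) {
  ratio_target <- max(counts) * over_ratio  # assumed definition of the target
  threshold <- if (use_floor) floor(ratio_target) else ratio_target
  names(which(counts < threshold))
}

needs_upsampling(0.50, use_floor = FALSE)  # "Z" is selected (50 < 50.5)
needs_upsampling(0.50, use_floor = TRUE)   # nothing selected (50 is not < 50)
needs_upsampling(0.51, use_floor = TRUE)   # "Z" is still selected (50 < 51)
```

With the floored threshold, a class is selected only when it is at least one whole sample short, so the fractional case is skipped while larger shortfalls are still handled.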
Reproducible example
library(recipes)
library(themis)
set.seed(123)
dat <- data.frame(
  outcome = c(rep("X", 101), rep("Z", 50)),
  X1 = rnorm(151)
)
rec_49 <- recipe(outcome ~ ., data = head(dat)) |>
  step_smote(outcome, over_ratio = 0.49, seed = 231)
rec_50 <- recipe(outcome ~ ., data = head(dat)) |>
  step_smote(outcome, over_ratio = 0.5, seed = 231)
rec_51 <- recipe(outcome ~ ., data = head(dat)) |>
  step_smote(outcome, over_ratio = 0.51, seed = 231)
prep(rec_49, dat) # works
prep(rec_51, dat) # works
prep(rec_50, dat) # does not work
#> Error in `step_smote()`:
#> Caused by error in `[<-.data.frame`:
#> ! replacement has 1 row, data has 0
#> Backtrace:
#> x
#> 1. +-recipes::prep(rec_50, dat)
#> 2. +-recipes:::prep.recipe(rec_50, dat)
#> 3. | +-recipes:::recipes_error_context(...)
#> 4. | | +-base::withCallingHandlers(...)
#> 5. | | \-base::force(expr)
#> 6. | +-recipes::bake(x$steps[[i]], new_data = training)
#> 7. | \-themis:::bake.step_smote(x$steps[[i]], new_data = training)
#> 8. | +-withr::with_seed(...)
#> 9. | | \-withr::with_preserve_seed(...)
#> 10. | \-themis:::smote_impl(...)
#> 11. | +-base::`[<-`(`*tmp*`, var, value = `<fct>`)
#> 12. | \-base::`[<-.data.frame`(`*tmp*`, var, value = `<fct>`)
#> 13. | \-base::stop(...)
#> 14. \-base::.handleSimpleError(...)
#> 15. \-recipes (local) h(simpleError(msg, call))
#> 16. \-recipes:::stop_recipes_step(call = call(step_name), parent = cnd)
#> 17. \-recipes:::stop_recipes(...)
#> 18. \-rlang::abort(...)
Created on 2023-02-08 with reprex v2.0.2
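The error message itself can be reproduced in base R without themis. In this minimal sketch, a zero-row data frame stands in for the empty result described above; the column names `X1` and `outcome` are taken from the example, and everything else is an assumption made for illustration.

```r
# Stand-in for the empty result: one predictor column and zero rows.
out_df <- data.frame(X1 = numeric(0))

# Adding a length-one factor as a new column fails because there are no rows to fill.
msg <- tryCatch(
  {
    out_df["outcome"] <- factor("Z")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

The captured message should match the `replacement has 1 row, data has 0` line in the backtrace above.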
Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.1.2 (2021-11-01)
#> os Windows 10 x64 (build 19044)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_Australia.1252
#> ctype English_Australia.1252
#> tz Europe/Berlin
#> date 2023-02-08
#> pandoc 2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#> package * version date (UTC) lib source
#> class 7.3-19 2021-05-03 [2] CRAN (R 4.1.2)
#> cli 3.6.0 2023-01-09 [1] CRAN (R 4.1.3)
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.2)
#> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.1.3)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.2)
#> dplyr * 1.0.8 2022-02-08 [1] CRAN (R 4.1.3)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.2)
#> evaluate 0.15 2022-02-18 [1] CRAN (R 4.1.2)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.3)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.2)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
#> future 1.26.1 2022-05-27 [1] CRAN (R 4.1.3)
#> future.apply 1.8.1 2021-08-10 [1] CRAN (R 4.1.2)
#> generics 0.1.2 2022-01-31 [1] CRAN (R 4.1.2)
#> globals 0.15.0 2022-05-09 [1] CRAN (R 4.1.3)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.3)
#> gower 1.0.0 2022-02-03 [1] CRAN (R 4.1.2)
#> hardhat 1.2.0 2022-06-30 [1] CRAN (R 4.1.3)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.2)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.2)
#> ipred 0.9-12 2021-09-15 [1] CRAN (R 4.1.2)
#> knitr 1.38 2022-03-25 [1] CRAN (R 4.1.3)
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.1.2)
#> lava 1.6.10 2021-09-02 [1] CRAN (R 4.1.2)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.1.3)
#> listenv 0.8.0 2019-12-05 [1] CRAN (R 4.1.2)
#> lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.1.2)
#> magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.2)
#> MASS 7.3-56 2022-03-23 [1] CRAN (R 4.1.3)
#> Matrix 1.5-3 2022-11-11 [1] CRAN (R 4.1.3)
#> nnet 7.3-16 2021-05-03 [2] CRAN (R 4.1.2)
#> parallelly 1.32.0 2022-06-07 [1] CRAN (R 4.1.2)
#> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.2)
#> prodlim 2019.11.13 2019-11-17 [1] CRAN (R 4.1.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.2)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.1.3)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.1)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.1)
#> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.1.3)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.2)
#> RANN 2.6.1 2019-01-08 [1] CRAN (R 4.1.3)
#> Rcpp 1.0.8.3 2022-03-17 [1] CRAN (R 4.1.3)
#> recipes * 1.0.4 2023-01-11 [1] CRAN (R 4.1.3)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.1.3)
#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.1.3)
#> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.2)
#> ROSE 0.0-4 2021-06-14 [1] CRAN (R 4.1.3)
#> rpart 4.1-15 2019-04-12 [2] CRAN (R 4.1.2)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.2)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.2)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.1.3)
#> survival 3.2-13 2021-08-24 [2] CRAN (R 4.1.2)
#> themis * 1.0.0 2022-07-02 [1] CRAN (R 4.1.3)
#> tibble 3.1.7 2022-05-03 [1] CRAN (R 4.1.3)
#> tidyr 1.2.0 2022-02-01 [1] CRAN (R 4.1.2)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.1.3)
#> timeDate 3043.102 2018-02-21 [1] CRAN (R 4.1.2)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.2)
#> vctrs 0.5.2 2023-01-23 [1] CRAN (R 4.1.3)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.3)
#> xfun 0.29 2021-12-14 [1] CRAN (R 4.1.2)
#> yaml 2.2.2 2022-01-25 [1] CRAN (R 4.1.2)
#>
#> [1] C:/Users/rowan/Documents/R/win-library/4.1
#> [2] C:/Program Files/R/R-4.1.2/library
#>
#> ------------------------------------------------------------------------------