tidymodels / themis

Extra recipes steps for dealing with unbalanced data

Home Page: https://themis.tidymodels.org/

Reproducibility using SMOTE

rmurphy49 opened this issue

Apologies if this is more of a question than a feature request. I am using SMOTE to generate synthetic data for several omics data types measured on the same set of patients/samples. I apply SMOTE to each data type individually with the same seed, then bind the results into a single larger feature set for further analysis. A simple example is below.

My question is whether using the same seed each time makes SMOTE use the same samples for synthetic data generation, and thus create corresponding synthetic samples across the omics data types. For example, would synthetic sample 1 be generated from the same source sample and neighbors (say, samples 1, 4, and 7) in every data type, provided the data sets list their samples in the same order?

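library(themis)

# Same seed before each call so the random draws match across data types
set.seed(1)
trainGE <- smote(gene_exp, "Class")

set.seed(1)
trainClinical <- smote(clinical_info, "Class")

# Note: both results contain a "Class" column, so cbind() keeps two copies
new_balanced_data <- cbind(trainGE, trainClinical)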

Setting the seed when using smote() affects three things:

  • Which observations are smoted. If over_ratio is set to a value that requires 100 observations to be smoted but the class has 250 observations, then 100 are randomly sampled from the 250.
  • Which of the k nearest neighbors is used.
  • How far along the line between the observation and the selected neighbor the new observation is placed.

So if the number of observations is equal and the same k is set for both, it will select the same observations to smote on across the two data sets. It doesn't guarantee that the nearest neighbors match, however: neighbors are calculated within each data set separately, and you would need the full data to calculate them correctly.
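
To make the distinction concrete, here is a minimal base R sketch. It is not the themis implementation: toy_smote() is a hypothetical helper that simply makes the three random draws listed above (in an order chosen for illustration), and omics_a and omics_b are made-up matrices standing in for two data types measured on the same 20 samples.

```r
# Toy illustration only; this is NOT how themis implements smote().
# toy_smote() is a hypothetical helper written for this example.
toy_smote <- function(x, n_new, k) {
  x <- as.matrix(x)
  d <- as.matrix(dist(x))   # pairwise Euclidean distances within this data set
  diag(d) <- Inf            # a point is never its own neighbor

  base <- sample(nrow(x), n_new, replace = TRUE)  # 1. which observations to smote
  rank <- sample(k, n_new, replace = TRUE)        # 2. which of the k neighbors
  gap  <- runif(n_new)                            # 3. position between the two points

  # Resolve "the rank-th nearest neighbor" using this data set's own distances
  nbr <- vapply(
    seq_len(n_new),
    function(i) order(d[base[i], ])[rank[i]],
    integer(1)
  )

  # Interpolate between each base observation and its chosen neighbor
  x_base <- x[base, , drop = FALSE]
  x_nbr  <- x[nbr, , drop = FALSE]
  synthetic <- x_base + gap * (x_nbr - x_base)

  list(base = base, rank = rank, gap = gap, neighbor = nbr, synthetic = synthetic)
}

# Made-up data: the same 20 samples, measured with different features
set.seed(42)
omics_a <- matrix(rnorm(20 * 3), nrow = 20)
omics_b <- matrix(rnorm(20 * 5), nrow = 20)

set.seed(1)
res_a <- toy_smote(omics_a, n_new = 10, k = 5)
set.seed(1)
res_b <- toy_smote(omics_b, n_new = 10, k = 5)

# The random draws match because the seed, row count, and k match
identical(res_a$base, res_b$base)
identical(res_a$rank, res_b$rank)
identical(res_a$gap,  res_b$gap)

# The neighbor each draw resolves to depends on each data set's distances,
# so these row indices are not guaranteed to agree
res_a$neighbor
res_b$neighbor
```

In this sketch the base observations, neighbor ranks, and interpolation fractions agree, but "the third-nearest neighbor of sample 4" can be a different sample in each data set. If matching neighbors matter, one option (an assumption, not something tested here) is to bind the feature sets first and call smote() once on the combined data.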
