Process original datasets to create static train/test splits

johnantonn opened this issue 2 years ago

Ioannis Antoniadis commented 2 years ago

Incorporate the datasets marked below after preprocessing them. The process should be as follows:

Download the original version of each dataset: unnormalized and without duplicates.
Load each dataset, subsample it, split it into train and test sets, fit the normalization on the training set and apply it to the test set, and finally save the results as files.
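The steps above could be sketched roughly as follows. This is a minimal illustration using only NumPy, not the code actually used for the experiments. The function names (`stratified_indices`, `preprocess`, `save_split`), the z-score normalization, the 0/1 label convention, and the `.npz` output format are all assumptions; the 5000-point cap and the 0.7 train fraction are defaults chosen to match the 3500/1500 sizes reported later in this thread.

```python
import numpy as np


def stratified_indices(y, fraction, rng):
    """Pick roughly `fraction` of the indices of each class, without replacement."""
    picked = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        picked.append(idx[: int(round(fraction * len(idx)))])
    return np.sort(np.concatenate(picked))


def preprocess(X, y, max_points=5000, train_fraction=0.7, seed=0):
    """Subsample, split with stratification, and normalize using train statistics."""
    rng = np.random.default_rng(seed)

    # 1. Subsample to about max_points, keeping the class proportions.
    if len(y) > max_points:
        keep = stratified_indices(y, max_points / len(y), rng)
        X, y = X[keep], y[keep]

    # 2. Stratified train/test split.
    train_idx = stratified_indices(y, train_fraction, rng)
    test_mask = np.ones(len(y), dtype=bool)
    test_mask[train_idx] = False
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_mask], y[test_mask]

    # 3. Fit the normalization on train only, then apply it to both sets.
    #    (z-score is an assumption; the thread only says "normalize".)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std
    return X_train, y_train, X_test, y_test


def save_split(path, X_train, y_train, X_test, y_test):
    """Save one dataset's static split to a single compressed .npz file."""
    np.savez_compressed(
        path, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test
    )
```

Fixing the seed keeps the splits static across runs. Because rounding is done per class, the subsampled total can differ from `max_points` by a point or two on some datasets.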

Ioannis Antoniadis · Answer 1 · Mon Mar 28 2022 21:04:45 GMT+0800 (China Standard Time)

Datasets are subsampled to 5000 points and split into train/test sets with stratification, using train_size = 70% (test size = 30%):

Training set: 3500 points or fewer
Test set: 1500 points or fewer

Details about the training sets (from which the validation sets will be generated):

| Dataset          | Total | Normal | Outliers |
|------------------|-------|--------|----------|
| ALOI             | 3500  | 3383   | 117      |
| Annthyroid       | 3500  | 3232   | 268      |
| Waveform         | 2410  | 2340   | 70       |
| Cardiotocography | 1479  | 1153   | 326      |
| PageBlocks       | 3500  | 3171   | 329      |
| SpamBase         | 2944  | 1769   | 1175     |
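
The counts above also give the outlier fraction of each training set, which many outlier detectors take as a contamination parameter. A small sketch of that arithmetic follows; the dictionary and function names are illustrative only, and the numbers are copied from the list above.

```python
# (normal, outliers) counts per training set, copied from the list above.
TRAIN_COUNTS = {
    "ALOI": (3383, 117),
    "Annthyroid": (3232, 268),
    "Waveform": (2340, 70),
    "Cardiotocography": (1153, 326),
    "PageBlocks": (3171, 329),
    "SpamBase": (1769, 1175),
}


def outlier_fraction(normal, outliers):
    """Fraction of points in a set that are labeled as outliers."""
    return outliers / (normal + outliers)


if __name__ == "__main__":
    for name, (normal, outliers) in TRAIN_COUNTS.items():
        total = normal + outliers
        rate = outlier_fraction(normal, outliers)
        print(f"{name:<18} total={total:<5} outlier fraction={rate:.2%}")
```

The fractions range from roughly 3% (ALOI, Waveform) to about 40% (SpamBase), so a single fixed contamination value would not fit all six sets.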

Ioannis Antoniadis · Answer 2 · Tue Apr 05 2022 03:34:07 GMT+0800 (China Standard Time)

Remove the PageBlocks dataset from the experiments, since it is not useful (judging by the results it produces), and include the KDDCUP99 dataset, subsampling only its normal class.
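
One possible reading of "subsampling only the normal class" is to keep every outlier and draw a random subset of the normal points so that the total fits a size budget. The sketch below follows that reading. The function name, the label convention (0 = normal, 1 = outlier), and the 5000-point budget (borrowed from the earlier comment) are assumptions, not details confirmed in this thread.

```python
import numpy as np


def subsample_normal_class(X, y, max_points=5000, seed=0):
    """Keep all outliers (y == 1); randomly subsample normals (y == 0) to fit max_points."""
    rng = np.random.default_rng(seed)
    outlier_idx = np.flatnonzero(y == 1)
    normal_idx = np.flatnonzero(y == 0)

    # Number of normal points that still fit after keeping every outlier.
    n_normal = min(len(normal_idx), max(max_points - len(outlier_idx), 0))
    kept_normal = rng.choice(normal_idx, size=n_normal, replace=False)

    # Sort so the original row order is preserved.
    keep = np.sort(np.concatenate([outlier_idx, kept_normal]))
    return X[keep], y[keep]
```

Note that this raises the outlier fraction relative to the full dataset, since only the normal class shrinks.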