johnantonn / cash-for-unsupervised-ad

Systematic Evaluation of CASH Search Strategies for Unsupervised Anomaly Detection

Process original datasets to create static train/test splits

johnantonn opened this issue

Incorporate the datasets marked below after preprocessing them. The process is as follows:

  • Download the original versions of each dataset, unnormalized, without duplicates.
  • Load, subsample, split into train and test sets, fit the normalization on the train set and apply it to the test set, and finally save the results as files.
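
The steps above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the helper names (`subsample`, `stratified_split`, `preprocess`, `save_split`) are hypothetical, the issue does not say which normalization is used (min-max scaling is assumed here), random rather than stratified subsampling is assumed, and the `.npz` output format is also an assumption. Labels are assumed to be 0 for normal points and 1 for outliers.

```python
import numpy as np

def subsample(X, y, max_points=5000, seed=0):
    """Randomly keep at most max_points rows, without replacement."""
    if len(X) <= max_points:
        return X, y
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(X), size=max_points, replace=False)
    return X[keep], y[keep]

def stratified_split(y, test_size=0.3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(len(idx) * test_size))
        test_idx.append(idx[:n_test])
        train_idx.append(idx[n_test:])
    return np.sort(np.concatenate(train_idx)), np.sort(np.concatenate(test_idx))

def preprocess(X, y, max_points=5000, test_size=0.3, seed=0):
    """Subsample, split, then min-max scale using train statistics only."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X, y = subsample(X, y, max_points, seed)
    tr, te = stratified_split(y, test_size, seed)
    lo = X[tr].min(axis=0)
    span = X[tr].max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on constant features
    # Test values may fall outside [0, 1]; the scaler never sees test data.
    return (X[tr] - lo) / span, y[tr], (X[te] - lo) / span, y[te]

def save_split(path, X_train, y_train, X_test, y_test):
    """Store the static split in one compressed .npz file (assumed format)."""
    np.savez_compressed(path, X_train=X_train, y_train=y_train,
                        X_test=X_test, y_test=y_test)
```

Fitting the scaler on the train set alone keeps test-set information out of the preprocessing, which is the point of "normalize on train and apply to test".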

[Image: Inkeddatasets_LI]

Datasets are subsampled to at most 5000 points and split into train/test with a stratified split, train_size = 70% (test_size = 30%):

  • Training set: 3500 points or fewer
  • Test set: 1500 points or fewer

Details about the training sets (from which the validation sets will be generated):

  • ALOI
    Total: 3500
    Normal: 3383
    Outliers: 117

  • Annthyroid
    Total: 3500
    Normal: 3232
    Outliers: 268

  • Waveform
    Total: 2410
    Normal: 2340
    Outliers: 70

  • Cardiotocography
    Total: 1479
    Normal: 1153
    Outliers: 326

  • PageBlocks
    Total: 3500
    Normal: 3171
    Outliers: 329

  • SpamBase
    Total: 2944
    Normal: 1769
    Outliers: 1175
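
As a quick sanity check on the counts listed above, the outlier fraction (contamination) of each training set can be computed directly. The numbers are copied from the list; the `contamination` helper is only an illustrative name.

```python
# Training-set counts copied from the list above: (normal, outliers).
train_counts = {
    "ALOI": (3383, 117),
    "Annthyroid": (3232, 268),
    "Waveform": (2340, 70),
    "Cardiotocography": (1153, 326),
    "PageBlocks": (3171, 329),
    "SpamBase": (1769, 1175),
}

def contamination(normal, outliers):
    """Fraction of outliers among all training points."""
    return outliers / (normal + outliers)

for name, (normal, outliers) in train_counts.items():
    total = normal + outliers
    rate = contamination(normal, outliers)
    print(f"{name:<17} total={total:<5} outliers={rate:.2%}")
```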

Remove the PageBlocks dataset from the experiments, since the results so far show it is not useful, and include the KDDCUP99 dataset, subsampling only its normal class.
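
One possible reading of "subsampling only the normal class" is to keep every outlier and draw normal points at random until the 5000-point budget is reached. The sketch below follows that assumption. The function name `subsample_normal_only` is hypothetical, and labels are assumed to be 0 for normal points and 1 for outliers.

```python
import numpy as np

def subsample_normal_only(X, y, max_points=5000, seed=0):
    """Keep all outliers (y == 1); randomly subsample normal points (y == 0)."""
    rng = np.random.default_rng(seed)
    outlier_idx = np.flatnonzero(y == 1)
    normal_idx = np.flatnonzero(y == 0)
    # Budget left for normal points once every outlier is kept.
    n_normal = min(len(normal_idx), max(max_points - len(outlier_idx), 0))
    keep_normal = rng.choice(normal_idx, size=n_normal, replace=False)
    keep = np.sort(np.concatenate([outlier_idx, keep_normal]))
    return X[keep], y[keep]
```

Compared with uniform subsampling, this raises the outlier fraction of the reduced dataset, which may matter when outliers are very rare in the original data.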