johnantonn / cash-for-unsupervised-ad

Systematic Evaluation of CASH Search Strategies for Unsupervised Anomaly Detection

AutoML

johnantonn opened this issue · comments

The discipline of interest here is Combined Algorithm Selection and Hyperparameter Optimization (CASH), rather than meta-learning or AutoML, both of which are more general. There are several approaches to it:

  • Black box optimization techniques:
    • Grid Search
    • Random Search
    • Bayesian Optimization
  • Multi-fidelity techniques:
    • Bandits
  • ..
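
As a rough illustration of the black-box setting, the sketch below runs random search over a joint space of algorithm choice and hyperparameters. The two toy detectors (`zscore_scores`, `knn_scores`), the search space, and the labelled evaluation data are all made up for illustration and use only the Python standard library; they are not taken from any of the tools discussed in this issue.

```python
import random
import statistics

def zscore_scores(data, center="mean"):
    """Toy detector: score = absolute deviation from the centre, in std units."""
    mid = statistics.mean(data) if center == "mean" else statistics.median(data)
    spread = statistics.pstdev(data) or 1.0
    return [abs(x - mid) / spread for x in data]

def knn_scores(data, k=3):
    """Toy detector: score = distance to the k-th nearest other point."""
    scores = []
    for i, x in enumerate(data):
        dists = sorted(abs(x - y) for j, y in enumerate(data) if j != i)
        scores.append(dists[min(k, len(dists)) - 1])
    return scores

def roc_auc(labels, scores):
    """Fraction of (anomaly, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical CASH space: each algorithm carries its own hyperparameter grid.
SEARCH_SPACE = {
    "zscore": (zscore_scores, {"center": ["mean", "median"]}),
    "knn": (knn_scores, {"k": [1, 2, 3, 5, 8]}),
}

def random_search(data, labels, n_iter=20, seed=0):
    """Sample (algorithm, hyperparameters) jointly and keep the best AUC."""
    rng = random.Random(seed)
    best = (-1.0, None, None)
    for _ in range(n_iter):
        name = rng.choice(sorted(SEARCH_SPACE))
        detector, grid = SEARCH_SPACE[name]
        params = {key: rng.choice(values) for key, values in grid.items()}
        score = roc_auc(labels, detector(data, **params))
        if score > best[0]:
            best = (score, name, params)
    return best

# Made-up 1-D data: the points 5.0 and -3.0 are the labelled anomalies.
data = [1.0, 1.1, 0.9, 1.2, 1.05, 0.95, 5.0, 1.15, -3.0, 1.0]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
best_score, best_name, best_params = random_search(data, labels)
print(best_name, best_params, best_score)
```

Grid search would replace the sampling step with an exhaustive loop over the same space, and Bayesian optimization would replace it with a surrogate model that proposes the next configuration.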

The problem:

  • CASH: Combined Algorithm Selection and Hyperparameter Optimization
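
For reference, CASH is commonly formalised as in the Auto-WEKA paper (Thornton et al., 2013). Given a set of algorithms $\mathcal{A} = \{A^{(1)}, \dots, A^{(R)}\}$ with associated hyperparameter spaces $\Lambda^{(1)}, \dots, \Lambda^{(R)}$, and a loss $\mathcal{L}$ averaged over $k$ cross-validation folds, the goal is to find:

```latex
A^{*}_{\lambda^{*}} \in \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}}
\frac{1}{k} \sum_{i=1}^{k}
\mathcal{L}\bigl(A^{(j)}_{\lambda},\, \mathcal{D}^{(i)}_{\text{train}},\, \mathcal{D}^{(i)}_{\text{valid}}\bigr)
```

In the unsupervised anomaly detection setting, the loss would presumably be computed only on whatever labelled validation points are available; that reading is an assumption here, not part of the original formulation.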

Two major classes of AutoML optimizers:

  • Simple optimizers: only take care of model/hyperparameter selection
  • Pipeline optimizers: may also include preprocessing components
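
The difference between the two classes can be sketched as a difference in search space. In the toy example below, the model names, grids, and preprocessor names are hypothetical placeholders; the point is only that a pipeline optimizer searches the product of preprocessing choices and model configurations.

```python
from itertools import product

# Hypothetical model space: each model has its own hyperparameter grid.
MODEL_SPACE = {
    "knn": {"k": [1, 3, 5]},
    "zscore": {"center": ["mean", "median"]},
}
# Hypothetical preprocessing choices that only a pipeline optimizer considers.
PREPROCESSORS = ["none", "standardize", "minmax"]

def simple_configs(model_space):
    """Yield (model, params) pairs: the space of a simple optimizer."""
    for model, grid in model_space.items():
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            yield model, dict(zip(keys, values))

def pipeline_configs(model_space, preprocessors):
    """Yield (preprocessor, model, params): the larger space of a pipeline optimizer."""
    for prep in preprocessors:
        for model, params in simple_configs(model_space):
            yield prep, model, params

simple = list(simple_configs(MODEL_SPACE))
pipeline = list(pipeline_configs(MODEL_SPACE, PREPROCESSORS))
print(len(simple), "simple configurations;", len(pipeline), "pipeline configurations")
```

Every added pipeline step multiplies the number of configurations, which is one reason pipeline optimizers tend to rely on sample-efficient strategies rather than exhaustive search.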

Research on implemented tools:

  • Auto-WEKA:
    • Java
    • Pipeline optimizer
    • Bayesian optimization (SMAC)
  • scikit-optimize:
    • Python, on top of scikit-learn
    • Simple optimizer
    • Bayesian optimization (Gaussian process, random forest, or gradient-boosted tree surrogates)
  • Hyperopt-sklearn
    • Python, on top of scikit-learn
    • Pipeline optimizer
    • Bayesian optimization (TPE)
  • Auto-sklearn
    • Python, on top of scikit-learn; an improvement on the Auto-WEKA methodology
    • Pipeline optimizer
    • Bayesian optimization (SMAC)
  • TPOT
    • Python, on top of scikit-learn
    • Pipeline optimizer
    • Genetic Programming (GP)
  • Hyperband
    • Python
    • Pipeline optimizer
    • Bandit-based
  • Optunity
    • Python
    • Simple optimizer
    • Includes several optimization algorithms:
      • Grid Search
      • Random Search
      • Particle Swarm Optimization
      • Nelder-Mead simplex
      • CMA-ES
      • TPE
      • Sobol sequences
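
The bandit-based strategy listed under Hyperband builds on successive halving: evaluate many configurations on a small budget, keep the best fraction, and repeat with a larger budget. The sketch below shows a single successive-halving bracket only (Hyperband itself runs several brackets with different budget trade-offs). The toy evaluator, the `quality` field, and the parameter names `min_budget` and `eta` are assumptions for illustration.

```python
import math
import random

def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the top 1/eta of the configurations each round, multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Higher score is better; each survivor is re-evaluated at the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]

def make_toy_evaluator(seed=0):
    """Hypothetical objective: true quality plus noise that shrinks as the budget grows."""
    rng = random.Random(seed)
    def evaluate(config, budget):
        return config["quality"] + rng.gauss(0.0, 0.3 / math.sqrt(budget))
    return evaluate

# 27 made-up configurations: 27 -> 9 -> 3 -> 1 with eta=3.
configs = [{"id": i, "quality": i / 26} for i in range(27)]
best = successive_halving(configs, make_toy_evaluator())
print(best)
```

In a real CASH run the budget would map to something like training-set size, number of estimators, or wall-clock time, and `evaluate` would train and score an actual model.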

Note: A major disadvantage of all the state-of-the-art AutoML optimizers (both simple and pipeline) is that they offer only a pre-defined list of models and components.

Next up for auto-sklearn experimentation:

  • Inspect the validation procedure of auto-sklearn and modify it to accommodate reduced validation sets (unlabelled points must be removed, or else it will crash)
  • Incorporate additional AD models from PyOD
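
One possible shape for the reduced-validation step is sketched below: drop the unlabelled points before computing a validation metric. The unlabelled marker (`-1`), the helper names, and the choice of ROC AUC are assumptions for illustration; this is not auto-sklearn's actual validation code. Labels follow the PyOD convention of 1 for outliers and 0 for inliers.

```python
UNLABELLED = -1  # assumed marker for points without a ground-truth label

def labelled_subset(y_true, scores, unlabelled=UNLABELLED):
    """Return (labels, scores) restricted to the points that carry a label."""
    pairs = [(y, s) for y, s in zip(y_true, scores) if y != unlabelled]
    return [y for y, _ in pairs], [s for _, s in pairs]

def roc_auc(labels, scores):
    """Fraction of (outlier, inlier) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one labelled outlier and one labelled inlier")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def reduced_validation_auc(y_true, scores):
    """Score a detector on the labelled part of the validation set only."""
    labels, kept_scores = labelled_subset(y_true, scores)
    return roc_auc(labels, kept_scores)

# Made-up validation fold: three of the eight points have no label.
y_val = [0, -1, 1, -1, 0, 1, -1, 0]
anomaly_scores = [0.1, 0.9, 0.8, 0.2, 0.3, 0.7, 0.5, 0.75]
auc = reduced_validation_auc(y_val, anomaly_scores)
print(round(auc, 3))
```

Raising an explicit error when a fold has no labelled outliers or no labelled inliers makes the failure mode visible, instead of letting the metric computation fail deeper inside the search loop.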