johnantonn / cash-for-unsupervised-ad

Systematic Evaluation of CASH Search Strategies for Unsupervised Anomaly Detection


Experimental setup

johnantonn opened this issue · comments

Data:

Algorithms:

From an initial set of datasets and algorithms, track their performance times and then set up a grid search for, e.g., 6 datasets and 6-7 algorithms.

After that, we'll have a gold standard and we can run the bandit algorithms and test their performance against it.

Older comments on grid search
Use a number of datasets for the baseline:

https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/
The idea is to apply a full grid search over these datasets and a specified number of classifiers as one baseline, and, as a second, to apply a uniform selection of each arm and return the best at the end (pure-exploration setting).

These two approaches can be used as baselines for our problem.

After a thorough search, it seems that GridSearchCV uses cross-validation and is meant for hyperparameter optimization in the case of supervised learning (classification or regression).

In the case of unsupervised learning, a brute force approach must be followed, i.e. loop over all datasets, models, and their hyper-parameter values and obtain the scoring values.

This has been tested for one dataset and KNN. It will be extended to all datasets and models (dataset-specific and model-specific code has to be put in place for that) to obtain the scores for all dataset/model combinations.
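A minimal sketch of this brute-force evaluation for a single dataset and PyOD's KNN detector is shown below; the synthetic data and the hyperparameter grid are placeholders for illustration, not the values used in the repository.

```python
from itertools import product

import numpy as np
from pyod.models.knn import KNN
from sklearn.metrics import roc_auc_score

# Placeholder data standing in for a real benchmark dataset.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))
X_test = np.vstack([rng.normal(size=(95, 5)), rng.normal(5.0, 1.0, size=(5, 5))])
y_test = np.array([0] * 95 + [1] * 5)  # 1 = anomaly

# Illustrative hyperparameter grid for KNN.
param_grid = {
    "n_neighbors": [1, 5, 10, 20, 50],
    "method": ["largest", "mean", "median"],
}

best_auc, best_params = -1.0, None
for n_neighbors, method in product(param_grid["n_neighbors"], param_grid["method"]):
    clf = KNN(n_neighbors=n_neighbors, method=method)
    clf.fit(X_train)                                              # unsupervised fit: no labels used
    auc = roc_auc_score(y_test, clf.decision_function(X_test))    # labels only used for scoring
    if auc > best_auc:
        best_auc, best_params = auc, {"n_neighbors": n_neighbors, "method": method}

print(best_params, best_auc)
```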

In a nutshell, the current logic of the code is as follows (a sketch of this loop is given after the list):

  • For each dataset:
    • Import dataset and create X_train, X_test and y_test.
    • Define models and hyperparams (this step can be moved out of the loop; right now the models and their hyperparam spaces don't depend on the dataset).
    • For each model:
      • Define the dict of all possible combinations of hyperparam values and shuffle it (for randomness).
      • Set a timer according to a predefined timeout (e.g. 2 minutes) and start fitting/predicting/evaluating the model on randomly selected sets of hyperparams from the shuffled dict. Keep track of the searched-space ratio and of all scores.
      • Determine the best score and hyperparams for each model.
    • Determine the best model, hyperparams and score for the dataset.
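The per-model part of this loop roughly corresponds to the following hedged sketch. The model families, grids and variable names are illustrative assumptions; the 2-minute timeout comes from the description above, and X_train, X_test, y_test are assumed to be loaded as in the earlier KNN sketch.

```python
import random
import time
from itertools import product

from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from sklearn.metrics import roc_auc_score

# X_train, X_test, y_test are assumed to be loaded as in the earlier sketch.

MODEL_SPACES = {  # illustrative model families and hyperparam grids
    KNN: {"n_neighbors": [5, 10, 20, 50], "method": ["largest", "mean"]},
    LOF: {"n_neighbors": [5, 10, 20, 50]},
    IForest: {"n_estimators": [50, 100, 200]},
}
TIMEOUT = 120  # seconds per model, e.g. 2 minutes

def all_combinations(grid):
    """Expand a param grid into a list of dicts, one per combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

results = {}
for model_cls, grid in MODEL_SPACES.items():
    combos = all_combinations(grid)
    random.shuffle(combos)                       # randomness in the search order
    start, scores = time.time(), []
    for params in combos:
        if time.time() - start > TIMEOUT:        # stop once the per-model budget is spent
            break
        clf = model_cls(**params).fit(X_train)
        auc = roc_auc_score(y_test, clf.decision_function(X_test))
        scores.append((auc, params))
    searched_ratio = len(scores) / len(combos)   # fraction of the space actually covered
    best_auc, best_params = max(scores, key=lambda s: s[0])
    results[model_cls.__name__] = (best_auc, best_params, searched_ratio)

# The best model for the dataset is the entry with the highest AUC.
best_for_dataset = max(results.items(), key=lambda kv: kv[1][0])
```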

Restructured the code to create scripts for the different scenarios of optimal model search:

  • grid_search.py: with a timeout per model (loops over datasets and models, sets a timeout for each model, and randomly samples hyperparam sets for each one)
  • random_search.py: with a total timeout per dataset (loops over datasets, samples a random model and a random hyperparam set for it, to find the best model instance; see the sketch after this list)
  • bandit_search.py: the bandit algorithm (TODO)
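For the random-search scenario, a possible outline of random_search.py is sketched below, assuming one overall time budget per dataset; the function name, the default budget and the use of the MODEL_SPACES dictionary from the previous sketch are illustrative assumptions.

```python
import random
import time

from sklearn.metrics import roc_auc_score

def random_search(X_train, X_test, y_test, model_spaces, budget=600):
    """Sample a random model and a random hyperparam set on every iteration
    until the per-dataset time budget (in seconds) runs out; return the best."""
    best = (-1.0, None, None)                    # (auc, model name, params)
    start = time.time()
    while time.time() - start < budget:
        model_cls, grid = random.choice(list(model_spaces.items()))  # random model
        params = {k: random.choice(v) for k, v in grid.items()}      # random hyperparams
        clf = model_cls(**params).fit(X_train)
        auc = roc_auc_score(y_test, clf.decision_function(X_test))
        if auc > best[0]:
            best = (auc, model_cls.__name__, params)
    return best
```

Called, for instance, as random_search(X_train, X_test, y_test, MODEL_SPACES, budget=600), with MODEL_SPACES defined as in the previous sketch.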

Added the following functions to the functions.py script (a rough sketch of both follows the list):

  • evaluate_model_family(): takes as arguments the model instance and the full set of hyperparams and returns the best model instance and its scores, evaluated on the training/validation sets.
  • evaluate_model_instance(): takes as arguments the model instance and a single hyperparam instance and returns a scores dictionary including arbitrary scores such as F1 and AUC.
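A hedged sketch of how these two helpers might look is given below; the exact signatures, score keys and return values in the repository may differ.

```python
from itertools import product

from sklearn.metrics import f1_score, roc_auc_score

def evaluate_model_instance(model, params, X_train, X_val, y_val):
    """Fit one model instance for a single hyperparam setting and return a
    dictionary of scores (e.g. F1 and ROC AUC) on the validation set."""
    clf = model.__class__(**params)
    clf.fit(X_train)
    return {
        "params": params,
        "roc_auc": roc_auc_score(y_val, clf.decision_function(X_val)),
        "f1": f1_score(y_val, clf.predict(X_val)),
    }

def evaluate_model_family(model, param_grid, X_train, X_val, y_val):
    """Evaluate every hyperparam combination of a model family and return the
    best-scoring configuration together with all collected scores."""
    keys = list(param_grid)
    all_scores = [
        evaluate_model_instance(model, dict(zip(keys, values)), X_train, X_val, y_val)
        for values in product(*(param_grid[k] for k in keys))
    ]
    best = max(all_scores, key=lambda s: s["roc_auc"])
    return best, all_scores
```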

Comments/additions:

The training, validation and test sets need to be constructed and exported to files, so that they can then be imported by the scripts. The models shouldn't have to recompute the same things; instead, each trained model should be saved along with its hyperparams, the y labels and scoring values for the validation set, and the execution times. That way, whenever the same model instance would need to be recomputed, these values are simply read from the external files in the appropriate format.
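One way to implement this caching is sketched below, under the assumption of a simple file-per-configuration layout; the paths, key format and scorer callback are illustrative, not the repository's actual scheme.

```python
import json
import time
from pathlib import Path

import joblib

CACHE_DIR = Path("results")
CACHE_DIR.mkdir(exist_ok=True)

def cache_key(model_name, params):
    """Build a filename-safe key from the model name and its hyperparams."""
    return model_name + "_" + "_".join(f"{k}={v}" for k, v in sorted(params.items()))

def evaluate_or_load(model_cls, params, X_train, X_val, y_val, scorer):
    """Evaluate a model instance once; on later calls, read the stored results."""
    key = cache_key(model_cls.__name__, params)
    meta_path = CACHE_DIR / f"{key}.json"
    model_path = CACHE_DIR / f"{key}.joblib"
    if meta_path.exists():                       # results already computed: just read them
        return json.loads(meta_path.read_text())
    start = time.time()
    clf = model_cls(**params).fit(X_train)
    record = {
        "params": params,
        "scores": scorer(clf, X_val, y_val),     # e.g. a dict with f1/roc_auc values
        "fit_time": time.time() - start,
    }
    joblib.dump(clf, model_path)                 # persist the trained model itself
    meta_path.write_text(json.dumps(record))
    return record
```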

Exp 1:
  • dataset size = min(dataset_size, 5000)
  • 5 datasets, 10 models
  • Measure the average time of all the different models that are going to be applied (* 200)
  • The timeout would then be a multiple of that average time + 3 * std (one possible reading is sketched below)
  • The validation set should be half the training points
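One possible reading of the timeout rule above, not taken verbatim from the repository, is sketched here; the measured times are placeholders.

```python
import numpy as np

# Placeholder per-model fit times (seconds), measured once on the dataset.
measured_fit_times = np.array([1.2, 0.8, 3.5, 2.1, 0.5])

# Assumed interpretation: timeout = 200 * average fit time + 3 * standard deviation.
timeout = 200 * measured_fit_times.mean() + 3 * measured_fit_times.std()
```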

Exp 2:
  • Choose a fixed time budget and vary the size of the validation set
  • Start with 10/20 labels
  • Increase the label budget by 20, up to 200
  • Run auto-sklearn and observe ROC AUC and F1 score on the test set (see the outline below)
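A hedged outline of this experiment is given below; run_cash_search is a hypothetical placeholder for the actual auto-sklearn-based search call, and X_train, X_val, y_val, X_test, y_test are assumed to be prepared as described earlier.

```python
from sklearn.metrics import f1_score, roc_auc_score

results = []
for label_budget in range(20, 201, 20):          # grow the label budget by 20 up to 200
    X_val_sub = X_val[:label_budget]             # labelled validation subset
    y_val_sub = y_val[:label_budget]
    best_model = run_cash_search(X_train, X_val_sub, y_val_sub, time_budget=600)  # hypothetical call
    results.append({
        "label_budget": label_budget,
        "roc_auc": roc_auc_score(y_test, best_model.decision_function(X_test)),
        "f1": f1_score(y_test, best_model.predict(X_test)),
    })
```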

Extras:
1. Logging: store models, hyperparams and scores; store the trained model or its hyperparams
2. Maybe try Google Colab if more computational power is required
3. Plot the results (present the results)

Due to significant changes in the topic and experimental setup, continuing in new issue #23