This is a repository for a one-week lab rotation about active learning with random forests. This project uses modAL.
The requirements for this project are listed in requirements.txt. See run_main.ipynb for a concrete example of how to use the code. The main entry point to the project's code is src/main.py; the bottom of that file documents the arguments it accepts. Make sure that the dataset you use is of type numpy.ndarray or scipy.sparse.csr_matrix.
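Since the code only accepts numpy.ndarray or scipy.sparse.csr_matrix inputs, a small guard like the following can normalize a dataset before passing it to src/main.py. This is an illustrative sketch; `ensure_supported` is a hypothetical helper, not part of the repository.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

def ensure_supported(X):
    """Return X as numpy.ndarray or scipy.sparse.csr_matrix, or convert it."""
    if isinstance(X, (np.ndarray, csr_matrix)):
        return X
    if issparse(X):
        # Other sparse formats (e.g. COO, CSC) are converted to CSR.
        return X.tocsr()
    # Fall back to a dense array for list-like inputs.
    return np.asarray(X)

X = ensure_supported([[0, 1], [1, 0]])
```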
Contains .json files specifying the metrics by which to evaluate the active learner.
Contains .json files specifying the learner's hyperparameter search space. Parameters are sampled from the search space using sklearn's ParameterSampler.
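The sampling step can be sketched as follows. The search-space keys below are illustrative random-forest hyperparameters, not necessarily the contents of the repository's actual .json files.

```python
import json
from sklearn.model_selection import ParameterSampler

# A JSON search space as it might appear in one of the config files.
search_space_json = '{"n_estimators": [50, 100, 200], "max_depth": [5, 10, null]}'
search_space = json.loads(search_space_json)

# Draw 4 hyperparameter combinations at random from the space.
sampler = ParameterSampler(search_space, n_iter=4, random_state=0)
for params in sampler:
    print(params)
```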
Contains .json files specifying kwargs for the chosen query strategy.
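The kwargs from such a file are typically forwarded to the query strategy with `**`. The following sketch mimics modAL's query-strategy calling convention; `uncertainty_sampling` here is a stand-in function, not the real modAL import.

```python
import json
import numpy as np

def uncertainty_sampling(classifier, X, n_instances=1, **kwargs):
    # Stand-in with modAL's signature shape: return the indices of the
    # queried samples and the samples themselves. A real strategy would
    # rank the pool by the classifier's uncertainty.
    idx = np.arange(n_instances)
    return idx, X[idx]

# Kwargs as they might be loaded from a query-strategy .json file.
qs_kwargs = json.loads('{"n_instances": 3}')

X_pool = np.arange(20).reshape(10, 2)
query_idx, query_samples = uncertainty_sampling(None, X_pool, **qs_kwargs)
```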
Directory for saving results. File names take the form "<params file name>, <query strategy config file name>, <dataset name>.json". For every hyperparameter tuple tried, a result file contains the validation scores, the mean query times, the test scores of the model with the best default validation score, and the configuration dictionary for the query strategy.
As of 28.05.2021, all results used StratifiedKFold. Results 1 and 2 were trained on my local machine, while results 3, 4, and 5 were trained on Google Colab's CPU. Although results 1 and 2 should have almost identical run times due to their very similar settings, result 1's mean query time is ~13.7 secs and result 2's is ~12.5 secs. As such, query times from my machine should appear around 1.0917918481109985 secs slower than those from Google Colab (number calculated using numpy).
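For reference, StratifiedKFold splits the data so that each fold preserves the class proportions of the full label set. This is a generic illustration of that behavior, not the repository's exact cross-validation setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced labels: 8 vs. 2

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold keeps the 4:1 class ratio, i.e. exactly one
    # minority-class sample per fold.
    print(np.bincount(y[test_idx]))
```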
Contains the source code.
Contains test code. However, test_main.py was not actually used, as I ran into problems setting up the imports.
Provides a concrete example of how to use the code.