This repository contains the code for my thesis "Empirical Evaluation of Machine Learning Models and Ensemble Methods for Imbalanced Learning and Anomaly Detection" on the topic of imbalanced fraud detection. Despite the thesis is written in German, the code and repository are in English.
├── data
│ ├── creditcard
│ └── kddcup
├── model_utils
├── models
├── requirements.txt
A requirements.txt
file is provided in the root directory of the repository. To install the required packages, run the following command:
pip install -r requirements.txt
This will also install the local package model_utils
which is used for the unified evaluation of all models.
NOTE: The model_utils
package is not yet available on PyPi. A virtual environment is also recommended for the usage of this repository.
Only the Credit Card Fraud 2013 dataset is included in the repository. However, the original dataset is in the Rdata
format and must be converted to a CSV
file first using the provided R-script Rdata2CSV.ipynb
. Preprocessing can be done using the provided Jupyter Notebook CC_preprocessing.ipynb
.
For the KDDCUP99-dataset, preprocessing can be done using the provided Jupyter Notebook KDDCUP_preprocessing.ipynb
. The dataset corrected.gz
can be downloaded from the UCI KDD Archive. Prepare the dataset by running the provided Jupyter Notebook KDD_preprocessing.ipynb
.
The models
directory contains the code for all 18 models, sampling, boosting and imputation methods. The entire code is provided in Jupyter Notebooks.