This GitHub repo is prepared for the KDD 2022 hands-on tutorial. The project pipelines are built from Kedro 0.18.0 templates. Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering, such as modularity, separation of concerns and versioning, and applies them to machine-learning code.
"Kedro is a development workflow framework which aims to become the industry standard for developing production-ready code. Kedro helps structure your data pipeline using software engineering principles, eliminating project delays due to code rewrites and thereby providing more time to focus on building robust pipelines. Additionally, the framework provides a standardised approach to collaboration for teams building robust, scalable, deployable, reproducible and versioned data pipelines." --QuantumBlack, a McKinsey company
An example of a Kedro solution pipeline is shown below.
If you want to visualize the Kedro pipeline, please follow the instructions here. We will skip this part for this tutorial.
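For orientation, a Kedro pipeline is assembled from plain Python functions wrapped as nodes whose inputs and outputs are named datasets in the Data Catalog. Below is a minimal sketch of this structure; the function, dataset and parameter names are hypothetical and not taken from this repo:

```python
from kedro.pipeline import node, pipeline


def clean_transactions(raw_transactions, preprocessing_params):
    # Hypothetical preprocessing step: drop rows with a missing amount.
    return raw_transactions.dropna(subset=[preprocessing_params["amount_column"]])


def create_pipeline(**kwargs):
    # Each node maps named catalog inputs to named catalog outputs,
    # so Kedro can resolve the execution order and track the data.
    return pipeline(
        [
            node(
                func=clean_transactions,
                inputs=["raw_transactions", "params:preprocessing"],
                outputs="clean_transactions",
                name="clean_transactions_node",
            ),
        ]
    )
```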
We declared dependencies for the different pipelines of each use case and prepared shell scripts to install the corresponding virtual environments. Once a virtual environment is installed, you can run the notebooks using its custom kernel, and you can run the corresponding pipeline after activating the virtual env.
- First ensure you are in the root directory of this repository:
sh-4.2$ pwd
/home/ec2-user/SageMaker/anomaly-detection-spatial-temporal-data-workshop
- Setting up the environments involves running the shell scripts in the `src/` folder. To do so, navigate to the `src/` folder and run each script sequentially.
cd src
bash prepare_eland_environment.sh
bash prepare_gdn_environment.sh
bash prepare_taddy_environment.sh
bash prepare_nab_environment.sh
bash prepare_ncad_environment.sh
- The above scripts each create a new Python virtual environment, install the required Python packages and register the environment as a Jupyter kernel. We can then activate one of the environments using `source`. For example, running the following command will activate the `kedro-eland-venv` Python virtual environment:
source kedro-eland-venv/bin/activate
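If you want to double-check that the kernels were registered, you can list them from Python; this assumes the `jupyter_client` package that ships with Jupyter is available on the instance:

```python
from jupyter_client.kernelspec import KernelSpecManager

# After running the prepare_*_environment.sh scripts you should see kernels
# such as kedro-eland-venv, kedro-gdn-venv, kedro-taddy-venv, kedro-nab-venv
# and kedro-ncad-venv in this listing.
for name, path in KernelSpecManager().find_kernel_specs().items():
    print(f"{name}: {path}")
```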
We advise downloading the datasets before coming to the live session, so that you have a copy on your local laptop.
After setting up the environments, we will download the datasets relevant to each use case. The notebook `notebooks/download_data.ipynb` walks through how to download the datasets. Alternatively, follow the manual instructions below.
The Reddit dataset is sourced from Pushshift; the downloaded raw data file should be placed under `data/01_raw/user_behavior`. Follow the steps in `notebooks/download_data.ipynb`.
The WiFi network dataset is linked from the SpaMHMM repo; the downloaded raw data files should be placed under `data/01_raw/wifi`. Follow the steps in `notebooks/download_data.ipynb`.
Note: This dataset is hosted on Kaggle. You will need a Kaggle account to be able to download this dataset.
The dataset can be downloaded from here. Please download the two CSV files (`bs140513_032310.csv` and `bsNET140513_032310.csv`) and put them under `data/01_raw/financial_fraud`. Follow the steps in `notebooks/download_data.ipynb`.
The IoT dataset is sourced from the BATADAL website and is placed under `data/01_raw/iot`. This is done by the notebooks `notebooks/download_data.ipynb` and `notebooks/industrial_iot/1.0-nk-batadal-exploration.ipynb`; please be sure to run one of these notebooks to obtain the dataset.
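Once the downloads are complete, the raw files should sit under the `data/01_raw/` subfolders listed above. The following small sketch sanity-checks the layout from the repository root (the folder and file names come from this README; adjust if your paths differ):

```python
from pathlib import Path

# Expected raw-data locations, as described above. The financial fraud use case
# additionally needs bsNET140513_032310.csv in the same folder.
expected = {
    "user_behavior": Path("data/01_raw/user_behavior"),
    "wifi": Path("data/01_raw/wifi"),
    "financial_fraud": Path("data/01_raw/financial_fraud/bs140513_032310.csv"),
    "iot": Path("data/01_raw/iot"),
}

for use_case, path in expected.items():
    status = "found" if path.exists() else "MISSING"
    print(f"{use_case:>16}: {path} ({status})")
```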
You can select the custom kernel after installing the corresponding virtual environment for each use case. For example, to run the pipeline under the NCAD modeling framework, you can select the following kernel icon on the instance.
Under `notebooks/user_behavior`, choose `kedro-eland-venv`. If the environment was set up correctly, the notebook will automatically choose the correct environment.
Under `notebooks/telecom_network`, choose `kedro-gdn-venv` for the `*gdn` notebooks. If the environment was set up correctly, the notebook will automatically choose the correct environment.
Under `notebooks/financial_fraud`, choose `kedro-taddy-venv` for notebooks 1.0, 1.1, 2.1 and 3.1, and choose `kedro-nab-venv` for notebooks 1.2 and 2.2.
Under `notebooks/industrial_iot`, choose `kedro-gdn-venv` for the `*gdn` notebooks, `kedro-nab-venv` for the `*nab` notebooks, and `kedro-ncad-venv` for the `*ncad` notebooks. If the environment was set up correctly, the notebook will automatically choose the correct environment.
First activate the virtual environment for the specific use case:
source src/<name_of_virtual_env>/bin/activate
With the corresponding virtual environment activated, you can run the entire pipeline for a use case:
# make sure you are in the root directory
kedro run
You can also run a specific Kedro pipeline (sub-pipeline) with:
kedro run --pipeline <pipeline_name_in_registry>
If you want to run the pipeline with specific tags, you can run:
kedro run --pipeline <pipeline_name_in_registry> --tag <data_tag,model_tag>
You can even run a specific Kedro node function in the pipeline (sub-pipeline) with:
kedro run --node <node_name_in_registry>
For more details, you can run the command:
kedro run -h
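The same runs can also be triggered from Python, for example inside a notebook, through a Kedro session. Below is a minimal sketch, assuming you are in the project root with the relevant virtual environment activated (the pipeline name is a placeholder):

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Point Kedro at the project root (the directory containing conf/ and src/).
project_path = Path.cwd()
bootstrap_project(project_path)

with KedroSession.create(project_path=project_path) as session:
    # Equivalent to `kedro run --pipeline <pipeline_name_in_registry>`;
    # omit pipeline_name to run the default pipeline.
    session.run(pipeline_name="<pipeline_name_in_registry>")
```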
You can run the ELAND modeling framework for the Reddit user behavior anomaly use case.
To do this, follow the steps below. Since only one dataset uses the ELAND model, you won't need to change the input dataset name in `conf/base/parameters.yml`.
- Activate the ELAND model virtual env:
source src/kedro-eland-venv/bin/activate
(you would need to install the virtual env first)
- Run the pipeline:
kedro run
You can run the NAB and GDN modeling frameworks for the WiFi network anomaly use case.
To do this, follow the steps below, replacing `<model>` with one of `nab`, `gdn`.
- Set the input dataset to `wifi` in `conf/base/parameters.yml`
- Activate the relevant model virtual env:
source src/kedro-<model>-venv/bin/activate
(you would need to install the virtual env first)
- Run the pipeline:
kedro run
You can run the NAB and TADDY modeling frameworks for the financial fraud use case. For NAB, a time series of the amount spent is constructed for each unique (customer, category) pair. For TADDY, a dynamic interaction graph between customers and merchants is built, where each edge represents a transaction record between a customer and a merchant.
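To illustrate the NAB input construction, the sketch below builds a per-(customer, category) spending series from the raw BankSim CSV with pandas. It assumes the standard BankSim column names (`step`, `customer`, `category`, `amount`); the actual pipeline code in this repo may differ:

```python
import pandas as pd

# Raw BankSim transactions, downloaded to the path described above.
df = pd.read_csv("data/01_raw/financial_fraud/bs140513_032310.csv")

# BankSim wraps its string columns in extra quotes (e.g. "'C1093826151'").
for col in ["customer", "category"]:
    df[col] = df[col].str.strip("'")

# One time series of amount spent per simulation step for each
# unique (customer, category) pair.
series = (
    df.groupby(["customer", "category", "step"])["amount"]
    .sum()
    .unstack("step", fill_value=0.0)
)
print(series.shape)  # (number of (customer, category) pairs, number of steps)
```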
To do this, follow the steps below, replacing `<model>` with one of `nab`, `taddy`.
- Set the input dataset to `financial` in `conf/base/parameters.yml`
- Activate the relevant model virtual env:
source src/kedro-<model>-venv/bin/activate
(you would need to install the virtual env first)
- Run the pipeline:
kedro run
You can run the NAB, NCAD and GDN modeling frameworks for the IoT network anomaly use case. To do this, follow the steps below, replacing `<model>` with one of `nab`, `ncad`, `gdn`.
- Set the input dataset to `iot` in `conf/base/parameters.yml`
- Activate the relevant model virtual env:
source src/kedro-<model>-venv/bin/activate
- Run the pipeline:
kedro run
I. Introduction [5 mins]
II. Overview [10 mins]
- Overview of use-cases
- Telecom
- Consumer behavior
- Financial
- IoT
- Overview of algorithms
- NAB
- NCAD
- Eland
- GDN
- Taddy
- Mindmap
- Determine the right modeling framework for your data and anomaly type
- Overview of code repository
III. AWS Account and Environment Setup [20 mins]
IV. In-depth detail of algorithms [20 mins]
- NAB
- NCAD
- Eland
- GDN
- Taddy
V. Hands-on [2 hours]
- Downloading data sets
- Setting up local environments
- Training models using Kedro pipelines
- Training models using Jupyter Notebooks
VI. Conclusion and Take-away [5 mins]
- Subutai Ahmad, Alexander Lavin, Scott Purdy, and Zuha Agha. 2017. Unsupervised real-time anomaly detection for streaming data.
- Anisa Allahdadi, Ricardo Morla, and Jaime S. Cardoso. 2018. 802.11 Wireless Simulation and Anomaly Detection using HMM and UBM.
- Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit Dataset.
- Chris U. Carmona, François-Xavier Aubet, Valentin Flunkert, and Jan Gasthaus. 2021. Neural Contextual Anomaly Detection for Time Series.
- Jiho Choi, Taewook Ko, Younhyuk Choi, Hyungho Byun, and Chong-kwon Kim. 2021. Dynamic graph convolutional networks with attention mechanism for rumor detection on social media.
- Yuwei Cui, Chetan Surpur, Subutai Ahmad, and Jeff Hawkins. 2016. A comparative study of HTM and other neural network models for online sequence learning with streaming data.
- Ailin Deng and Bryan Hooi. 2021. Graph Neural Network-Based Anomaly Detection in Multivariate Time Series.
- Alexander Lavin and Subutai Ahmad. 2015. Evaluating Real-Time Anomaly Detection Algorithms – The Numenta Anomaly Benchmark.
- Yixin Liu, Shirui Pan, Yu Guang Wang, Fei Xiong, Liang Wang, Qingfeng Chen, and Vincent CS Lee. 2021. Anomaly Detection in Dynamic Graphs via Transformer.
- Edgar Alonso Lopez-Rojas and Stefan Axelsson. 2014. BankSim: A Bank Payments Simulator for Fraud Detection Research.
- Martin Happ, Matthias Herlich, Christian Maier, Jia Lei Du, and Peter Dorfinger. 2021. Graph-neural-network-based delay estimation for communication networks with heterogeneous scheduling policies.
- José Suárez-Varela et al. The Graph Neural Networking Challenge: A Worldwide Competition for Education in AI/ML for Networks.
- Riccardo Taormina et al. The Battle of the Attack Detection Algorithms: Disclosing Cyber Attacks on Water Distribution Networks.
- Shen Wang and Philip S. Yu. 2022. Graph Neural Networks in Anomaly Detection. In Graph Neural Networks: Foundations, Frontiers, and Applications, Lingfei Wu, Peng Cui, Jian Pei, and Liang Zhao (Eds.).
- Yulei Wu, Hong-Ning Dai, and Haina Tang. 2021. Graph Neural Networks for Anomaly Detection in Industrial Internet of Things.
- Tong Zhao, Bo Ni, Wenhao Yu, Zhichun Guo, Neil Shah, and Meng Jiang. 2021. Action Sequence Augmentation for Early Graph-based Anomaly Detection.
- Li Zheng, Zhenpeng Li, Jian Li, Zhao Li, and Jun Gao. 2019. AddGraph: Anomaly Detection in Dynamic Graph Using Attention-based Temporal GCN.
On Debian/Ubuntu systems, you need to install the python3-venv package using the following command.
apt-get install python3-venv
This project is licensed under the Apache-2.0 License.