Boost fraud detection accuracy and developer efficiency with Intel's end-to-end, no-code workflows, which combine graph-neural-network-boosted features with multi-node distributed execution. Check out more workflow examples and reference implementations in the Developer Catalog.
- Overview
- Hardware requirements
- How It Works
- Getting Started
- Set Up and Run Single Node Pipeline with Docker
- Developer Containers for Advanced Users
- Run Docker Image in an Interactive Environment
- Run Using Argo Workflows on K8s Using Helm
- Set Up and Run Distributed Pipeline
- Expected Output
- Summary and Next Steps
- Learn More
- Support
Fraud detection has traditionally been tackled with classical machine learning algorithms such as gradient-boosted machines. However, these supervised algorithms can yield unsatisfactory precision and recall for several reasons:
- Severe class imbalance: the ratio of fraud to non-fraud transactions is extremely skewed, with fraud typically making up less than 1% of transactions
- Complex fraudster behavior that evolves over time: such behavior is difficult to capture with traditional ML techniques
- Scale of data: credit card transaction datasets can contain billions of transactions, which requires distributed preprocessing and training
- Latency of fraud detection: fraud must be detected quickly to minimize losses, highlighting the need for distributed inference
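The class-imbalance point is worth quantifying. The sketch below is a generic illustration (not code from this kit, and the labels are synthetic) of how a sub-1% fraud rate translates into the `scale_pos_weight` factor commonly passed to XGBoost to reweight the positive class:

```python
import numpy as np

# Hypothetical label vector with a 0.1% fraud rate (values are illustrative only)
y = np.zeros(100_000, dtype=int)
y[:100] = 1  # 100 fraudulent transactions out of 100,000

fraud_rate = y.mean()                                # fraction of positives
scale_pos_weight = (y == 0).sum() / (y == 1).sum()   # common XGBoost reweighting factor

print(f"fraud rate: {fraud_rate:.3%}, scale_pos_weight: {scale_pos_weight:.0f}")
# → fraud rate: 0.100%, scale_pos_weight: 999
```

At this ratio, a classifier that predicts "not fraud" for everything is 99.9% accurate, which is why AUCPR rather than accuracy is used throughout this kit.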
In Intel's Enhanced Fraud Detection reference kit, we employ Graph Neural Networks (GNNs), popular for their ability to capture complex behavioral patterns (e.g., fraudsters performing many small transactions from different cards to avoid detection). We also demonstrate an accuracy boost from GNN-boosted features over a baseline trained on traditional ML features alone. To avoid trading efficiency for accuracy, we enable distributed pipelines. Setting up distributed pipelines typically takes data scientists weeks and involves numerous technical challenges; our reference kit lets you benefit from distributed capabilities through a no-code, config-driven user interface. We also provide ways to customize our solution to your own use cases.
- Significantly boost fraud classification accuracy by augmenting classical ML features with features generated through Graph Neural Networks (GNNs)
- Utilize our distributed preprocessing, training and inference pipelines to detect fraud quickly
- Improve developer efficiency and experimentation with our no-code, config-driven user interface

To learn more, visit the Credit Card Fraud Detection GitHub repository.
Linux OS (Ubuntu 22.04) is used in this reference solution. Make sure the following dependencies are installed.
sudo apt update
Supported Hardware | Precision |
---|---|
Intel® 1st, 2nd, 3rd, and 4th Gen Xeon® Scalable Performance processors | FP32 |

Recommended Resources | Minimum |
---|---|
Memory | >200 GB |
Storage | >50 GB |
To benefit from our distributed pipeline, please ensure that:
- Password-less ssh is set up on all the nodes.
- A network file system (NFS) is set up for all the nodes to access.
- A high-speed network between nodes is set up.
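Password-less ssh is typically set up with a key pair plus `ssh-copy-id`. The commands below are a generic sketch, not part of the kit itself; the user and node names are placeholders, and your key type or path may differ:

```shell
# Create ~/.ssh and a key pair if they don't exist yet (no passphrase)
mkdir -p ~/.ssh
test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# Copy the public key to each worker node (placeholder hostnames -- repeat per node):
#   ssh-copy-id user@node1
#   ssh-copy-id user@node2

# Verify password-less login works without prompting:
#   ssh user@node1 hostname
```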
The high-level architecture of the reference use case is shown in the diagram below. We use a credit card transaction dataset open-sourced by IBM (commonly known as the tabformer dataset) in this reference use case to demonstrate the capabilities outlined in the Overview section.
The feature engineering stage ingests the raw data, encodes each column into features using the logic defined in the feature engineering config yaml file and saves processed data.
The GNN training stage creates homogeneous graphs by consuming the processed data generated by Task 1 and trains a GraphSAGE model on a self-supervised link-prediction task to learn latent representations of the nodes (cards and merchants). Once the GNN model is trained, the GNN workflow concatenates the card and merchant features generated by the model with the corresponding transaction features and saves the GNN-boosted features to a CSV file.
The XGBoost training stage trains a binary classification model using the data splitting, model parameters, and runtime parameters set in the XGB training config yaml file. AUCPR (Area Under the Precision-Recall Curve) is used as the evaluation metric due to its robustness on highly imbalanced datasets. Data splitting is based on temporal sequence to simulate a real-life scenario. The model performance on the tabformer dataset can be found in the table below.
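To make the evaluation setup concrete, the snippet below is an illustrative sketch (using scikit-learn and pandas, not code from the kit) of a year-based temporal split like the one shown in the results tables, plus an AUCPR computation via `average_precision_score`. The tiny dataset is made up:

```python
import pandas as pd
from sklearn.metrics import average_precision_score

# Made-up transactions with a 'year' column, mimicking the kit's temporal split
df = pd.DataFrame({"year": [2016, 2017, 2018, 2018, 2019, 2019],
                   "is_fraud": [0, 1, 0, 1, 0, 1]})
train = df[df.year < 2018]   # train on the past
val   = df[df.year == 2018]  # validate on the boundary year
test  = df[df.year > 2018]   # test on the future

# AUCPR on a toy example: averages precision over the ranked positives
y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
aucpr = average_precision_score(y_true, y_score)
print(round(aucpr, 4))  # → 0.8333
```

Splitting by time rather than at random prevents the model from "seeing the future", which would inflate the reported AUCPR.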
Note: if you are using the distributed pipeline, please repeat steps 1, 2 and 3 on the local disk of all nodes as well as on NFS.
Create a working directory for the use case. Use these commands to set up the log folder, data folder and corresponding subfolders inside the working directory. We assume that your working directory is work.
export WORKDIR=$PWD/work
mkdir work && cd work
mkdir data ml_tmp gnn_tmp
cd data
mkdir raw_data edge_data node_edge_data
How to get the tabformer dataset:
- Download transactions.tgz from https://github.com/IBM/TabFormer/tree/main/data/credit_card
- Upload the transactions.tgz file to the $WORKDIR/data/raw_data folder in your working directory
- Unzip the transactions.tgz file:
cd $WORKDIR/data/raw_data
tar -zxvf transactions.tgz
If you want to bring your own dataset, put your raw data in the $WORKDIR/data/raw_data folder.
Clone the repository into your working directory.
cd $WORKDIR
git clone https://github.com/intel/credit-card-fraud-detection
cd fraud-detection-usecase
git submodule update --init --recursive
Your folder structure should follow the directory structure as shown in the figure below.
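In case the figure does not render here, the layout implied by the setup commands above is roughly the following (the fraud-detection-usecase folder comes from the clone step):

```
work/
├── data/
│   ├── raw_data/
│   ├── edge_data/
│   └── node_edge_data/
├── ml_tmp/
├── gnn_tmp/
└── fraud-detection-usecase/
```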
- Run with Docker on single node
- Run distributed pipelines
For advanced users, we provide the options below:
- Developer containers for advanced users
- Run Docker Image in an Interactive Environment
- Run Using Argo Workflows on K8s Using Helm
To run distributed pipelines, please refer to Run distributed pipelines. For advanced users, you can use interactive mode to run the workflow containers. You can also check out the Developer containers for advanced users section to change the source code in the workflow containers to adapt to your own use case.
You'll need to install Docker Engine on your development system. Note that while Docker Engine is free to use, Docker Desktop may require you to purchase a license. See the Docker Engine Server installation instructions for details. To build and run this workload inside a Docker Container, ensure you have Docker Compose installed on your machine. If you don't have this tool installed please consult official Docker Compose installation documentation.
DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
mkdir -p $DOCKER_CONFIG/cli-plugins
curl -SL https://github.com/docker/compose/releases/download/v2.7.0/docker-compose-linux-x86_64 -o $DOCKER_CONFIG/cli-plugins/docker-compose
chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
docker compose version
Ensure you have downloaded the dataset as described in Prepare data directory and set the path to the dataset as described below.
# pass the absolute path to /work/data folder
export DATASET_DIR=<path-to-data-dir>
You can choose to pull or build the containers that are going to be used to run the pipeline.
cd docker
docker compose build
OR
cd docker
docker pull intel/ai-workflows:beta-fraud-detection-classical-ml
docker pull intel/ai-workflows:beta-fraud-detection-gnn
Note: if you're running the distributed pipeline, pull the classical-ml Docker image on all nodes.
Run the preprocessing using the following command:
docker compose run preprocess 2>&1 | tee preprocess.log
The preprocess workflow will ingest the raw data in $DATASET_DIR/raw_data/card_transaction.v1.csv, generate a preprocessed CSV file, and save it in the $OUTPUT_DIR/data/edge_data/ directory.
The table below shows some of the environment variables you can control according to your needs.
Environment Variable Name | Default Value | Description |
---|---|---|
CONFIG_DIR | $PWD/../configs/single-node | Directory of the config file |
OUTPUT_DIR | $PWD/output | Output preprocessed dataset directory |
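For example, assuming you are in the docker directory from the build step, you might export these before launching the service (the values here simply restate the defaults):

```shell
export CONFIG_DIR=$PWD/../configs/single-node   # where the preprocessing config yaml lives
export OUTPUT_DIR=$PWD/output                   # where the processed CSV will land
# then launch as before:
#   docker compose run preprocess 2>&1 | tee preprocess.log
```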
The preprocess pipeline must complete successfully before running the baseline training. The preprocess pipeline generates an output CSV file at $OUTPUT_DIR/data/edge_data/.
The baseline-training workflow will consume the CSV file generated by preprocess above and train an XGBoost model. It will also print AUCPR (area under the precision-recall curve) results to the console.
%%{init: {'theme': 'dark'}}%%
flowchart RL
VDATASETDIRrawdata{{"${DATASET_DIR}/raw_data"}} x-. /workspace/data/raw_data .-x preprocess
VOUTPUTDIRdataedgedata{{"${OUTPUT_DIR}/data/edge_data"}} x-. /workspace/data/edge_data .-x preprocess
VCONFIGDIR{{"${CONFIG_DIR:-$PWD/../configs/single-node}"}} x-.-x preprocess
VCONFIGDIR x-.-x baselinetraining[baseline-training]
VOUTPUTDIRdataedgedata x-. /workspace/data/edge_data .-x baselinetraining
VOUTPUTDIRbaselinemodels{{"${OUTPUT_DIR}/baseline/models"}} x-. /workspace/tmp/models .-x baselinetraining
VOUTPUTDIRbaselinelogs{{"${OUTPUT_DIR}/baseline/logs"}} x-. /workspace/tmp/logs .-x baselinetraining
classDef volumes fill:#0f544e,stroke:#23968b
class VDATASETDIRrawdata,VOUTPUTDIRdataedgedata,VCONFIGDIR,VCONFIGDIR,VOUTPUTDIRdataedgedata,VOUTPUTDIRbaselinemodels,VOUTPUTDIRbaselinelogs volumes
Run the workflow container with the command below.
docker compose run baseline-training 2>&1 | tee baseline-training.log
The table below shows some of the environment variables you can control according to your needs.
Environment Variable Name | Default Value | Description |
---|---|---|
CONFIG_DIR | $PWD/../configs/single-node | Directory of the config file |
OUTPUT_DIR | $PWD/output | Output directory. Should be the same as set in Step 2 |
Step 3: Train and Evaluate XGBoost with both edge features and GNN generated node features for Fraud Classification
To see the improvement over the baseline training, you can run the xgb-training pipeline. Before running it, the preprocess pipeline must complete successfully; it generates an output CSV file at $OUTPUT_DIR/data/edge_data/. The xgb-training workflow consumes the CSV file generated by preprocess above, runs the gnn-analytics pipeline to generate optimized features, and trains an XGBoost model using those features. It will also print AUCPR (area under the precision-recall curve) results to the console.
Note: as this step runs GNN training first, you won't see output for a while. Once GNN training finishes, output from XGBoost training will appear. Instructions for checking the GNN log are provided in the logs section.
%%{init: {'theme': 'dark'}}%%
flowchart RL
VDATASETDIRrawdata{{"${DATASET_DIR}/raw_data"}} x-. /workspace/data/raw_data .-x preprocess
VOUTPUTDIRdataedgedata{{"${OUTPUT_DIR}/data/edge_data"}} x-. /workspace/data/edge_data .-x preprocess
VCONFIGDIR{{"${CONFIG_DIR:-$PWD/../configs/single-node}"}} x-.-x preprocess
VOUTPUTDIRdataedgedataprocesseddatatrimmedcsv{{"${OUTPUT_DIR}/data/edge_data/processed_data_trimmed.csv"}} x-. /DATA_IN/processed_data.csv .-x gnnanalytics[gnn-analytics]
VOUTPUTDIRdatanodeedgedata{{"${OUTPUT_DIR}/data/node_edge_data"}} x-. /DATA_OUT .-x gnnanalytics
VOUTPUTDIRgnncheckpoint{{"${OUTPUT_DIR}/gnn_checkpoint"}} x-. /GNN_TMP .-x gnnanalytics
VCONFIGDIR x-.-x gnnanalytics
VOUTPUTDIRdatanodeedgedata x-. /workspace/data/node_edge_data .-x xgbtraining[xgb-training]
VOUTPUTDIRxgbtrainingmodels{{"${OUTPUT_DIR}/xgb-training/models"}} x-. /workspace/tmp/models .-x xgbtraining
VOUTPUTDIRxgbtraininglogs{{"${OUTPUT_DIR}/xgb-training/logs"}} x-. /workspace/tmp/logs .-x xgbtraining
VCONFIGDIR x-.-x xgbtraining
xgbtraining --> gnnanalytics
classDef volumes fill:#0f544e,stroke:#23968b
class VDATASETDIRrawdata,VOUTPUTDIRdataedgedata,VCONFIGDIR,VOUTPUTDIRdataedgedataprocesseddatatrimmedcsv,VOUTPUTDIRdatanodeedgedata,VOUTPUTDIRgnncheckpoint,VCONFIGDIR,VOUTPUTDIRdatanodeedgedata,VOUTPUTDIRxgbtrainingmodels,VOUTPUTDIRxgbtraininglogs,VCONFIGDIR volumes
Run the workflow container with the command below.
docker compose run xgb-training 2>&1 | tee xgb-training.log
This command runs the gnn-analytics pipeline to generate the node features first, then uses the edge features generated in Step 2 to train the XGBoost model, and prints AUCPR (area under the precision-recall curve) results to the console.
Note: this step runs the GNN training first, which can take several hours to finish.
The table below shows some of the environment variables you can control according to your needs.
Environment Variable Name | Default Value | Description |
---|---|---|
CONFIG_DIR | $PWD/../configs/single-node | Directory of the config file |
OUTPUT_DIR | $PWD/output | Dataset, logfile, and checkpoint output. Should contain ${OUTPUT_DIR}/data/edge_data/processed_data.csv |
You can also check logs of different pipelines using the following commands.
docker compose logs gnn-analytics -f
Run these commands to check the preprocess, baseline, and xgb-training logs:
cat preprocess.log
cat baseline-training.log
cat xgb-training.log
If you want to edit and run your own version of a workflow with the same Docker containers, you can use the dev pipelines below; the environment variables you can control are described for each.
This pipeline allows you to control the source code, scripts, and parameters for running the baseline XGBoost pipeline. It takes a preprocessed dataset and trains an XGBoost model.
docker compose run dev-baseline-training
Environment Variable Name | Default Value | Description |
---|---|---|
CONFIG_DIR | $PWD/../configs/single-node | Directory of the config file |
OUTPUT_DIR | $PWD/output | Dataset, logfile, and checkpoint output. Should contain ${OUTPUT_DIR}/data/edge_data/processed_data.csv |
PARAMETER | --baseline-training | Script parameter controlling which pipeline to run (--baseline-training/--gnn-analytics/--xgb-training) |
SCRIPT | /fraud-detection/wrapper.sh | Path of the script inside the container |
WORKSPACE | $PWD/../classical-ml | Submodule location of classical-ml |
This pipeline allows you to control the source code, scripts, and parameters for running the GNN Analytics pipeline. It takes a preprocessed dataset and generates the node features described in GNN Training.
docker compose run dev-gnn-analytics
Environment Variable Name | Default Value | Description |
---|---|---|
CONFIG_DIR | $PWD/../configs/single-node | Directory of the config file |
OUTPUT_DIR | $PWD/output | Dataset, logfile, and checkpoint output. Should contain ${OUTPUT_DIR}/data/edge_data/processed_data.csv |
PARAMETER | --gnn-analytics | Script parameter controlling which pipeline to run (--baseline-training/--gnn-analytics/--xgb-training) |
SCRIPT | /fraud-detection/wrapper.sh | Path of the script inside the container |
WORKSPACE | $PWD/../gnn-analytics | Submodule location of the gnn-workflow |
This pipeline allows you to control the source code, scripts, and parameters for running the GNN-boosted XGBoost training pipeline. It takes the GNN-boosted dataset and runs the training described in XGBoost Training.
docker compose run dev-xgb-training
Environment Variable Name | Default Value | Description |
---|---|---|
CONFIG_DIR | $PWD/../configs/single-node | Directory of the config file |
OUTPUT_DIR | $PWD/output | Dataset, logfile, and checkpoint output. Should contain ${OUTPUT_DIR}/data/edge_data/processed_data.csv |
PARAMETER | --xgb-training | Script parameter controlling which pipeline to run (--baseline-training/--gnn-analytics/--xgb-training) |
SCRIPT | /fraud-detection/wrapper.sh | Path of the script inside the container |
WORKSPACE | $PWD/../classical-ml | Submodule location of classical-ml |
Run the following command to stop all services and containers created by docker compose and remove them.
docker compose down
You can also choose to run the container in an interactive manner to execute the workflows manually after getting inside the containers. If your environment requires a proxy to access the internet, export your development system's proxy settings to the docker environment:
export DOCKER_RUN_ENVS="-e ftp_proxy=${ftp_proxy} \
-e FTP_PROXY=${FTP_PROXY} -e http_proxy=${http_proxy} \
-e HTTP_PROXY=${HTTP_PROXY} -e https_proxy=${https_proxy} \
-e HTTPS_PROXY=${HTTPS_PROXY} -e no_proxy=${no_proxy} \
-e NO_PROXY=${NO_PROXY} -e socks_proxy=${socks_proxy} \
-e SOCKS_PROXY=${SOCKS_PROXY}"
Use the following command to run the baseline-training workflow container interactively.
export DATASET_DIR=/path/to/data
export OUTPUT_DIR=$PWD/output
export WORKSPACE=$PWD/../classical-ml
export WRAPPER_SCRIPT=$PWD/../wrapper.sh
export CONFIG_DIR=$PWD/../configs/single-node
docker run -a stdout ${DOCKER_RUN_ENVS} \
-v ${DATASET_DIR}/raw_data:/fraud-detection/data/raw_data \
-v ${CONFIG_DIR}:/fraud-detection/configs \
-v ${OUTPUT_DIR}/data/edge_data:/fraud-detection/data/edge_data \
-v ${OUTPUT_DIR}/baseline/models:/workspace/tmp/models \
-v ${WORKSPACE}:/fraud-detection/classical-ml \
--privileged --init -it --rm --pull always \
-w /fraud-detection/classical-ml \
intel/ai-workflows:beta-fraud-detection-classical-ml \
bash
Run the command below for preprocessing and baseline training
/fraud-detection/wrapper.sh --preprocess --baseline-training
- Install Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 && \
chmod 700 get_helm.sh && \
./get_helm.sh
- Install Argo Workflows and Argo CLI
- Configure your Artifact Repository
Ensure that you have reviewed the Hardware Requirements and each of your nodes has sufficient memory for the GNN workflow.
export NAMESPACE=argo
helm install --namespace ${NAMESPACE} --set proxy=${http_proxy} fraud-detection ./chart
argo submit --from wftmpl/fraud-detection --namespace=${NAMESPACE}
To view your workflow progress
argo logs @latest -f
To run with Docker on single node, refer to Run with Docker on single node.
- Password-less ssh needs to be set up on all the nodes that you are using.
- Classical ML workflow requires code, configs and data directory to reside locally.
- GNN workflow requires the data directory and code repository to reside on Network File System (NFS).
- The nodes should be connected with a high-speed network to enjoy speedups.
Please repeat steps 1 through 3 in the section Set up and run single node pipeline with docker on all nodes.
We allow users to bring their own dataset, create their own preprocessing engine, build a custom graph dataset from processed CSV files, and construct their own XGBoost and GraphSAGE models. Please find more information in Customize our reference kit to your own use case.
- Prepare the workflow config yaml (workflow-data-preprocessing.yaml) such that it reflects the number of nodes you wish to use and their IP addresses.
env:
num_node: 2
node_ips: # the first item in the list is the master IP; make sure no IP has a trailing space
- IP1
- IP2
# If you followed our setup instructions, you can get <path-to-work-dir> by running echo $WORKDIR in your terminal
tmp_path: <path-to-work-dir-on-localdisk>/ml_tmp
data_path: <path-to-work-dir-on-localdisk>/data
config_path: <path-to-work-dir-on-localdisk>/fraud-detection-usecase/configs/distributed
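A common failure mode with these hand-written configs is a trailing space in an IP or a num_node that doesn't match the list length. The sketch below is my own helper, not part of the kit (it assumes PyYAML is installed), and validates those two points on an inline copy of the config with placeholder IPs:

```python
import yaml

# Inline stand-in for workflow-data-preprocessing.yaml (IPs are placeholders)
cfg_text = """
env:
  num_node: 2
  node_ips:
    - 10.0.0.1
    - 10.0.0.2
"""
env = yaml.safe_load(cfg_text)["env"]

# num_node must match the number of listed IPs
assert env["num_node"] == len(env["node_ips"]), "num_node != number of node_ips"
# no IP may carry leading/trailing whitespace (a frequent copy-paste mistake)
for ip in env["node_ips"]:
    assert ip == ip.strip(), f"whitespace around IP: {ip!r}"
print("config OK")
```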
- Pass the workflow config yaml to the Classical ML workflow container and launch the workflow container from the master node with the command below.
# on master node
cd <path-to-work-dir-on-localdisk>/fraud-detection-usecase/classical-ml
./run-workflow.sh <path-to-work-dir-on-localdisk>/fraud-detection-usecase/configs/distributed/workflow-data-preprocessing.yaml
The Classical ML workflow saves the processed data inside the /work/data/edge_data folder on the local disk of the master node.
- Prepare the workflow config yaml (workflow-baseline.yaml) that specifies the number of nodes and their IP addresses.
env:
num_node: 2
node_ips: # the first item in the list is the master IP; make sure no IP has a trailing space
- IP1
- IP2
# If you followed our setup instructions, you can get <path-to-work-dir> by running echo $WORKDIR in your terminal
tmp_path: <path-to-work-dir-on-localdisk>/ml_tmp
data_path: <path-to-work-dir-on-localdisk>/data
config_path: <path-to-work-dir-on-localdisk>/fraud-detection-usecase/configs/distributed
- Pass the workflow config yaml to the Classical ML workflow container and run the workflow container with the command below.
# on master node
cd <path-to-work-dir-on-localdisk>/fraud-detection-usecase/classical-ml
./run-workflow.sh <path-to-work-dir-on-localdisk>/fraud-detection-usecase/configs/distributed/workflow-baseline.yaml
- Copy the processed data from the local disk of the master node to NFS.
# on master node
scp <path-to-work-dir-on-localdisk>/data/edge_data/processed_data.csv <path-to-work-dir-on-NFS>/data/edge_data/
- Prepare the workflow config yaml (workflow-gnn-training.yaml) by specifying the number of nodes and their IP addresses.
env:
num_node: 2
node_ips:
- IP1
- IP2
#tmp_path used to save model, embeddings, partitions...
tmp_path: <path-to-work-dir-on-NFS>/gnn_tmp
#data_path should contain processed_data.csv
data_path: <path-to-work-dir-on-NFS>/data/edge_data
#in_data_filename is the name of input csv file
in_data_filename: processed_data.csv
#out_path will contain the output csv with the tabular data and new node embeddings
out_path: <path-to-work-dir-on-NFS>/data/node_edge_data
#config_path will contain all three configs required by GNN workflow
config_path: <path-to-work-dir-on-NFS>/fraud-detection-usecase/configs/distributed
- Pass the workflow config yaml to the GNN workflow container and run the workflow container from the master node with the command below.
# on master node
cd <path-to-work-dir-on-NFS>/fraud-detection-usecase/gnn-analytics
./run-workflow.sh <path-to-work-dir-on-NFS>/fraud-detection-usecase/configs/distributed/workflow-gnn-training.yaml
The distributed GNN workflow will save GNN-boosted features to the /work/data/node_edge_data/ folder on NFS.
- Copy the GNN-boosted data from NFS to the local disk of the master node.
# on master node
scp <path-to-work-dir-on-NFS>/data/node_edge_data/tabformer_with_gnn_emb.csv <path-to-work-dir-on-localdisk>/data/node_edge_data/
- Prepare the workflow config yaml (workflow-xgb-training.yaml) that specifies the parameters for running the workflow for final XGBoost model training.
env:
num_node: 2
node_ips: # the first item in the list is the master IP; make sure no IP has a trailing space
- IP1
- IP2
# If you followed our setup instructions, you can get <path-to-work-dir> by running echo $WORKDIR in your terminal
tmp_path: <path-to-work-dir-on-localdisk>/ml_tmp
data_path: <path-to-work-dir-on-localdisk>/data
config_path: <path-to-work-dir-on-localdisk>/fraud-detection-usecase/configs/distributed
- Pass the workflow config yaml to the Classical ML workflow container and run the workflow container with the command below.
# on master node
cd <path-to-work-dir-on-localdisk>/fraud-detection-usecase/classical-ml
./run-workflow.sh <path-to-work-dir-on-localdisk>/fraud-detection-usecase/configs/distributed/workflow-xgb-training.yaml
We use Area Under the Precision-Recall Curve (AUCPR) as the evaluation metric; it ranges from 0 to 1, and higher is better.
You should expect to see a boost in AUCPR for test split by using GNN-boosted features.
Data split | Number of examples | AUCPR - Edge features only | AUCPR - GNN-boosted features |
---|---|---|---|
Train (year < 2018) | 20,604,847 | 0.92 | 0.98 |
Val (year = 2018) | 1,689,822 | 0.91 | 0.93 |
Test (year > 2018) | 1,904,167 | 0.88 | 0.94 |
You should expect to see a boost in AUCPR for test split by using GNN-boosted features.
Data split | Number of examples | AUCPR - Edge features only | AUCPR - GNN-boosted features |
---|---|---|---|
Train (year < 2018) | 20,604,847 | 0.95 | 0.96 |
Val (year = 2018) | 1,689,822 | 0.90 | 0.95 |
Test (year > 2018) | 1,904,167 | 0.88 | 0.94 |
You will see logs similar to the ones below once you run the use case successfully. Please note that the timing numbers depend on your hardware.
Preprocessing dataset
Failed to read model training configurations. This is either due to wrong parameters defined in the config file as shown: 'training'
Or there is no need for model training.
enter single-node mode...
reading data...
(24386900, 15)
preparing data...
engineering features...
splitting data...
encoding features...
saving data...
data saved under the path /fraud-detection/data/edge_data/processed_data.csv
Our workflow runs hyperparameter optimization by default and prints the best trial's hyperparameters.
Failed to read data preprocessing steps. This is either due to wrong parameters defined in the config file as shown: 'data_preprocess' or there is no need for data preprocessing.
no need for training Failed to read end2end training configurations. This is either due to wrong parameters defined in the config file as shown: 'end2end_training' or there is no need for End-to-End training.
enter single-node mode...
reading training data...
reading without dropping columns...
data has the shape (24198836, 26)
start training models soon...
(24198836, 22)
read and prepare data for training...
start xgboost HPO...
0%| | 0/10 [00:00<?, ?it/s][0] train-aucpr:0.39670 eval-aucpr:0.26043 test-aucpr:0.26432
[999] train-aucpr:0.98697 eval-aucpr:0.87243 test-aucpr:0.81485
Best trial: 0. Best value: 0.872425: 10%|█ | 1/10 [22:23<3:21:33, 1343.71s/it][I 2023-04-26 23:19:42,948] Trial 0 finished with value: 0.8724252645144079 and parameters: {'eta': 0.15, 'max_depth': 7, 'subsample': 0.5580179751710447, 'colsample_bytree': 0.8829791679963901, 'lambda': 0.6988028968743126, 'alpha': 0.9281407432733514, 'min_child_weight': 2}. Best is trial 0 with value: 0.8724252645144079.
...
[0] train-aucpr:0.20233 eval-aucpr:0.11631 test-aucpr:0.11132
[999] train-aucpr:0.93912 eval-aucpr:0.88756 test-aucpr:0.81800
Best trial: 7. Best value: 0.91189: 100%| 10/10 [3:32:04<00:00, 1272.49s/it]
Trial 9 finished with value: 0.8875619161413042 and parameters: {'eta': 0.17, 'max_depth': 5, 'subsample': 0.8748137912212397, 'colsample_bytree': 0.9733703420536219, 'lambda': 0.565297835149586, 'alpha': 0.43564481669099264, 'min_child_weight': 2}.
Best is trial 7 with value: 0.9118896651018639.
Value: 0.9118896651018639
Params:
eta: 0.1
max_depth: 9
subsample: 0.6372646065696512
colsample_bytree: 0.5940969469756718
lambda: 0.023810403412340413
alpha: 0.4955030354495986
min_child_weight: 9
aucpr of the best configs on test set is 0.8779567630966963
We saved our best model's hyper-parameters in $WORKDIR/fraud-detection-usecase/configs/single-node/baseline-xgb-training.yaml. Simply comment out the hpo_spec section and uncomment the model_spec section to reproduce our results.
Failed to read data preprocessing steps. This is either due to wrong parameters defined in the config file as shown: 'data_preprocess' or there is no need for data preprocessing.
no need for HPO
Failed to read end2end training configurations. This is either due to wrong parameters defined in the config file as shown: 'end2end_training' or there is no need for End-to-End training.
enter single-node mode...
reading training data...
reading without dropping columns...
data has the shape (24198836, 26)
start training models soon...
(24198836, 22)
read and prepare data for training...
start xgboost model training...
[0] train-aucpr:0.41938 eval-aucpr:0.41885 test-aucpr:0.36533
[100] train-aucpr:0.78485 eval-aucpr:0.90691 test-aucpr:0.89396
[200] train-aucpr:0.81402 eval-aucpr:0.92188 test-aucpr:0.89991
[300] train-aucpr:0.84193 eval-aucpr:0.92164 test-aucpr:0.89123
[400] train-aucpr:0.86475 eval-aucpr:0.91879 test-aucpr:0.88308
[500] train-aucpr:0.87911 eval-aucpr:0.91601 test-aucpr:0.87810
[600] train-aucpr:0.88980 eval-aucpr:0.91699 test-aucpr:0.87880
[700] train-aucpr:0.89974 eval-aucpr:0.91588 test-aucpr:0.88025
[800] train-aucpr:0.90777 eval-aucpr:0.91510 test-aucpr:0.87856
[900] train-aucpr:0.91517 eval-aucpr:0.91347 test-aucpr:0.87711
[999] train-aucpr:0.92204 eval-aucpr:0.91267 test-aucpr:0.87582
start xgboost model testing...
testing results: aucpr on test set is 0.8758093307052992
xgboost model is saved under /workspace/tmp/models.
Our workflow runs hyperparameter optimization by default and prints the best trial's hyperparameters.
Creating Container docker-gnn-analytics-1
Created Container docker-gnn-analytics-1
Starting Container docker-gnn-analytics-1
Failed to read data preprocessing steps. This is either due to wrong parameters defined in the config file as shown: 'data_preprocess' or there is no need for data preprocessing.
no need for training Failed to read end2end training configurations. This is either due to wrong parameters defined in the config file as shown: 'end2end_training' or there is no need for End-to-End training.
enter single-node mode...
reading training data...
reading without dropping columns...
data has the shape (24198836, 154)
start training models soon...
(24198836, 150)
read and prepare data for training...
start xgboost HPO...
[0] train-aucpr:0.19283 eval-aucpr:0.03559 test-aucpr:0.02577
[999] train-aucpr:0.94984 eval-aucpr:0.89502 test-aucpr:0.86376
Best trial: 0. Best value: 0.89502: 10%|█ | 1/10 [17:27<2:37:05, 1047.28s/it][I 2023-04-27 21:51:02,449] Trial 0 finished with value: 0.8950199939942034 and parameters: {'eta': 0.2, 'max_depth': 4, 'subsample': 0.581804256586296, 'colsample_bytree': 0.3397014497235425, 'lambda': 0.4620219935728601, 'alpha': 0.8976349208109945, 'min_child_weight': 3}. Best is trial 0 with value: 0.8950199939942034.
...
Best trial: 5. Best value: 0.930132: 100%| 10/10 [2:48:54<00:00, 1013.50s/it]
[I 2023-04-28 00:22:30,154] Trial 9 finished with value: 0.9054094235120606 and parameters: {'eta': 0.1, 'max_depth': 5, 'subsample': 0.9740220576298498, 'colsample_bytree': 0.5915481518062535, 'lambda': 0.7854898534100563, 'alpha': 0.6962425381327314, 'min_child_weight': 3}.
Best is trial 5 with value: 0.930131689676218.
Value: 0.930131689676218
Params:
eta: 0.01
max_depth: 6
subsample: 0.8714669008983891
colsample_bytree: 0.8825897760478416
lambda: 0.4397103901613584
alpha: 0.8475402634335466
min_child_weight: 8
aucpr of the best configs on test set is 0.9224617296126916
We saved our best model's hyper-parameters in $WORKDIR/fraud-detection-usecase/configs/single-node/xgb-training.yaml. Simply comment out the hpo_spec section and uncomment the model_spec section to reproduce our results.
Failed to read data preprocessing steps. This is either due to wrong parameters defined in the config file as shown: 'data_preprocess' or there is no need for data preprocessing.
no need for HPO
Failed to read end2end training configurations. This is either due to wrong parameters defined in the config file as shown: 'end2end_training' or there is no need for End-to-End training.
enter single-node mode...
reading training data...
reading without dropping columns...
data has the shape (24198836, 154)
start training models soon...
(24198836, 150)
read and prepare data for training...
start xgboost model training...
[0] train-aucpr:0.48145 eval-aucpr:0.23717 test-aucpr:0.21672
[100] train-aucpr:0.80701 eval-aucpr:0.84678 test-aucpr:0.83616
[200] train-aucpr:0.85851 eval-aucpr:0.93348 test-aucpr:0.95080
[300] train-aucpr:0.88907 eval-aucpr:0.93487 test-aucpr:0.95312
[400] train-aucpr:0.91095 eval-aucpr:0.93121 test-aucpr:0.95203
[500] train-aucpr:0.92976 eval-aucpr:0.93077 test-aucpr:0.95142
[600] train-aucpr:0.94440 eval-aucpr:0.92910 test-aucpr:0.94902
[700] train-aucpr:0.95611 eval-aucpr:0.92940 test-aucpr:0.94795
[800] train-aucpr:0.96405 eval-aucpr:0.92913 test-aucpr:0.94606
[900] train-aucpr:0.97143 eval-aucpr:0.92884 test-aucpr:0.94424
[999] train-aucpr:0.97721 eval-aucpr:0.92824 test-aucpr:0.94275
start xgboost model testing...
testing results: aucpr on test set is 0.9427465838947949
xgboost model is saved under /workspace/tmp/models.
The steps above demonstrate an accuracy boost from using GNNs, an efficiency boost from distributed pipelines, and a no-code user experience through configs.
To customize our reference kit to support your needs, you can:
- Bring your own dataset: add your own source data to the /data/raw_data folder.
- Bring your own preprocessor: create your own preprocessor (i.e., edge featurizer) by editing data-preprocessing.yaml. More information on how to write the config yaml file can be found in the Classical ML workflow GitHub repository.
- Bring your own baseline: set parameters of your baseline model such as learning_rate, eval_metric, num_boost_round, verbose_eval and so on in baseline-xgb-training.yaml.
- Bring your own graph: define node columns and edge types in tabular2graph.yaml.
- Bring your own GNN: set parameters of your GraphSAGE model such as learning_rate, fan_out, epochs, eval_every and so on in gnn-training.yaml.
- Bring your own XGB model: set parameters of your final XGB model such as learning_rate, eval_metric, num_boost_round, verbose_eval and so on in xgb-training.yaml.

Make sure to edit your configs in the /configs/single-node folder if you're using the single-node setting and in /configs/distributed if you're using the distributed setting.
To learn more about our workflows, please refer to:
To troubleshoot, please submit a GitHub issue.