Implementation of different graph neural network (GNN) based models for anomaly detection in multivariate time series in sensor networks.
An explicit graph structure modelling the interrelations between sensors is inferred during training and used for time series forecasting. Anomaly detection is based on the error between the predicted and actual values at each time step.
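As a rough illustration of the error-based detection step, the sketch below scores each time step by the absolute forecast error and flags steps whose score exceeds a threshold. It is not the repository's code: the function names, the max-over-sensors aggregation, the moving-average smoothing, and the choice of the largest validation score as threshold are all assumptions made for this example.

```python
import numpy as np

def anomaly_scores(pred, actual, kernel_size=1):
    """Per-time-step anomaly score from the forecast error.

    pred, actual: arrays of shape (time, sensors).
    kernel_size: width of an optional moving-average window (assumption);
                 expected to be no larger than the number of time steps.
    """
    err = np.abs(np.asarray(pred, dtype=float) - np.asarray(actual, dtype=float))
    score = err.max(axis=1)  # worst sensor at each step (assumed aggregation)
    if kernel_size > 1:
        kernel = np.ones(kernel_size) / kernel_size
        score = np.convolve(score, kernel, mode="same")
    return score

def detect(score, threshold):
    """Flag every time step whose score is strictly above the threshold."""
    return score > threshold

# Toy data: two sensors, six time steps, one deviation at step 4.
actual = np.zeros((6, 2))
pred = np.zeros((6, 2))
actual[4, 1] = 5.0

# Hypothetical scores from anomaly-free validation data.
val_scores = np.array([0.1, 0.3, 0.2])
threshold = val_scores.max()

flags = detect(anomaly_scores(pred, actual), threshold)
print(flags)
```

Only step 4 is flagged here, because it is the only step where the prediction and the measurement disagree by more than the threshold.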
- Python == 3.7
- cuda == 10.2
- [pytorch==1.8.1](https://pytorch.org/)
- [torch-geometric==1.7.2](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html)
Additional package files for torch-geometric (Python 3.7 and PyTorch 1.8.1) are provided in '/whl/' in case they are unavailable online.
Refer to https://pytorch-geometric.com/whl/ for other versions.
Create the Python environment, for example with conda:
conda create -n py37 python=3.7
Run the install bash script with either the cpu or the cuda flag, depending on the intended use.
# run after installing python
bash install.sh cpu
# or
bash install.sh cuda
The repository contains several models. GNN-LSTM is used by default and achieved the best performance.
Model with GNN feature expansion before a multi-layer LSTM. A single node embedding is used to infer the latent graph through vector similarity. The same embedding also serves as a node positional embedding, which is added to the GNN features before they are passed to the recurrent network.
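The graph inference step can be pictured with the sketch below, which connects every node to its most similar nodes in embedding space. It is a minimal example under assumptions: cosine similarity is used as the similarity measure, self-loops are excluded, and the function name `topk_similarity_graph` is hypothetical. The actual model may differ on any of these points.

```python
import numpy as np

def topk_similarity_graph(embeddings, topk):
    """Boolean adjacency matrix linking each node to its topk most similar nodes.

    embeddings: array of shape (nodes, embed_dim).
    Row i of the result marks the neighbors selected for node i.
    """
    emb = np.asarray(embeddings, dtype=float)
    n = emb.shape[0]
    topk = min(topk, n - 1)  # a node cannot have more than n - 1 neighbors

    # Cosine similarity between all pairs of node embeddings (assumption).
    norm = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norm, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-loops (assumption)

    # Indices of the topk largest similarities in each row.
    neighbors = np.argsort(-sim, axis=1)[:, :topk]

    adj = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), topk)
    adj[rows, neighbors.ravel()] = True
    return adj

# Four nodes forming two obvious pairs in a 2-D embedding space.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
adj = topk_similarity_graph(emb, topk=1)
print(adj.astype(int))
```

The resulting graph is directed in general, since node i choosing node j does not imply the reverse. An undirected variant could be obtained by symmetrizing the matrix.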
Spatial-Temporal Convolution GNN with attention. The data is split between an encoder and a decoder. The encoder creates a feature representation for each time step, while the decoder creates a single representation. The encoder-decoder attention is concatenated with the decoder output before being passed to the prediction layer.
Uses multiple embedding layers to parameterize the latent graph directly by the network.
Inspired by: https://arxiv.org/pdf/1705.03122.pdf.
Spatial-Temporal Convolution GNN with attention and graph mix-hop propagation.
Taken from: https://arxiv.org/pdf/2005.11650.pdf.
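For readers unfamiliar with mix-hop propagation, the sketch below follows a simplified reading of the cited paper: node features are repeatedly propagated over the normalized graph while a fraction `beta` of the original input is retained at every hop, and the outputs of all hops are combined through separate weight matrices. The function name, the NumPy formulation, and the value of `beta` are assumptions for illustration and do not reproduce the model code in this repository.

```python
import numpy as np

def mixhop_propagation(h_in, adj, weights, beta=0.05):
    """Simplified mix-hop propagation over a graph.

    h_in:    node features, shape (nodes, features).
    adj:     adjacency matrix, shape (nodes, nodes).
    weights: list of K + 1 matrices of shape (features, out), one per hop
             (hop 0 is the unpropagated input).
    beta:    fraction of the original input kept at each hop (placeholder value).
    """
    h_in = np.asarray(h_in, dtype=float)
    n = adj.shape[0]

    # Add self-loops and normalize each row to sum to one.
    a = np.asarray(adj, dtype=float) + np.eye(n)
    a_norm = a / a.sum(axis=1, keepdims=True)

    h = h_in
    out = h @ weights[0]
    for w in weights[1:]:
        # Propagate one hop further while retaining part of the input.
        h = beta * h_in + (1.0 - beta) * (a_norm @ h)
        out = out + h @ w
    return out

# Three nodes in a chain, two features, two propagation hops.
features = np.arange(6, dtype=float).reshape(3, 2)
chain = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
hop_weights = [np.eye(2) for _ in range(3)]
print(mixhop_propagation(features, chain, hop_weights))
```

Retaining part of the input at each hop is intended to keep node features from becoming indistinguishable as the number of hops grows.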
Vanilla multi-layer LSTM used for benchmarking.
Test dataset ('demo') included in the model folder.
SWaT and WADI datasets can be requested from iTrust.
Open the files in a spreadsheet program such as Excel, remove the leading empty rows, and save them as .csv files.
The CSV files should be placed in a folder with the same name ('swat' or 'wadi') in '/datasets/files/raw/<name>/<file>'
Additional datasets can either be loaded directly from a CSV file using the dataset 'from_csv'
or by creating a custom dataset following the examples found in the '/datasets/' folder.
If 'from_csv' is used, the data should come in the same format as the demo data included in this repository,
with the time series of each sensor in its own column. Only the test data should have
anomaly labels, and these belong in the last column.
The first column is assumed to be the timestamp. The files are to be placed in '/datasets/files/raw/from_csv/'.
If this option is chosen, data normalization is not available. Any preprocessing should be done manually.
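Since normalization has to be done by hand for 'from_csv', one possible approach is min-max scaling with statistics taken from the training data only. The sketch below is an assumption about a reasonable preprocessing step, not something the repository prescribes. It expects the timestamp column to have been dropped already and leaves the label column of the test data untouched. The function name `minmax_normalize` is hypothetical.

```python
import numpy as np

def minmax_normalize(train, test):
    """Scale sensor columns to [0, 1] using training statistics only.

    train: shape (time, sensors), sensor values without timestamps.
    test:  shape (time, sensors + 1), anomaly labels in the last column.
    """
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)

    lo = train.min(axis=0)
    hi = train.max(axis=0)
    # Constant sensors would give a zero range; fall back to 1 to avoid division by zero.
    scale = np.where(hi > lo, hi - lo, 1.0)

    train_n = (train - lo) / scale
    test_n = test.copy()
    test_n[:, :-1] = (test[:, :-1] - lo) / scale  # labels stay unchanged
    return train_n, test_n

# Two sensors; the second one is constant in the training data.
train = np.array([[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]])
test = np.array([[2.0, 10.0, 0.0], [8.0, 10.0, 1.0]])
train_n, test_n = minmax_normalize(train, test)
print(train_n)
print(test_n)
```

Test values may fall outside [0, 1] when they exceed the range seen during training, which is expected for anomalous readings.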
Suitable parameters for the SWaT, WADI, and demo datasets can be found in the bash scripts, which are the most convenient way to run the models.
# run from terminal
sh run.sh [dataset]
Examples:
# example 1
sh run.sh swat
# example 2
sh run.sh wadi
# example 3
sh run.sh demo
Run the main.py script from your terminal (bash, powershell, etc).
To change the default model and training hyperparameters, flags can be included.
Alternatively, the parameters can be changed within the file (argparse default values).
# run from terminal
python main.py -[flags]
Examples:
# example 1
python main.py -dataset demo -batch_size 4 -epochs 10
# example 2
python main.py -dataset swat -epochs 10 -topk 20 -embed_dim 128
# example 3
python main.py -dataset from_csv
Available flags:
-dataset
The dataset.
-window_size
Number of historical timesteps used in each sample.
-horizon
Number of prediction steps.
-val_split
Fraction of the data used for the validation dataset. Value between 0 and 1.
-transform
Sampling transform applied to the model input data (e.g. median).
-target_transform
Sampling transform applied to the model target values. (e.g. median, max).
-normalize
Boolean indicating whether data normalization should be applied.
-shuffle_train
Boolean indicating whether the training data should be shuffled.
-batch_size
Number of samples in each batch.
-embed_dim
Number of node embedding dimensions (Disabled for GNN-LSTM).
-topk
Number of allowed neighbors for each node.
-smoothing
Error smoothing kernel size.
-smoothing_method
Error smoothing kernel type (mean or exp).
-thresholding
Thresholding method (mean, max, or best). 'best' performs an exhaustive search for theoretical performance evaluation.
-epochs
Number of training epochs.
-early_stopping
Patience for early stopping: number of epochs without improvement before training stops.
-lr
Learning rate.
-betas
Adam optimizer beta parameters.
-weight_decay
Adam optimizer weight regularization parameter.
-device
Computing device (cpu or cuda).
-log_graph
Boolean for logging of learned graphs.
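To make the roles of -window_size and -horizon concrete, the sketch below cuts a multivariate series into input windows and matching forecast targets. It only illustrates the idea of the two flags. The function name `make_windows`, the stride of one, and the array layout are assumptions and may not match how the datasets in this repository build their samples.

```python
import numpy as np

def make_windows(series, window_size, horizon=1):
    """Split a series into (input window, forecast target) pairs.

    series: array of shape (time, sensors).
    Returns x with shape (samples, window_size, sensors)
    and y with shape (samples, horizon, sensors).
    """
    series = np.asarray(series, dtype=float)
    n_samples = series.shape[0] - window_size - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this window_size and horizon")

    # Each sample uses window_size past steps to predict the next horizon steps.
    x = np.stack([series[i:i + window_size] for i in range(n_samples)])
    y = np.stack([series[i + window_size:i + window_size + horizon]
                  for i in range(n_samples)])
    return x, y

# Ten time steps of two sensors.
series = np.arange(20).reshape(10, 2)
x, y = make_windows(series, window_size=4, horizon=2)
print(x.shape, y.shape)
```

With ten time steps, a window of four and a horizon of two, five samples can be formed, since each sample needs six consecutive steps.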
After the initial run, a '/runs/' folder will be automatically created.
A copy of the model state dict, a loss plot, plots for the learned graph representation
and some additional information will be saved for each run of the model.
Visualization of a t-SNE embedding of the learned undirected graph representation for the SWaT dataset
with 15 neighbors per node.
Plot of a directly parameterized uni-directional graph adjacency matrix with a single neighbor per node.
Node colors and labels indicate type of sensor.
P: Pump
MV: Motorized valve
UV: Dechlorinator
LIT: Level in tank
PIT: Pressure in tank
FIT: Flow in tank
AIT: Analyzer in tank (different chemical analyzers; NaCl, HCl, ORP meters, etc)
DPIT: Differential pressure indicating transmitter