Feature extraction for CERT insider threat test dataset

This is a script for extracting features (csv format) from the CERT insider threat test dataset [1], [2], versions 4.1 to 6.2. For more details, please see this paper: Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning.

[1] Lindauer, Brian (2020): Insider Threat Test Dataset. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/12841247.v1

[2] J. Glasser and B. Lindauer, "Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data," 2013 IEEE Security and Privacy Workshops, San Francisco, CA, 2013, pp. 98-104, doi: 10.1109/SPW.2013.37.

Run feature_extraction script

Require python3, numpy, pandas, joblib. The script is written and tested in Linux only.
By default the script extracts week, day, session, and sub-session data (as in the paper).
To run the script, place it in a folder of a CERT dataset (e.g. r4.2, decompressed from r4.2.tar.bz2 downloaded here), then run python3 feature_extraction.py
To change number of cores used in parallelization (default 8), use python3 feature_extraction.py numberOfCores, e.g python3 feature_extraction.py 16.

Extracted Data

Extracted data is stored in ExtractedData subfolder.

Note that in the extracted data, insider is the label indicating the insider threat scenario (0 is normal). Some extracted features (subs_ind, starttime, endtime, sessionid, user, day, week) are for information and may or may not be used in training machine learning approaches.

Pre-extracted data from CERT insider threat test dataset r5.2 (gzipped) can be found in here.

Data representations

From the extracted data, temporal_data_representation.py can be used to generate different data representations, as presented in this paper: Anomaly Detection for Insider Threats Using Unsupervised Ensembles.

python3 temporal_data_representation.py --help

Sample classification and anomaly detection results

Sample code is provided in:

sample_classification.py for classification (as in Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning).
sample_anomaly_detection.py for anomaly detection (as in Anomaly Detection for Insider Threats Using Unsupervised Ensembles).

Citation

If you use the source code, or the extracted datasets, please cite the following paper:

D. C. Le, N. Zincir-Heywood and M. I. Heywood, "Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning," in IEEE Transactions on Network and Service Management, vol. 17, no. 1, pp. 30-44, March 2020, doi: 10.1109/TNSM.2020.2967721.

Data representations and anomaly detection:

D. C. Le, N. Zincir-Heywood, "Anomaly Detection for Insider Threats Using Unsupervised Ensembles," in IEEE Transactions on Network and Service Management, vol. 18, no. 2, pp. 1152–1164. June 2021, doi:http://doi.org/10.1109/TNSM.2021.3071928.

ShijieQ / feature-extraction-for-CERT-insider-threat-test-datasets