
HIL-based Augmented ICS (HAI) Security Dataset


The HAI dataset was collected from a realistic industrial control system (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation.

Click here to find out more about HAI dataset.

Please e-mail us here if you have any questions about the dataset.


Background

  • In 2017, three laboratory-scale CPS testbeds were initially launched, namely GE’s turbine testbed, Emerson’s boiler testbed, and FESTO’s modular production system (MPS) water-treatment testbed. These testbeds were related to relatively simple processes, and were operated independently of each other.

  • In 2018, a complex process system was built to combine the three systems using a hardware-in-the-loop (HIL) simulator, where thermal power generation and pumped-storage hydropower generation were simulated. This ensured that the variables were highly coupled and correlated, yielding a richer dataset. In addition, an Open Platform Communications Unified Architecture (OPC-UA) gateway was installed to facilitate data collection from heterogeneous devices.

  • The first version of the HAI dataset, HAI 1.0, was made available on GitHub and Kaggle in February 2020. This dataset included ICS operational data from both normal and anomalous situations for 38 attacks. Subsequently, a debugged version of HAI 1.0, namely HAI 20.07, was released for the HAICon 2020 competition in August 2020.

  • HAI 21.03 was released in 2021 and is based on a more tightly coupled HIL simulator that produces clearer attack effects, with additional attacks. It provides more quantitative information, covers a variety of operational situations, and gives better insight into the dynamic changes of the physical system.

HAI Testbed

The testbed consists of four different processes: boiler, turbine, water-treatment and HIL simulation:

  • Boiler Process (P1): A water-to-water heat-transfer process with low pressure and moderate temperature. It is controlled by Emerson's Ovation DCS.

  • Turbine Process (P2): A rotor kit process that closely simulates the behavior of an actual rotating machine. It is controlled by GE's Mark VIe DCS.

  • Water-treatment Process (P3): A water-treatment process that includes the pumping of water to the upper reservoir and releasing it back into the lower reservoir. It is controlled by Siemens's S7-300 PLC.

  • HIL Simulation (P4): The boiler and turbine processes are interconnected to remain synchronous with the rotating speed of the virtual steam-turbine power generation model. The pump and valve in the water-treatment process are controlled by the pumped-storage hydropower generation model. dSPACE's SCALEXIO system is used for the HIL simulations and is interconnected with the real-world processes through a Siemens S7-1500 PLC and ET200 remote I/O devices, with data acquisition handled through an OPC gateway.

HAI Dataset

Two major versions of the HAI dataset have been released thus far. Each dataset consists of several CSV files, and each file is continuous in time. The quantitative summary of each version is as follows:

Note: The version numbering follows a date-based scheme, where the version number indicates the release date of the HAI dataset. HAI 20.07 is the bug-fixed release of the first version, HAI 1.0, which was released in February 2020.

| Version | Data points | Normal dataset: file | Interval | Size | Attack dataset: file | Attack count | Interval | Size |
|---|---|---|---|---|---|---|---|---|
| HAI 21.03 | 78 points/sec | train1.csv | 60 hours | 100 MB | test1.csv | 5 attacks | 12 hours | 22 MB |
| | | train2.csv | 63 hours | 116 MB | test2.csv | 20 attacks | 33 hours | 62 MB |
| | | train3.csv | 229 hours | 246 MB | test3.csv | 8 attacks | 30 hours | 56 MB |
| | | | | | test4.csv | 5 attacks | 11 hours | 20 MB |
| | | | | | test5.csv | 12 attacks | 26 hours | 48 MB |
| HAI 20.07 (HAI 1.0) | 59 points/sec | train1.csv | 86 hours | 127 MB | test1.csv | 28 attacks | 81 hours | 119 MB |
| | | train2.csv | 91 hours | 98 MB | test2.csv | 10 attacks | 42 hours | 62 MB |

Data fields

The time-series data in each CSV file is continuous in time. The first column represents the observed time as “yyyy-MM-dd hh:mm:ss”, while the remaining columns provide the recorded SCADA data points. The last four columns provide data labels indicating whether an attack occurred: the attack column applies to all processes, and the other three columns apply to the corresponding control processes.

Refer to the latest technical manual for the details for each column.

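| time | P1_B2004 | P2_B2016 | ... | P4_HT_LD | attack | attack_P1 | ... | attack_P3 |
|---|---|---|---|---|---|---|---|---|
| 20190926 13:00:00 | 0.09830 | 1.07370 | ... | 0 | 0 | 0 | ... | 0 |
| 20190926 13:00:01 | 0.09830 | 1.07410 | ... | 0 | 1 | 0 | ... | 1 |
| 20190926 13:00:02 | 0.09830 | 1.07380 | ... | 0 | 1 | 0 | ... | 1 |
| 20190926 13:00:03 | 0.09830 | 1.07360 | ... | 0 | 1 | 1 | ... | 1 |
| 20190926 13:00:04 | 0.09830 | 1.07430 | ... | 0 | 1 | 1 | ... | 1 |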
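The column layout described above (one timestamp column, then the data points, then four label columns) can be separated with the Python standard library. This is only a minimal sketch: the in-memory sample is hypothetical, it assumes comma-separated values, and the `attack_P2` column name and its values are assumptions filled in for illustration. The helper names `split_columns` and `read_rows` are not part of the dataset.

```python
import csv
import io

# Hypothetical in-memory sample mirroring the layout above; real files have
# many more data-point columns between the timestamp and the labels.
# The attack_P2 column and its values are assumed for illustration.
SAMPLE = """time,P1_B2004,P2_B2016,P4_HT_LD,attack,attack_P1,attack_P2,attack_P3
20190926 13:00:00,0.09830,1.07370,0,0,0,0,0
20190926 13:00:01,0.09830,1.07410,0,1,0,0,1
20190926 13:00:03,0.09830,1.07360,0,1,1,0,1
"""

N_LABELS = 4  # the last four columns are labels, per the description above


def split_columns(header):
    """Return (time column, data-point columns, label columns)."""
    return header[0], header[1:-N_LABELS], header[-N_LABELS:]


def read_rows(handle):
    """Yield (timestamp, {point: float}, {label: int}) for each CSV row."""
    reader = csv.reader(handle)
    header = next(reader)
    _, points, labels = split_columns(header)
    for row in reader:
        values = dict(zip(points, map(float, row[1:-N_LABELS])))
        flags = dict(zip(labels, map(int, row[-N_LABELS:])))
        yield row[0], values, flags


rows = list(read_rows(io.StringIO(SAMPLE)))
# Timestamps of rows where any process is labeled as under attack.
attacked = [ts for ts, _, flags in rows if flags["attack"] == 1]
print(attacked)
```

Slicing from the end of the header avoids hard-coding the number of data points, which differs between dataset versions.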

Getting the dataset

NOTICE: All data files are compressed with gzip (GNU zip) because of the strict 100 MB size limit for individual files in a repository.

Type git clone, then paste the URL below.

$ git clone https://github.com/icsdataset/hai

To unzip multiple gzip files, you can use:

$ gunzip *.gz
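If decompressing everything up front is not convenient, the compressed files can also be read directly. The following is a minimal sketch using only the Python standard library; the helper names `read_header` and `count_rows` are hypothetical, and it assumes the files are comma-separated with a single header row.

```python
import csv
import gzip


def read_header(path):
    """Return the column names from a gzip-compressed CSV file."""
    # "rt" opens the compressed stream in text mode so csv can consume it.
    with gzip.open(path, "rt", newline="") as handle:
        return next(csv.reader(handle))


def count_rows(path):
    """Count data rows (excluding the header) without loading the whole file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        return sum(1 for _ in reader)
```

For example, `read_header("train1.csv.gz")` would return the column names of that file, where the path is an assumption about where the clone placed it. Streaming the rows keeps memory use low for the larger training files.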

Performance Evaluation

It is strongly recommended to evaluate your anomaly detection algorithm with the TaPR (Time-series Aware Precision and Recall) method, which allows fair performance comparisons with other studies. Got something to suggest? Let us know!
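Range-based metrics generally operate on contiguous anomaly intervals rather than on individual points. As a preprocessing illustration only, the sketch below turns a 0/1 label sequence, such as the attack column, into inclusive index ranges. It is not an implementation of TaPR, and the function name `attack_ranges` is hypothetical.

```python
def attack_ranges(labels):
    """Return inclusive (start, end) index pairs for each run of 1s."""
    ranges = []
    start = None
    for i, flag in enumerate(labels):
        if flag and start is None:
            start = i  # a new attack interval begins here
        elif not flag and start is not None:
            ranges.append((start, i - 1))  # the interval ended on the previous row
            start = None
    if start is not None:
        # The sequence ended while an attack was still in progress.
        ranges.append((start, len(labels) - 1))
    return ranges


labels = [0, 1, 1, 1, 0, 0, 1, 1]
print(attack_ranges(labels))  # [(1, 3), (6, 7)]
```

The same function can be applied to a detector's binary output, so that labeled and predicted intervals are compared in the same form.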

Projects using the dataset

Here are some projects and experiments that are using or featuring the dataset in interesting ways. Got something to add? Let us know!

Change Log

Please refer to the technical manual for the detailed changes.

  • HAI 21.03 release (2021-03-25)
  • HAI 20.07 release (2020-07-22)
  • Initial release (2020-02-07)

Authors

Created by Hyeok-Ki Shin, Woomyo Lee, Jeong-Han Yun and HyoungChun Kim at the Affiliated Institute of ETRI, Daejeon, South Korea.

License

This work is licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).

References

  1. Hyeok-Ki Shin, Woomyo Lee, Jeong-Han Yun, and HyoungChun Kim, "HAI 1.0: HIL-based Augmented ICS Security Dataset", 13th USENIX Workshop on Cyber Security Experimentation and Test (CSET 20), Santa Clara, CA, 2020.
  2. Won-Seok Hwang, Jeong-Han Yun, Jonguk Kim, and HyoungChun Kim, "Time-Series Aware Precision and Recall for Anomaly Detection: Considering Variety of Detection Result and Addressing Ambiguous Labeling", CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2241-2244, 2019.
  3. Seungoh Choi, Jeong-Han Yun, and Sin-Kyu Kim, "A Comparison of ICS Datasets for Security Research Based on Attack Paths", In: Luiijf E., Žutautaitė I., Hämmerli B. (eds) Critical Information Infrastructures Security. CRITIS 2018. Lecture Notes in Computer Science, vol 11260. Springer, Cham.

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

| property | value |
|---|---|
| name | HIL-based Augmented ICS Security Dataset |
| alternateName | HAI Security Dataset |
| alternateName | hai security dataset |
| url | |
| sameAs | https://github.com/icsdataset/hai |
| description | The HAI security dataset was collected from a realistic Industrial Control System (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation. |
| provider | name: The Affiliated Institute of ETRI, South Korea; sameAs: https://github.com/icsdataset |
| license | name: CC BY 4.0; url: |
| citation | https://www.usenix.org/conference/cset19/presentation/shin https://dl.acm.org/doi/10.1145/3357384.3358118 |
