HAI (HIL-based Augmented ICS) Security Dataset

The HAI dataset was collected from a realistic industrial control system (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation.

Click here to find out more about the HAI dataset.

Please e-mail us here if you have any questions about the dataset.

Background
HAI Testbed
HAI Dataset
Getting the Dataset
Performance Metric
Projects using the Dataset
Competitions
Contributors
Citation

Background

In 2017, three laboratory-scale CPS testbeds were initially launched, namely GE’s turbine testbed, Emerson’s boiler testbed, and FESTO’s modular production system (MPS) water-treatment testbed. These testbeds are related to relatively simple processes, and were operated independently of each other.
In 2018, a complex process system was built to combine the three systems using a HIL simulator that simulates thermal power generation and pumped-storage hydropower generation. As a result, the variables are highly coupled and correlated, which makes for a richer dataset. In addition, an Open Platform Communications Unified Architecture (OPC-UA) gateway was installed to facilitate data collection from heterogeneous devices.
The first version of the HAI dataset, HAI 1.0, was made available on GitHub and Kaggle in February 2020. This dataset included ICS operational data from normal situations and from anomalous situations covering 38 attacks. Subsequently, a debugged version of HAI 1.0, namely HAI 20.07, was released for the HAICon 2020 competition in August 2020.
HAI 21.03 was released in 2021 and was based on a more tightly coupled HIL simulator to produce clearer attack effects, with additional attacks. This version provides more quantitative information, covers a variety of operational situations, and gives better insight into the dynamic changes of the physical system.
HAI 22.04 contains more sophisticated attacks that are significantly more difficult to detect than those in the previous versions. Judging only by the baseline TaPR scores of HAICon 2020 and HAICon 2021, the attacks in HAI 22.04 are approximately four times more difficult to detect than those in HAI 21.03.

HAI Testbed

The testbed consists of four processes: the boiler process, the turbine process, the water-treatment process, and the HIL simulation:

Boiler Process (P1): This includes water-to-water heat transfer at a low pressure and a moderate temperature. This process is controlled using Emerson Ovation DCS.
Turbine Process (P2): A rotor kit process that closely simulates the behavior of an actual rotating machine. It is controlled by GE's Mark VIe DCS.
Water-treatment Process (P3): This process includes pumping water to the upper reservoir and releasing it back into the lower reservoir. It is controlled by a Siemens S7-300 PLC.
HIL Simulation (P4): Both the boiler and turbine processes are interconnected to synchronize with the rotating speed of the virtual steam-turbine power generation model. The pump and valve in the water-treatment process are controlled by the pumped-storage hydropower generation model. The dSPACE SCALEXIO system is used for the HIL simulations and is interconnected with the real-world processes through a Siemens S7-1500 PLC and ET200 remote IO devices for the data-acquisition system based on the OPC gateway.

HAI Dataset

Three major versions of the HAI dataset have been released thus far. Each dataset consists of several CSV files, and each file satisfies time continuity. The quantitative summary of each version is as follows:

Note: The version numbering follows a date-based scheme, where the version number indicates the release date of the HAI dataset. HAI 20.07 is the bug-fixed version of HAI 1.0, which was released in February 2020.

Version	Data Points	Normal Dataset File	Normal Interval	Normal Size	Attack Dataset File	Attack Count	Attack Interval	Attack Size
HAI 22.04	86 points/sec	train1.csv	26 hours	51 MB	test1.csv	7 attacks	24 hours	48 MB
		train2.csv	56 hours	109 MB	test2.csv	17 attacks	23 hours	45 MB
		train3.csv	35 hours	67 MB	test3.csv	10 attacks	17 hours	33 MB
		train4.csv	24 hours	46 MB	test4.csv	24 attacks	36 hours	70 MB
		train5.csv	66 hours	125 MB
		train6.csv	72 hours	137 MB
		Total	279 hours	534 MB	Total	58 attacks	100 hours	196 MB
HAI 21.03	78 points/sec	train1.csv	60 hours	100 MB	test1.csv	5 attacks	12 hours	22 MB
		train2.csv	63 hours	116 MB	test2.csv	20 attacks	33 hours	62 MB
		train3.csv	229 hours	246 MB	test3.csv	8 attacks	30 hours	56 MB
					test4.csv	5 attacks	11 hours	20 MB
					test5.csv	12 attacks	26 hours	48 MB
		Total	352 hours	471 MB	Total	50 attacks	112 hours	205 MB
HAI 20.07 (HAI1.0)	59 points/sec	train1.csv	86 hours	127 MB	test1.csv	28 attacks	81 hours	119 MB
		train2.csv	91 hours	98 MB	test2.csv	10 attacks	42 hours	62 MB
		Total	177 hours	225 MB	Total	38 attacks	123 hours	181 MB

Data fields

The time-series data in each CSV file satisfies time continuity. The first column represents the observed time in the “yyyy-MM-dd hh:mm:ss” format, while the rest of the columns provide the recorded SCADA data points. The last four columns provide data labels indicating whether an attack occurred. Of these four columns, the first (attack) applies to all processes, and the other three apply to the corresponding control processes.

Refer to the latest technical manual for the details of each column.
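
The time-continuity property can be checked with a short script. The following is a minimal sketch, not part of the dataset tooling: `find_time_gaps` is a hypothetical helper, the one-second sampling step is an assumption, and the timestamp format is a parameter because it may differ from the one described above.

```python
from datetime import datetime, timedelta


def find_time_gaps(timestamps, fmt="%Y-%m-%d %H:%M:%S", step=timedelta(seconds=1)):
    """Return (previous, current) pairs whose spacing differs from `step`.

    `fmt` and `step` are assumptions: adjust them to the file being checked.
    """
    parsed = [datetime.strptime(t, fmt) for t in timestamps]
    return [(a, b) for a, b in zip(parsed, parsed[1:]) if b - a != step]


# Illustrative timestamps only (not taken from the dataset): one second is missing.
sample = ["2019-09-26 13:00:00", "2019-09-26 13:00:01", "2019-09-26 13:00:03"]
gaps = find_time_gaps(sample)
print(gaps)
```

An empty result means that every consecutive pair of rows is exactly one step apart.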

Starting with HAI 22.04, the attack labels for each process (attack_P1, attack_P2, attack_P3) have been excluded, because they can be replaced by the attack targets (controllers and points) provided for each dataset version.

time	P1_B2004	P2_B2016	...	attack	attack_P1	...	attack_P3
20190926 13:00:00	0.09830	1.07370	...	0	0	...	0
20190926 13:00:01	0.09830	1.07410	...	1	0	...	1
20190926 13:00:02	0.09830	1.07380	...	1	0	...	1
20190926 13:00:03	0.09830	1.07360	...	1	1	...	1
20190926 13:00:04	0.09830	1.07430	...	1	1	...	1
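
The column layout described in this section can be handled with the Python standard library. The sketch below is illustrative only: it assumes comma-separated files, uses an in-memory sample built from the rows shown above with the elided columns left out, and the helper names `load_rows` and `attack_intervals` are hypothetical.

```python
import csv
import io

# In-memory stand-in for a HAI CSV file (elided "..." columns omitted).
SAMPLE = """time,P1_B2004,P2_B2016,attack,attack_P1,attack_P3
20190926 13:00:00,0.09830,1.07370,0,0,0
20190926 13:00:01,0.09830,1.07410,1,0,1
20190926 13:00:02,0.09830,1.07380,1,0,1
20190926 13:00:03,0.09830,1.07360,1,1,1
20190926 13:00:04,0.09830,1.07430,1,1,1
"""


def load_rows(handle):
    """Read all rows and split the header into time, sensor, and label columns."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames
    time_col = columns[0]  # the first column holds the observed time
    label_cols = [c for c in columns[1:] if c.lower().startswith("attack")]
    sensor_cols = [c for c in columns[1:] if c not in label_cols]
    return list(reader), time_col, sensor_cols, label_cols


def attack_intervals(rows, time_col, label="attack"):
    """Collapse per-row 0/1 labels into (first_time, last_time) intervals."""
    intervals, start, prev = [], None, None
    for row in rows:
        is_attack = int(float(row[label])) == 1
        if is_attack and start is None:
            start = row[time_col]
        elif not is_attack and start is not None:
            intervals.append((start, prev))
            start = None
        prev = row[time_col]
    if start is not None:  # the file ends while an attack is still active
        intervals.append((start, prev))
    return intervals


rows, time_col, sensor_cols, label_cols = load_rows(io.StringIO(SAMPLE))
print(sensor_cols, label_cols)
print(attack_intervals(rows, time_col, "attack"))
print(attack_intervals(rows, time_col, "attack_P1"))
```

To read a real file, pass an open file handle instead of the `io.StringIO` object. For versions without per-process labels, `label_cols` simply contains the single `attack` column.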

Getting the dataset

Type git clone, and then paste the URL below.

$ git clone https://github.com/icsdataset/hai

To unzip multiple gzip files, you can use:

$ gunzip *.gz
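
On systems without gunzip, the same step can be done from Python. This is a minimal sketch using only the standard library. `gunzip_all` is a hypothetical helper and the directory name in the usage comment is a placeholder. Unlike gunzip, it keeps the original .gz archives.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(directory):
    """Decompress every *.gz file in `directory`, writing each output next to its archive."""
    outputs = []
    for archive in sorted(Path(directory).glob("*.gz")):
        target = archive.with_suffix("")  # e.g. "train1.csv.gz" -> "train1.csv"
        with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream the data instead of loading it all
        outputs.append(target)
    return outputs


# Usage (placeholder path; point it at the folder that holds the .gz files):
# gunzip_all("path/to/hai/files")
```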

Performance Metric

We strongly recommend the eTaPR (Enhanced Time-series Aware Precision and Recall) metric for evaluating your anomaly detection model, because it allows fair performance comparisons with other studies. Got something to suggest? Let us know!

Projects using the dataset

Here are some projects and experiments that are using or featuring the dataset in interesting ways. Got something to add? Let us know!

The related projects so far are as follows.

Anomaly Detection

Year 2022

Year 2021

Year 2020

Testbed/Dataset

Year 2021

Probabilistic Attack Sequence Generation and Execution Based on MITRE ATT&CK for ICS Datasets

Year 2020

Competitions

Since 2020, we have held two AI competitions using the HAI dataset. The competition websites share the baseline code and the winners' code.

HAICon 2020 (HAI 21.03): https://dacon.io/competitions/official/235624/overview/description
HAICon 2021 (HAI 22.04): https://dacon.io/en/competitions/official/235757/overview/description

Contributors

Hyeok-Ki Shin, Woomyo Lee, Jeong-Han Yun, and Byung-Gil Min from the Affiliated Institute of ETRI, Daejeon, South Korea.

License

This work is licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).

Citation

If you publish work that uses the HAI datasets, the HAICon competitions, or eTaPR, please cite the sources below:

HAI 22.04

@misc{github,
    author = {Hyeok-Ki Shin and Woomyo Lee and Jeong-Han Yun and Byung-Gil Min},
    title = {HAI 22.04},
    year = {2022},
    url = {https://github.com/icsdataset/hai}
}

HAI 21.03, HAICon 2020

@inproceedings{10.1145/3474718.3474719,
    author = {Shin, Hyeok-Ki and Lee, Woomyo and Yun, Jeong-Han and Min, Byung-Gil},
    title = {Two ICS Security Datasets and Anomaly Detection Contest on the HIL-Based Augmented ICS Testbed},
    year = {2021},
    isbn = {9781450390651},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3474718.3474719},
    doi = {10.1145/3474718.3474719},
    abstract = {Security datasets with various operating characteristics and abnormal situations of industrial control system (ICS) are essential to develop artificial intelligence (AI)-based control system security technology. In this study, we built a hardware-in-the-loop (HIL)-based augmented ICS (HAI) testbed and developed ICS security datasets. Here, we introduce the second dataset (HAI 21.03), which was developed with the user feedback of the first released version (HAI 20.07). All HAI datasets are publicly available at https://github.com/icsdataset/hai. HAI 21.03 was expanded by adding data points and normal/attack scenarios to HAI 20.07. We also held an AI-based anomaly detection contest (HAICon 2020) utilizing the HAI datasets developed so far, giving many AI researchers an opportunity to discuss and share ideas for ICS anomaly detection research. This paper presents the results of the HAICon 2020. The results of the top teams in the competition can be used as a performance comparison criterion when using HAI 21.03. },
    booktitle = {Cyber Security Experimentation and Test Workshop},
    pages = {36–40},
    numpages = {5},
    keywords = {security dataset, testbed, artificial intelligence, hardware-in-the-loop, industrial control system, anomaly detection},
    location = {Virtual, CA, USA},
    series = {CSET '21}
}

HAI 20.07

@inbook{10.5555/3485754.3485755,
    author = {Shin, Hyeok-Ki and Lee, Woomyo and Yun, Jeong-Han and Kim, HyoungChun},
    title = {HAI 1.0: HIL-Based Augmented ICS Security Dataset},
    year = {2020},
    publisher = {USENIX Association},
    address = {USA},
    abstract = {Datasets are paramount to the development of AI-based technologies. However, the available cyber-physical system (CPS) datasets are insufficient. In this paper, we introduce the HIL-based augmented ICS security (HAI) dataset 1.0 (https://github.com/icsdataset/hai), the first CPS dataset collected using the HAI testbed. The HAI testbed comprises three physical control systems, namely GE turbine, Emerson boiler, and FESTO water treatment systems, combined through a dSPACE hardware-in-the-loop (HIL) simulator. We built an environment to remotely and automatically manipulate all components of a feedback control loop. Using this environment, we collected the HAI dataset 1.0 while repeatedly running a large number of benign and malicious scenarios for a long period with minimal human effort. We will continue to improve the HAI testbed and release new versions of the HAI dataset.},
    booktitle = {Proceedings of the 13th USENIX Conference on Cyber Security Experimentation and Test},
    articleno = {1},
    numpages = {1}
}

eTaPR

@inproceedings{10.1145/3477314.3507024,
    author = {Hwang, Won-Seok and Yun, Jeong-Han and Kim, Jonguk and Min, Byung Gil},
    title = {"Do You Know Existing Accuracy Metrics Overrate Time-Series Anomaly Detections?"},
    year = {2022},
    isbn = {9781450387132},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3477314.3507024},
    doi = {10.1145/3477314.3507024},
    booktitle = {Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing},
    pages = {403–412},
    numpages = {10},
    location = {Virtual Event},
    series = {SAC '22}
}

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property value

name HIL-based Augmented ICS Security Dataset

alternateName HAI Security Dataset

alternateName hai security dataset

url https://github.com/icsdataset/hai

sameAs https://github.com/icsdataset/hai

description The HAI security dataset was collected from a realistic Industrial Control System (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation.

provider

property	value
name	`The Affiliated Institute of ETRI, South Korea`
sameAs	`https://github.com/icsdataset`

license

property	value
name	`CC BY 4.0`
url	`https://creativecommons.org/licenses/by/4.0/`

citation https://dl.acm.org/doi/abs/10.1145/3474718.3474719 https://dl.acm.org/doi/abs/10.5555/3485754.3485755 https://dl.acm.org/doi/10.1145/3357384.3358118
