Real-CyberSecurity-Datasets

Public datasets to help you tackle various cyber security problems using Machine Learning or other means.

Happy Learning!!!

AB-TRAP Framework for Dataset Generation
HIKARI-2021 Datasets
The ADFA Intrusion Detection Datasets
Botnet and Ransomware Detection Datasets
Malicious URLs Dataset
Cloud Security Datasets
Dynamic Malware Analysis Kernel and User-Level Calls
ARCS Data Sets
Stratosphereips Datasets
Windows Malware Dataset with PE API Calls
KAGGLE
Cloudtrail
MAWILab
EMBER
Industrial Control System (ICS) Cyber Attack Datasets
Canadian Institute for Cybersecurity
Publicly available PCAP files
Shadowbrokers EternalBlue/EternalRomance PCAP Dataset
AZSecure Data
Secrepo

↑ AB-TRAP Framework for Dataset Generation

AB-TRAP is a five-step framework consisting of (i) the generation of the attack dataset, (ii) the bonafide dataset, (iii) training of machine learning models, (iv) realization of the models, and (v) the performance evaluation of the realized model after deployment.

This repository contains examples for both Local Area Network (LAN) and Internet environments, taking advantage of virtualization (virtual machines and containers) to support the dataset generation.

https://github.com/c2dc/AB-TRAP/

↑ HIKARI-2021 Datasets

The HIKARI-2021 datasets contain encrypted synthetic attacks and benign traffic.

https://zenodo.org/record/5199540

↑ The ADFA Intrusion Detection Datasets

The ADFA IDS datasets consist of the following individual IDS datasets:

Network and Linux host IDS datasets: ADFA-LD-dataset, netflow-IDS-dataset, and NGIDS-DS IDS Dataset.
Windows-based IDS dataset: ADFA-WD.

https://ojs.unsw.adfa.edu.au/xfiles/pdf/ADFA-IDS-Database%20License-homepage.pdf

The PDF document above contains the two (2) links for downloading these datasets (2017).

↑ Botnet and Ransomware Detection Datasets

The ISOT Botnet dataset is the combination of several existing publicly available malicious and non-malicious datasets.

https://www.uvic.ca/engineering/ece/isot/datasets/botnet-ransomware/index.php

↑ Malicious URLs Dataset

The long-term goal of this research is to construct a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, we have explored techniques that involve classifying URLs based on their lexical and host-based features, as well as online learning to process large numbers of examples and adapt quickly to evolving URLs over time.

http://www.sysnet.ucsd.edu/projects/url/#datasets
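As a rough illustration of what "lexical features" of a URL can look like, here is a minimal Python sketch. The feature names and the choice of features are assumptions made for this example; they are not the feature set used by the project above, and host-based features (which need DNS or WHOIS lookups) are left out.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few simple lexical features from a URL string.

    The feature set is illustrative only (an assumption of this sketch).
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Shannon entropy (in bits) of the character distribution of the URL.
    counts = Counter(url)
    total = len(url)
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
        "has_at_symbol": "@" in url,
        "char_entropy": entropy,
    }


if __name__ == "__main__":
    for name, value in lexical_features("http://example.com/a/b?id=42").items():
        print(name, value)
```

A dictionary like this can be turned into a numeric vector and fed to any classifier; an online learner would update its model one URL at a time instead of retraining on the full set.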

↑ Cloud Security Datasets

The ISOT Cloud IDS (ISOT CID) dataset consists of over 8Tb of data collected in a real cloud environment and includes network traffic at VM and hypervisor levels, system logs, performance data (e.g. CPU utilization), and system calls.

"The dataset cannot be downloaded directly. Instead you need first to fill an agreement about how the data will be used;"

https://www.uvic.ca/engineering/ece/isot/datasets/cloud-security/index.php

↑ Dynamic Malware Analysis Kernel and User-Level Calls

This dataset contains the data collected from Cuckoo and our own kernel driver after running 1000 malicious and 1000 clean samples.

https://zenodo.org/record/1203289#.YFhIS-axWoh

↑ ARCS Data Sets

Unified Host and Network Data Set: it is a subset of network and computer (host) events collected from the Los Alamos National Laboratory enterprise network over the course of approximately 90 days.
Comprehensive, Multi-Source Cyber-Security Events: this data set represents 58 consecutive days of de-identified event data collected from five sources within Los Alamos National Laboratory’s corporate, internal computer network.
User-Computer Authentication Associations in Time: This anonymized data set encompasses 9 continuous months and represents 708,304,516 successful authentication events from users to computers collected from the Los Alamos National Laboratory (LANL) enterprise network.

https://csr.lanl.gov/data/

↑ Stratosphereips Datasets

The Stratosphere IPS feeds itself with models created from real malware traffic captures. By using and studying how malware behaves in reality, we ensure the models we create are accurate and our measurements of performance are real.

https://www.stratosphereips.org/datasets-overview

The CTU-13 Dataset. A Labeled Dataset with Botnet, Normal and Background traffic.
Malware Capture Facility Project.
Malware on IoT Dataset.
Aposemat IoT-23 (A labeled dataset with malicious and benign IoT network traffic).
The Android Mischief Dataset.

↑ Windows Malware Dataset with PE API Calls

A public malware dataset generated with Cuckoo Sandbox from analysis of Windows OS API calls. It is provided in CSV format for cyber security researchers doing malware analysis and for machine learning applications.

https://github.com/ocatak/malware_api_class
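One common way to turn an API call trace into machine learning features is to count n-grams of consecutive calls. The Python sketch below assumes each sample has already been loaded as an ordered list of API call names; the repository's actual file layout is not described here, and the function name `api_call_ngrams` and the sample trace are hypothetical.

```python
from collections import Counter
from typing import List


def api_call_ngrams(calls: List[str], n: int = 2) -> Counter:
    """Count n-grams of consecutive API calls in one sample's trace."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A trace shorter than n yields an empty Counter.
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


if __name__ == "__main__":
    # Illustrative trace, not taken from the dataset.
    trace = [
        "LoadLibraryA", "GetProcAddress",
        "LoadLibraryA", "GetProcAddress",
        "VirtualAlloc",
    ]
    for ngram, count in api_call_ngrams(trace, n=2).most_common():
        print(" -> ".join(ngram), count)
```

Counting n-grams keeps some of the ordering information that a plain bag of API names would lose, while still producing a fixed vocabulary that standard classifiers can consume.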

↑ KAGGLE

Various datasets provided by Kaggle (Explore, analyze, and share quality data. Learn more about data types, creating, and collaborating).

https://www.kaggle.com/datasets

e.g. https://www.kaggle.com/c/malware-classification/overview (Microsoft Malware Classification Challenge (BIG 2015))

↑ Cloudtrail

Public dataset of Cloudtrail logs from flaws.cloud.

https://summitroute.com/blog/2020/10/09/public_dataset_of_cloudtrail_logs_from_flaws_cloud/

Dataset (logs data): http://summitroute.com/downloads/flaws_cloudtrail_logs.tar
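CloudTrail log files are JSON documents with a top-level `Records` list, where each record carries fields such as `eventName`, `eventSource`, and `sourceIPAddress`. The Python sketch below counts events by name in one such document. How the files are packed inside the tar archive is not covered here, so unpacking is left out, and the sample records are made up for illustration.

```python
import json
from collections import Counter


def count_event_names(log_text: str) -> Counter:
    """Count records per eventName in one CloudTrail JSON document."""
    document = json.loads(log_text)
    records = document.get("Records", [])
    return Counter(record.get("eventName", "<missing>") for record in records)


# Made-up sample for illustration only; not taken from the flaws.cloud dataset.
SAMPLE_LOG = json.dumps({
    "Records": [
        {"eventName": "ListBuckets", "eventSource": "s3.amazonaws.com",
         "sourceIPAddress": "192.0.2.10"},
        {"eventName": "DescribeInstances", "eventSource": "ec2.amazonaws.com",
         "sourceIPAddress": "192.0.2.11"},
        {"eventName": "ListBuckets", "eventSource": "s3.amazonaws.com",
         "sourceIPAddress": "192.0.2.10"},
    ]
})

if __name__ == "__main__":
    for name, count in count_event_names(SAMPLE_LOG).most_common():
        print(name, count)
```

The same pattern extends to grouping by `sourceIPAddress` or `eventSource`, which is often a useful first look before any modelling.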

↑ MAWILab

MAWILab is a database that helps researchers evaluate their traffic anomaly detection methods. It consists of a set of labels locating traffic anomalies in the MAWI archive (samplepoints B and F). The labels are obtained using an advanced graph-based methodology that compares and combines different and independent anomaly detectors. The data set is updated daily to include new traffic from upcoming applications and anomalies.

http://www.fukuda-lab.org/mawilab/index.html

↑ EMBER

The EMBER dataset is a collection of features from PE files that serve as a benchmark dataset for researchers. The EMBER2017 dataset contained features from 1.1 million PE files scanned in or before 2017 and the EMBER2018 dataset contains features from 1 million PE files scanned in or before 2018. This repository makes it easy to reproducibly train the benchmark models, extend the provided feature set, or classify new PE files with the benchmark models.

https://github.com/elastic/ember

↑ Industrial Control System (ICS) Cyber Attack Datasets

It consists of the following four (4) datasets:

Dataset 1: Power System Datasets
Dataset 2: Gas Pipeline Datasets
Dataset 3: Gas Pipeline and Water Storage Tank
Dataset 4: New Gas Pipeline

https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets

↑ Canadian Institute for Cybersecurity

Canadian Institute for Cybersecurity datasets are used around the world by universities, private industry, and independent researchers.

https://www.unb.ca/cic/datasets/index.html

↑ Publicly available PCAP files

This is a list of public packet capture repositories, which are freely available on the Internet. Most of the sites listed below share Full Packet Capture (FPC) files, but some unfortunately have only truncated frames.

Cyber Defence Exercises (CDX)
Malware Traffic
Network Forensics
SCADA/ICS Network Captures
Capture the Flag Competitions (CTF)
Packet Injection Attacks / Man-on-the-Side Attacks
Uncategorized PCAP Repositories
Single PCAP files
Online PCAP Services

https://www.netresec.com/index.ashx?page=PcapFiles
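To check whether a capture holds full packets or truncated frames, you can compare the captured length with the original length in each record header. The Python sketch below does this for the classic pcap format only, using just the standard library; pcapng files are not handled, and the function name `count_truncated_frames` is made up for this example.

```python
import struct
from typing import BinaryIO, Tuple

# Magic numbers as they appear on disk (microsecond and nanosecond variants).
LITTLE_ENDIAN_MAGICS = (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1")
BIG_ENDIAN_MAGICS = (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d")


def count_truncated_frames(stream: BinaryIO) -> Tuple[int, int]:
    """Return (total_frames, truncated_frames) for a classic pcap stream."""
    global_header = stream.read(24)  # the classic pcap global header is 24 bytes
    if len(global_header) < 24:
        raise ValueError("not a pcap file: global header too short")

    magic = global_header[:4]
    if magic in LITTLE_ENDIAN_MAGICS:
        endian = "<"
    elif magic in BIG_ENDIAN_MAGICS:
        endian = ">"
    else:
        raise ValueError("unrecognized magic number (pcapng is not handled)")

    total = truncated = 0
    while True:
        record_header = stream.read(16)  # ts_sec, ts_frac, incl_len, orig_len
        if len(record_header) < 16:
            break
        _ts_sec, _ts_frac, incl_len, orig_len = struct.unpack(
            endian + "IIII", record_header
        )
        stream.read(incl_len)  # skip the captured bytes
        total += 1
        if incl_len < orig_len:
            truncated += 1
    return total, truncated
```

Usage is a matter of opening the file in binary mode, for example `with open("capture.pcap", "rb") as f: print(count_truncated_frames(f))`, where `capture.pcap` is a placeholder file name.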

↑ Shadowbrokers EternalBlue/EternalRomance PCAP Dataset

Collected by Eric Conrad. This dataset is comprised of PCAP data from the EternalBlue and EternalRomance malware. These PCAPs capture the actual exploits in action, on target systems that had not yet been patched to defeat the exploits. The EternalBlue PCAP data uses a Windows 7 target machine, whereas the EternalRomance PCAP data uses a Windows 2008r2 target machine. Also included is EternalBlue PCAP data for a patched Windows 7 target machine showing the failed exploit. This data was collected in April 2017.

https://dibbs.ai.arizona.edu/dibbs/shadowbrokers-eternalblue/ShadowbrokersEternalBlue.zip

↑ AZSecure Data

Data Science Testbed for Security Researchers.

This portal is available to the ISI community to support research. This service started by offering browsing access to downloadable forums from the Artificial Intelligence Lab's Dark Web and Geo Web collections, which presently includes nearly 40 million postings. Each forum collection contains millions of postings from hundreds of thousands of authors, and may be in English, Arabic, French, German, Indonesian, Pashto, Russian or Urdu, depending on the forum. The repository also includes a large collection of Internet phishing websites from the University of Virginia, with collections of Escrow, Financial, and Pharmacy sites. Recent additions to the repository include hacker forums in English and Russian, Chinese underground market forums, and chat logs that can be used in the study of underground behavior and how hackers learn from each other, the formation of social networks, relationships with the underground economy, and more. The Patriot, militia, hate and linked websites collection based off the Southern Poverty Law Center’s 2009 list can be used to study rhetoric and communication, group dynamics, extreme social movements, and other topics, in information and the social sciences.

All data sets can be downloaded freely for non-commercial education and research use.

https://www.azsecure-data.org/

↑ Secrepo

Finding samples of various types of security-related data can be a giant pain. This is my attempt to keep a somewhat curated list of security-related data I've found, created, or was pointed to. If you perform any kind of analysis with any of this data please let me know and I'd be happy to link it from here or host it here. Hopefully looking at others' research and analysis will inspire people to add on, improve, and create new ideas.

http://www.secrepo.com/

gfek / Real-CyberSecurity-Datasets