Building a ground truth dataset of real security patches for machine learning and testing activities.
Release 1: Code here; Paper here
Datasets:
- CVEDetails - Includes data from CVEs from 1999 to 2021 (6486 patches) -- last update: 28-08-2021.
- SecBench - Dataset with 687 patches for different programming languages.
- BigVul
- SAP
- OSV - Project maintained by Google. It integrates vulnerabilities from different ecosystems: Go, Rust, PyPI, DWF, OSS-Fuzz.
- CrossVul
- CVEFixes
Requirements installation:
virtualenv --python=python3.8 venv
source venv/bin/activate
pip install -r requirements.txt
Jupyter notebooks are available. Charts and wordclouds are saved at notebook/charts.
source venv/bin/activate
cd notebooks
jupyter notebook
Scrape https://www.cvedetails.com/ for CVEs from 1999:
source venv/bin/activate
python3 cve-details/scraper.py --mode year -folder data/cve-details/year/ -year 1999
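The command above covers a single year. A minimal sketch, assuming the scraper is invoked once per year with exactly the flags shown, of how the commands for the whole 1999 to 2021 range could be generated. The function name `scraper_commands` is hypothetical and not part of the repository.

```python
def scraper_commands(first_year=1999, last_year=2021,
                     folder="data/cve-details/year/"):
    """Build one scraper invocation per year (both bounds inclusive).

    The flags are copied from the single-year command; nothing is executed here.
    """
    return [
        ["python3", "cve-details/scraper.py",
         "--mode", "year", "-folder", folder, "-year", str(year)]
        for year in range(first_year, last_year + 1)
    ]


commands = scraper_commands()
```

Each entry can then be passed to `subprocess.run(command, check=True)` from the repository root with the virtual environment active.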
Generate the dataset of patches available on GitHub:
source venv/bin/activate
source github_dataset_generator.sh
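What the generator script does internally is not shown here. As an illustration only, one plausible way to recognise a patch hosted on GitHub is by the shape of its reference URL. The sketch below assumes references of the form `https://github.com/<owner>/<repo>/commit/<sha>`; the names `parse_commit_url` and `github_patches` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed URL shape: https://github.com/<owner>/<repo>/commit/<sha>
COMMIT_URL = re.compile(
    r"^https?://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)"
    r"/commit/(?P<sha>[0-9a-fA-F]{7,40})"
)


class CommitRef(NamedTuple):
    owner: str
    repo: str
    sha: str


def parse_commit_url(url: str) -> Optional[CommitRef]:
    """Return owner, repository and hash of a GitHub commit URL, or None."""
    match = COMMIT_URL.match(url.strip())
    if match is None:
        return None
    return CommitRef(match["owner"], match["repo"], match["sha"].lower())


def github_patches(references):
    """Keep only the references that point at a GitHub commit."""
    parsed = (parse_commit_url(ref) for ref in references)
    return [ref for ref in parsed if ref is not None]
```

Issue, pull request and non-GitHub links are dropped because they do not match the commit pattern.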
Merge all datasets:
source venv/bin/activate
python3 scripts/merge_datasets.py --mode merge -file positive.csv
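The merge logic of `scripts/merge_datasets.py` is not described here. The following is a rough sketch of one way such a merge could work, assuming every source dataset identifies a patch by a commit URL. The column names (`commit_url`, `cve_id`, `language`) and the sample rows are invented for illustration.

```python
def merge_datasets(datasets, key="commit_url"):
    """Merge rows from several datasets, one output row per distinct key.

    datasets maps a source name to a list of row dictionaries. Rows that
    share a key (compared case-insensitively, ignoring a trailing slash)
    are combined: the first non-empty value seen for a field wins.
    """
    merged = {}
    for source, rows in datasets.items():
        for row in rows:
            identity = row[key].strip().rstrip("/").lower()
            entry = merged.setdefault(identity, {"sources": []})
            if source not in entry["sources"]:
                entry["sources"].append(source)
            for field, value in row.items():
                if value and not entry.get(field):
                    entry[field] = value
    return list(merged.values())


# Invented sample rows, not real dataset entries.
secbench = [{"commit_url": "https://github.com/o/r/commit/abc1234",
             "cve_id": "CVE-2020-0001"}]
osv = [{"commit_url": "https://github.com/o/r/commit/ABC1234/",
        "cve_id": "", "language": "Scala"},
       {"commit_url": "https://github.com/o/r/commit/def5678",
        "cve_id": "CVE-2021-0002"}]
merged = merge_datasets({"secbench": secbench, "osv": osv})
```

Recording the contributing sources per row makes it possible to see afterwards how much the datasets overlap.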
Get metadata from CVE Details to complete the data from the other datasets:
source venv/bin/activate
python3 scripts/get_metadata.py --source cvedetails -file positive.csv
Configure the GitHub API at scripts/config/.
source venv/bin/activate
cd scripts/config/
cp github_template.json github.json
Add your GitHub username and a token with permissions for repository and user information to the file.
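Before running the GitHub step, it may help to confirm that the copied file has been filled in. A small sketch, assuming the template uses the key names `username` and `token` (an assumption: check github_template.json for the actual names). The helper `load_github_config` is hypothetical.

```python
import json
from pathlib import Path

# Assumed key names; adjust to match github_template.json.
REQUIRED_KEYS = ("username", "token")


def load_github_config(path="scripts/config/github.json"):
    """Load the GitHub config and fail early if a required field is empty."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS
               if not str(config.get(key, "")).strip()]
    if missing:
        raise ValueError(
            f"{path}: missing or empty field(s): {', '.join(missing)}"
        )
    return config
```

Failing early avoids starting a long metadata download that would stop at the first unauthenticated request.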
Get GitHub data:
source venv/bin/activate
python3 scripts/get_metadata.py --source github -file positive.csv
Add the extensions of the files involved in the changes (ext_files):
source venv/bin/activate
python3 scripts/add_features.py --feature ext_files -file positive.csv
Add programming languages (lang):
source venv/bin/activate
python3 scripts/add_features.py --feature lang -file positive.csv
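To make the two features concrete, here is a minimal sketch of how extensions and languages could be derived from the list of files changed by a patch. It is not the code of `scripts/add_features.py`; the extension-to-language table is a small illustrative subset and the function names are hypothetical.

```python
from pathlib import PurePosixPath

# Illustrative subset only; a real mapping would cover many more extensions.
EXTENSION_LANGUAGE = {
    ".c": "C", ".h": "C", ".py": "Python", ".java": "Java",
    ".scala": "Scala", ".js": "JavaScript", ".go": "Go", ".rs": "Rust",
}


def file_extensions(paths):
    """Count the lower-cased extensions of the changed files."""
    counts = {}
    for path in paths:
        ext = PurePosixPath(path).suffix.lower()
        if ext:  # files such as Makefile have no extension
            counts[ext] = counts.get(ext, 0) + 1
    return counts


def languages(ext_counts):
    """Map known extensions to a sorted list of language names."""
    return sorted({EXTENSION_LANGUAGE[ext]
                   for ext in ext_counts if ext in EXTENSION_LANGUAGE})


changed = ["src/main.c", "include/main.h", "src/util.C", "Makefile", "docs/README.md"]
ext_files = file_extensions(changed)
lang = languages(ext_files)
```

Deriving `lang` from `ext_files` keeps the two columns consistent, at the cost of ignoring files whose extension is not in the table.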
Add code changes/diff information to the dataset:
source venv/bin/activate
python3 scripts/get_code_changes.py -fin dataset/positive.csv -fout dataset/data.csv
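Which diff fields end up in dataset/data.csv is not listed here. As one example of the kind of information a diff can yield, the sketch below counts added and removed lines in a patch. It assumes git-format patches, where each file section starts with a `diff --git` line; the function name `diff_stats` is hypothetical.

```python
def diff_stats(patch):
    """Count added and removed lines in a git-format unified diff.

    Only lines inside hunks (after an '@@' header) are counted, so the
    '---' and '+++' file headers are not mistaken for changes.
    """
    added = removed = 0
    in_hunk = False
    for line in patch.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # a new file section begins
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            added += 1
        elif in_hunk and line.startswith("-"):
            removed += 1
    return {"added": added, "removed": removed}
```

Tracking whether the parser is inside a hunk is slightly more robust than skipping every line that starts with `---` or `+++`, since a removed line whose own text begins with `--` would otherwise be dropped.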
To download Scala samples:
source venv/bin/activate
python3 scripts/download.py -file dataset/positive.csv -folder code_samples -language Scala