abhishek9sharma / security-patches-dataset

A ground-truth dataset of security patches

Building a ground truth dataset of real security patches for machine learning and testing activities.

Release 1: Code here; Paper here

Datasets:

Installation

Requirements installation:

virtualenv --python=python3.8 venv
source venv/bin/activate
pip install -r requirements.txt

Data Analysis

Jupyter notebooks for analyzing the dataset are available. Charts and word clouds are saved at notebook/charts.

source venv/bin/activate
cd notebooks
jupyter notebook

CVE Details Scraper

Scrape https://www.cvedetails.com/ for CVEs from 1999:

source venv/bin/activate
python3 cve-details/scraper.py --mode year -folder data/cve-details/year/ -year 1999
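To scrape several years, the command above can be repeated once per year. The following is a minimal sketch, assuming the scraper accepts other years the same way it accepts 1999 and is run from the repository root. `build_scraper_command` and `scrape_years` are hypothetical helpers for illustration and are not part of the repository.

```python
import subprocess


def build_scraper_command(year, folder="data/cve-details/year/"):
    """Return the scraper invocation from this README for the given year."""
    return [
        "python3", "cve-details/scraper.py",
        "--mode", "year",
        "-folder", folder,
        "-year", str(year),
    ]


def scrape_years(first_year, last_year, dry_run=True):
    """Build one scraper command per year; run them only if dry_run is False."""
    commands = [
        build_scraper_command(year)
        for year in range(first_year, last_year + 1)
    ]
    if not dry_run:
        for command in commands:
            # check=True stops the loop as soon as one year fails.
            subprocess.run(command, check=True)
    return commands
```

Calling `scrape_years(1999, 2001)` only returns the three command lists. Pass `dry_run=False` to actually execute them.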

Generate the dataset of patches available on GitHub:

source venv/bin/activate
source github_dataset_generator.sh

Dataset

Merge all datasets:

source venv/bin/activate
python3 scripts/merge_datasets.py --mode merge -file positive.csv

Get metadata from CVE Details to complete the data collected from the other datasets:

source venv/bin/activate
python3 scripts/get_metadata.py --source cvedetails -file positive.csv

Configure access to the GitHub API in scripts/config/:

source venv/bin/activate
cd scripts/config/
cp github_template.json github.json

Add your GitHub username and a token to github.json. The token needs permission to read repository and user information.
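Before running the GitHub step, it can help to check that the configuration file is filled in. The following is a minimal sketch, assuming github.json is a flat JSON object with the keys `username` and `token`. These key names are a guess, so check github_template.json for the real ones. `load_github_config` is a hypothetical helper, not part of the repository.

```python
import json


def load_github_config(path="scripts/config/github.json"):
    """Load the GitHub config and fail early if a credential is empty."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    # Assumed key names; adjust them to match github_template.json.
    missing = [key for key in ("username", "token") if not config.get(key)]
    if missing:
        raise ValueError(f"{path} has no value for: {', '.join(missing)}")
    return config
```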

Get GitHub data:

source venv/bin/activate
python3 scripts/get_metadata.py --source github -file positive.csv

Add features

Add the extensions of the files changed by each patch (ext_files):

source venv/bin/activate
python3 scripts/add_features.py --feature ext_files -file positive.csv
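As an illustration of what an extension feature can look like, the sketch below derives the set of extensions from a list of changed file paths. It is an assumption about the general idea, not the code in scripts/add_features.py. `file_extensions` is a hypothetical name.

```python
import os


def file_extensions(paths):
    """Return the sorted, de-duplicated extensions of the given file paths."""
    extensions = set()
    for path in paths:
        _, ext = os.path.splitext(path)
        if ext:  # files such as "Makefile" have no extension
            extensions.add(ext.lower().lstrip("."))
    return sorted(extensions)
```

For example, `file_extensions(["src/main.c", "src/main.h", "Makefile"])` returns `["c", "h"]`.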

Add the programming languages (lang):

source venv/bin/activate
python3 scripts/add_features.py --feature lang -file positive.csv
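One common way to derive a language feature is to map file extensions to language names. The sketch below shows that idea with a small, illustrative table. Both the table and the `languages` helper are assumptions, and the mapping used by scripts/add_features.py may differ.

```python
import os

# Illustrative subset only; not the mapping used by the repository.
EXTENSION_TO_LANGUAGE = {
    ".c": "C",
    ".h": "C",
    ".java": "Java",
    ".js": "JavaScript",
    ".php": "PHP",
    ".py": "Python",
    ".rb": "Ruby",
    ".scala": "Scala",
}


def languages(paths):
    """Return the sorted set of languages inferred from file extensions."""
    found = set()
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        language = EXTENSION_TO_LANGUAGE.get(ext)
        if language is not None:  # unknown extensions are ignored
            found.add(language)
    return sorted(found)
```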

Add Code Changes

Add the code changes (diff information) of each patch to the dataset:

source venv/bin/activate
python3 scripts/get_code_changes.py -fin dataset/positive.csv -fout dataset/data.csv
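To give an idea of the kind of information a diff carries, the sketch below counts added and removed lines in a unified diff. It is an illustration under the assumption that patches are available as unified diff text. It does not describe what scripts/get_code_changes.py extracts, and `diff_stats` is a hypothetical helper.

```python
import re

# Matches hunk headers such as "@@ -1,3 +1,4 @@"; a missing count means 1.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def diff_stats(diff_text):
    """Count added and removed lines inside the hunks of a unified diff."""
    added = removed = 0
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a hunk: skip file headers until the next hunk header.
            match = HUNK_HEADER.match(line)
            if match:
                old_left = int(match.group(1) or 1)
                new_left = int(match.group(2) or 1)
            continue
        if line.startswith("+"):
            added += 1
            new_left -= 1
        elif line.startswith("-"):
            removed += 1
            old_left -= 1
        elif line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        else:
            # Context line: present in both the old and the new file.
            old_left -= 1
            new_left -= 1
    return {"added": added, "removed": removed}
```

Using the line counts from each hunk header, rather than only the leading `+` or `-`, keeps the `---` and `+++` file headers from being counted as changes.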

Download codebases

To download Scala samples:

source venv/bin/activate
python3 scripts/download.py -file dataset/positive.csv -folder code_samples -language Scala

License: MIT License

