This is the replication package for the submitted paper. It contains the Python code of our simulation, the dataset containing all repository pairs, and the code used for analysis.
This work reuses the classification method of ReDeBug. We extend their classification method to identify not only missed security patches, but also missed, duplicated and split cases in any type of patch, for the programming languages Java, Python, PHP, Perl, C, shell script and Ruby.
We also look deeper into each patch and classify every file and every hunk in the .diff for that file.
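As an illustration of the granularity involved, the sketch below splits a unified diff into its hunks using the standard `@@` header format. This is a minimal example, not the tool's own parser:

```python
import re

def split_hunks(diff_text):
    """Split a unified diff for one file into its hunks.
    A hunk starts with a header such as '@@ -12,5 +12,6 @@'."""
    hunks = []
    current = None
    for line in diff_text.splitlines():
        if re.match(r"^@@ ", line):
            current = {"header": line, "lines": []}
            hunks.append(current)
        elif current is not None:
            current["lines"].append(line)
    return hunks
```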
Clone the repository to a location of your choice using git:
```
git clone https://github.com/patchesandmissedmatches/patchesandmissedmatches.git
```
Files and folders
- `Methods` contains the helper files used in the main files. There are 9 helper files; a hypothetical sketch of how the main ones fit together follows the list of folders below.
  - `patchExtractionFunctions.py` - Functions for extracting patches from GitHub.
  - `analysis.py` - Used for plotting the classification results.
  - `classifier.py` - Contains helper functions used when classifying.
  - `commitLoader.py` - Helper functions used on/for commit data.
  - `common.py` - Re-used from the original implementation of ReDeBug; also used during classification.
  - `dataLoader.py` - Helper file to retrieve all PR data.
  - `totals.py` - Classifies every patch based on the classifications of the files in the patch and calculates totals for each repository pair.
  - `patchLoader.py` - Re-used from the original implementation of ReDeBug. It loads a patch, processes it and makes it ready for classification.
  - `sourceLoader.py` - Re-used from the original implementation of ReDeBug. It loads a file, processes it and makes it ready for classification.
- `Empty_files` contains 7 empty files, one for each of the 7 programming languages the tool can process. These are used whenever a diff file is needed for an added or removed file in a patch.
- `Repos_prs` will contain the retrieved pull request data.
- `Repos_files` will contain the retrieved files from the pull requests per repository pair.
- `Repos_results` will contain the classification results for the pull requests per repository pair.
- `Repos_totals` will contain the totals data per repository pair.
- `src` contains the files that are at the heart of the tool for classification.
- `tokens.txt` contains the GitHub API tokens, separated by only a comma (,).
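To convey how the classification modules in `Methods` fit together, here is a hypothetical sketch. The function names `load_patch`, `load_source` and `classify_hunk` are illustrative stand-ins, not the actual interfaces of `patchLoader.py`, `sourceLoader.py` and `classifier.py`:

```python
# Hypothetical sketch of the classification flow; the real interfaces
# of the modules in Methods may differ.
from patchLoader import load_patch        # illustrative name
from sourceLoader import load_source      # illustrative name
from classifier import classify_hunk      # illustrative name

def classify_file(diff_path, target_file_path):
    hunks = load_patch(diff_path)            # hunks of the source patch
    target = load_source(target_file_path)   # processed target file
    # Each hunk is classified against the target file; file- and
    # patch-level classes are then aggregated (see totals.py).
    return [classify_hunk(hunk, target) for hunk in hunks]
```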
Create the Repos_prs, Repos_files, Repos_results and Repos_totals directories before running the tool.
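For example:

```
mkdir Repos_prs Repos_files Repos_results Repos_totals
```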
The required libraries are: `re`, `pickle`, `argparse`, `requests`, `json`, `numpy`, `matplotlib`, `bitarray` and `python-magic`. `re`, `pickle`, `argparse` and `json` ship with the Python standard library; install the remaining libraries with pip:

```
pip install requests numpy matplotlib bitarray python-magic
```

`python-magic` additionally requires the `libmagic` system package: `apt-get install libmagic-dev` on Ubuntu/Debian, `brew install libmagic` on macOS.
There are 3 notebooks at the heart of this tool: `getData.ipynb`, `classify.ipynb` and `analysis.ipynb`. `getData.ipynb` extracts the pull request data from the GitHub API and stores it in `Repos_prs`. `classify.ipynb` then extracts the files from each pull request and the diffs for each modified/added/removed file, and classifies the hunks and files. Finally, `analysis.ipynb` performs the last classification step for each patch, calculates the total classification per repository and plots the results. `timeLag.ipynb` calculates the technical lag for each patch.
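As a rough illustration of the retrieval step in `getData.ipynb`, the sketch below fetches pull request data and pickles it into `Repos_prs`. It is a minimal sketch assuming the standard GitHub REST endpoint for pull requests; the output file name is illustrative:

```python
import pickle
import requests

def fetch_pull_requests(repo, token, out_dir="Repos_prs"):
    """Fetch one page of pull requests for a repository and pickle it.
    Illustrative only; a real retrieval also needs pagination,
    rate-limit checks and token rotation."""
    url = f"https://api.github.com/repos/{repo}/pulls"
    headers = {"Authorization": f"token {token}"}
    params = {"state": "all", "per_page": 100}
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    prs = response.json()
    with open(f"{out_dir}/{repo.replace('/', '_')}_prs.pkl", "wb") as f:
        pickle.dump(prs, f)
    return prs
```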
The folder `Examples` contains a Jupyter notebook that can be used to quickly run the tool and classify one or more pull requests for two variant repositories. A simple class `PaReco` does the classification. Running the notebook for the source and target variants mrbrdo/rack-webconsole -> codegram/rack-webconsole and pull request 2 gives output such as:
mrbrdo/rack-webconsole -> codegram/rack-webconsole
Pull request nr ==> 2
File classifications ==>
lib/rack/webconsole/assets.rb
Operation - MODIFIED
Class - ED
Patch classification ==> ED
Additionally, the classification distribution is plotted.
To use this notebook for classification, you need the source and target repository names, the list of pull requests and the cut-off date.
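A hypothetical invocation could look like the following; the actual constructor and method names of `PaReco` may differ, so check the notebook:

```python
# Hypothetical usage of the PaReco class, assuming the notebook cell
# defining it has been run; the real constructor/method names may differ.
source = "mrbrdo/rack-webconsole"
target = "codegram/rack-webconsole"
pull_requests = [2]
cut_off_date = "2021-01-01"  # illustrative date

pareco = PaReco(source, target, pull_requests, cut_off_date)
pareco.classify()  # prints the per-file and per-patch classifications
```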
Create the Repos_files and Repos_results directories before running the examples.
To access private repositories and to have a higher rate limit, GitHub tokens are needed. They can be set in the `tokens.txt` file or inserted directly into the `token_list` in the notebooks. GitHub tokens are a must to run the code because of the high number of requests made to the GitHub API. Every token in the `tokens.txt` file is separated by only a comma. The user can add as many tokens as needed; a minimum of 5 tokens is recommended to safely execute the code and make sure the rate limit is not reached for any single token.
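For example, a `tokens.txt` with two placeholder tokens looks like:

```
ghp_exampleToken1,ghp_exampleToken2
```

Reading it into a `token_list` in Python is then a one-line split (a minimal sketch):

```python
# Read the comma-separated GitHub tokens from tokens.txt.
with open("tokens.txt") as f:
    token_list = [t.strip() for t in f.read().split(",") if t.strip()]
```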
Using `sorted_all_dataset.pkl` with the notebooks in the `src` folder, the results presented in the paper can be reproduced.
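Loading the dataset is a standard pickle read (a minimal sketch; inspect the loaded object for its exact structure):

```python
import pickle

# Load the dataset of repository pairs shipped with the package.
with open("sorted_all_dataset.pkl", "rb") as f:
    dataset = pickle.load(f)

print(type(dataset))  # inspect the loaded object before use
```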