VDiscover is a tool designed to train a vulnerability detection predictor. Given a vulnerability discovery procedure and a large enough number of training testcases, it extracts lightweight features to predict which testcases are potentially vulnerable. This repository contains an improved version of a proof-of-concept used to show experimental results in our technical report (available here).
VDiscover aims to be used when there is a large amount of testcases to analyze using a costly vulnerability detection procedure. It can be trained to provide a quick prioritization of testcases. The extraction of features to perform a prediction is designed to be scalable. Nevertheless, this implementation is not particularly optimized so it should easy to improve the performance of it.
- Python 2.7
- binutils
- python-ptrace
- scikit-learn for training/testing
- matplotlib for visualization (very experimental, optional)
- keras for convolutional clustering (very experimental, optional)
Before starting, it is recommended to manually install binutils, scikit-learn and setuptools (to perform a local installation). Then we can execute:
git clone https://github.com/CIFASIS/VDiscover.git
cd VDiscover
python setup.py install --user
By default, the local installation of the command line utilities of VDiscover is performed inside ~/.local/bin, so it is recommended to add this directory into the PATH variable. Our tool is composed by two main components:
- fextractor: to extract dynamic and static features from test cases.
- vpredictor: to train a new vulnerability prediction model or predict using a previously trained one. It can be used to cluster and visualize a set of test cases.
Some examples of testcases of very popular programs (grep, gzip, bc, ..) can be found in examples/testcases. For example, to extract raw dynamic features from an execution of bc:
fextractor --dynamic bc
And the resulted extracted features are:
/usr/bin/bc isatty:0=Num32B0 isatty:0=Num32B8 setvbuf:0=Ptr32 setvbuf:1=NPtr32 setvbuf:2=Num32B8 setvbuf:3=Num32B0 ...
This raw data can be used to train a new vulnerability prediction model or predict using a previously trained one. Additionally, more detailed (but outdated) documentation is available here.