Dynamit: Dynamic Vulnerability Detection on Smart Contracts Using Machine Learning

https://dl.acm.org/doi/abs/10.1145/3463274.3463348

Description

Dynamit is a monitoring framework to detect reentrancy vulnerabilities in Ethereum smart contracts. The novelty of our framework is that it relies only on transaction metadata and balance data from the blockchain system; our approach requires no domain knowledge, code instrumentation, or special execution environment. Dynamit extracts features from transaction data and uses a machine learning model to classify transactions as benign or harmful. Therefore, not only can we find the contracts that are vulnerable to reentrancy attacks, but we also get an execution trace that reproduces the attack. Using a random forest classifier, our model achieved more than 90 percent accuracy on 105 transactions, showing the potential of our technique.

Installing and Using the Framework

The system consists of two subsystems:

Monitor subsystem
Detector subsystem

Detector subsystem

As explained in the paper, the detector is based on a machine learning model to detect the harmful transactions. The files related to the detector are the following:

prepare-data2.py in the main directory of the repository
The detector subdirectory

prepare-data2.py pre-processes the transaction data gathered by the monitor subsystem. The detector subdirectory contains all the pre-processed data (already processed by prepare-data2.py), ready to be used by our ML models.

The machine learning models are generated by the two Jupyter notebooks, dynamit and dynamit-no-callstack. The only difference between them is that one belongs to the experiment conducted with the average-call-stack-depth feature, while the other builds models without this feature. The notebooks are self-explanatory and contain comments about how they work. The first cell of both notebooks contains commands to install the same versions of the Python libraries we used for the experiments. The libraries used are:

matplotlib=3.2.2 (for plotting)
seaborn=0.10.1 (for plotting the heatmap of features)
scikit-learn=0.23.2 (used to create the models)
numpy=1.18.5
pandas=1.2.0
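
One way to confirm that an environment matches the versions listed above is to compare them against what is installed. The following is a minimal sketch and not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and the code relies only on the standard-library `importlib.metadata` module (Python 3.8 or newer).

```python
from importlib import metadata

# Versions listed in this README.
PINNED = {
    "matplotlib": "3.2.2",
    "seaborn": "0.10.1",
    "scikit-learn": "0.23.2",
    "numpy": "1.18.5",
    "pandas": "1.2.0",
}


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means the environment matches the listed versions; the `lookup` parameter exists only so the comparison can be exercised without touching the real environment.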

All of the plots used in the paper are generated by the Jupyter notebooks and stored in the detector/figures subdirectory. The file name of each plot carries a suffix identifying the experiment it belongs to.

The data for the experiments is located in detector/data. The main data set used for the experiments is detector/data/transactions2.csv. The other two csv files in the same directory belong to other experiments that are not included in the paper.

10-Fold Cross Validation

Since there are only 105 data points (gathered from 105 transactions), we employed cross validation so that all of the data is used for both training and testing. The same 5-fold cross-validated train and test sets were used to build all of the models in all of the experiments. The indices of these train and test sets are stored in detector/data/processing_steps/train_indices_i.csv and detector/data/processing_steps/test_indices_i.csv.
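
The idea of fixing the folds once and reusing them in every experiment can be sketched as follows. This is an illustration in plain Python rather than the notebooks' code: the function name `kfold_indices` and the seed value are assumptions, the value of `k` passed below is only an example, and the notebooks may well use scikit-learn's utilities instead.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k (train, test) index pairs.

    The shuffle is driven by a fixed seed, so the same folds are
    produced on every run and can be shared by all experiments.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    splits = []
    for i, test in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((sorted(train), sorted(test)))
    return splits


# 105 transactions; k is a free parameter in this sketch.
splits = kfold_indices(n_samples=105, k=5)
for i, (train, test) in enumerate(splits):
    print(f"fold {i}: {len(train)} train, {len(test)} test")
```

Persisting each pair of index lists, as the repository does with its train and test index files, lets every experiment load identical splits instead of regenerating them.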

Random Seed of the Algorithms

To obtain consistent results throughout the experiments, we used the same random seeds for numpy (via the np.random.seed function). For instance, to build similar random forests across the experiments, we used a list of 100 constants as the numpy seeds for generating each of the 100 trees in the forest. This can be seen in the Jupyter notebooks.
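
The effect of giving each tree its own fixed seed can be illustrated with a small sketch. It uses Python's standard `random` module in place of `np.random.seed` to stay dependency-free, and both the seed list and the helpers `bootstrap_sample` and `forest_samples` are hypothetical stand-ins for what the notebooks do.

```python
import random

# Hypothetical stand-in for the 100 constants used in the notebooks.
TREE_SEEDS = list(range(1000, 1100))


def bootstrap_sample(n_samples, seed):
    """Draw n_samples row indices with replacement using a dedicated seed."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def forest_samples(n_samples, seeds):
    """One bootstrap sample per tree, each tied to its own seed."""
    return [bootstrap_sample(n_samples, seed) for seed in seeds]


first_run = forest_samples(105, TREE_SEEDS)
second_run = forest_samples(105, TREE_SEEDS)

# Same seeds give the same samples, so each tree sees the same rows
# in both runs and the resulting forests are comparable.
print(first_run == second_run)
```

Because every tree draws from a generator seeded with its own constant, rerunning an experiment reproduces the same per-tree randomness regardless of what other code ran in between.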

Data Scaling

We scaled all data in all of our experiments using scikit-learn's StandardScaler before building the models.
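
Standardization rescales each feature to zero mean and unit variance via z = (x - mean) / std. The sketch below shows that computation in plain Python for a single feature column. It mirrors what StandardScaler computes with its default settings (population standard deviation), but the function names and sample values are illustrative only, and fitting on the training portion alone is an assumption about how the scaler is applied.

```python
import math


def fit_scaler(column):
    """Compute the mean and population standard deviation of one feature."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(variance)
    # A constant feature has zero spread; divide by 1 to avoid a zero division.
    return mean, (std if std > 0 else 1.0)


def transform(column, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(x - mean) / std for x in column]


# Illustrative values only, not taken from the data set.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 11.0]

mean, std = fit_scaler(train)
scaled_train = transform(train, mean, std)
# Test data is scaled with the statistics learned from the training data.
scaled_test = transform(test, mean, std)
print(mean, std, scaled_test)
```

Scaling matters most for models that are sensitive to feature magnitudes; applying it uniformly keeps the inputs comparable across all of the models.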

Monitor subsystem

The monitor collects metadata about the particular smart contracts assigned to it. It can connect to any Ethereum network, including the mainnet, testnets, and private Ethereum networks, and it needs this connection to probe the blockchain. The target network has to support a websocket provider, because the web3js library used by the monitor can only subscribe to blockchain events through a websocket provider.

Running the monitor is more complicated than just executing the nodejs application, as it needs a blockchain environment with deployed smart contracts to monitor. Here, we assume that you have a local Go-Ethereum installation with which to create your own private Ethereum network for running these tests. To install the packages required by the monitor, run the following command from the root of the repository (the recommended version of nodejs is 12.x):

`npm install`

Afterwards, you can start using the nodejs application of the monitor. The workflow of the monitor is:

The contracts should be placed in the contracts directory under the repository root. These are the contracts between which the transactions are issued.
The deploy-contract.js file should be used to deploy the contracts to your desired blockchain environment.
Afterwards, update the contents of the tx_fuzz.json file based on the information of the deployed contracts.
Before starting the monitor and initiating the transactions, collect the initial state of the contracts. Running the following command stores the balance of the deployed contracts before the experiment:

node get-sc-balance.js before
You can start the monitor by running the fuzz_2.json file. The monitor listens to the specified blockchain network (change the parameters at the top of the file accordingly) and receives information regarding the contracts you have given to it through the tx_fuzz.json file. When the experiment finishes, the monitor automatically saves several trace files (e.g. data/trace_i.json) and an out.csv file.
After the experiment finishes, run the following command to probe the balance of the smart contracts after the transactions have been issued:

node get-sc-balance.js after
Now execute get-tx-info.js, which probes the blockchain for more information about the transactions (it uses the already gathered data/out.csv file). The resulting file is data/final.csv, which is used by the detector: prepare-data2.py preprocesses final.csv and creates final_prepared.csv, which is then used by our models as transaction.csv.

Citing our work (Dynamit)

Please use the following bibtex entry to cite us:

@inproceedings{eshghieDynamicVulnerabilityDetection2021e,
    series = {{EASE} 2021},
    title = {Dynamic {Vulnerability} {Detection} on {Smart} {Contracts} {Using} {Machine} {Learning}},
    isbn = {978-1-4503-9053-8},
    url = {https://doi.org/10.1145/3463274.3463348},
    doi = {10.1145/3463274.3463348},
    abstract = {In this work we propose Dynamit, a monitoring framework to detect reentrancy vulnerabilities in Ethereum smart contracts. The novelty of our framework is that it relies only on transaction metadata and balance data from the blockchain system; our approach requires no domain knowledge, code instrumentation, or special execution environment. Dynamit extracts features from transaction data and uses a machine learning model to classify transactions as benign or harmful. Therefore, not only can we find the contracts that are vulnerable to reentrancy attacks, but we also get an execution trace that reproduces the attack. Using a random forest classifier, our model achieved more than 90 percent accuracy on 105 transactions, showing the potential of our technique.},
    urldate = {2022-04-22},
    booktitle = {Evaluation and {Assessment} in {Software} {Engineering}},
    publisher = {Association for Computing Machinery},
    author = {Eshghie, Mojtaba and Artho, Cyrille and Gurov, Dilian},
    month = jun,
    year = {2021},
    keywords = {Blockchain, Ethereum, Machine Learning for Dynamic Software Analysis, Smart Contracts, Vulnerability Detection},
    pages = {305--312}
}
