biagiom/acsac22_spacephish

This document describes the Artifact of the paper “SpacePhish: The Evasion-space of Adversarial Attacks against Phishing Website Detectors using Machine Learning”. We also created a website with additional information: SpacePhish website

If you use any of our resources, we kindly ask you to cite our paper with the following BibTeX entry:

@inproceedings{apruzzese2022spacephish,
  title={SpacePhish: The Evasion-space of Adversarial Attacks against Phishing Website Detectors using Machine Learning},
  author={Apruzzese, Giovanni and Conti, Mauro and Yuan, Ying},
  booktitle={Proceedings of the Annual Computer Security Applications Conference (ACSAC)},
  year={2022},
  publisher={ACM, New York, USA},
  doi={10.1145/3564625.3567980}
}

Organization

This repository includes four main folders:

documents_folder: containing the main paper, and other supplementary documents;
ml_folder: containing the source-code of our main experiments;
preprocessing_folder: containing the code of our feature extractor and some attacks;
mlsec_folder: containing the code of our attacks against the detectors of MLSEC.

In the root folder of this repository, we have also provided a “requirements.txt” file, specifying which Python libraries were used to carry out all our experiments. Moreover, we also provided a document ("get_data.md") explaining how to retrieve the data for our experiments. This artifact entirely runs on CPU.
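As a quick sanity check of the layout described above, the following minimal sketch lists which of the expected folders and files are missing from a given root directory. The folder and file names are taken from this README; the helper name `missing_entries` is ours and is not part of the artifact. Note that data_folder is only present after the data has been downloaded separately.

```python
from pathlib import Path

# Names as listed in this README; "data_folder" must be downloaded separately.
EXPECTED_DIRS = [
    "documents_folder",
    "ml_folder",
    "preprocessing_folder",
    "mlsec_folder",
    "data_folder",
]
EXPECTED_FILES = ["requirements.txt", "get_data.md"]


def missing_entries(root):
    """Return the expected folders and files that are absent under `root`."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    gaps = missing_entries(".")
    print("Layout looks complete." if not gaps else f"Missing: {gaps}")
```

Running the script from the root of the repository prints either a confirmation or the list of entries that still need to be added.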

In what follows, we will first provide a high-level overview of the documents and data folders, and then explain how to use the corresponding code for a practical evaluation.

Disclaimer

Our paper tackles the problem of phishing website detection via machine learning (ML). As such, performing our experiments “today” and “from scratch” is likely to yield different results than those shown in the paper. This is due to two main reasons:

The “preprocessing” phase of each sample (i.e., a webpage) requires querying DNS servers. Such servers may give a different response today than the one we received when we performed our experiments.
The “machine learning” phase of our experiments entails the development of 900 ML models (by randomly drawing samples belonging to different datasets and using diverse ML algorithms analyzing feature sets). The results reported in the paper are the average of all such evaluations. Hence, a “novel” experiment is likely to lead to a different outcome (due to the large role played by randomness in the general context of ML).

To account for the above, and to facilitate the reviewing process, we:

[data] report the preprocessed version of each sample (for both its original and adversarial variant) which we used for our ML experiments.
[code] provide three Jupyter notebooks describing a single “run” of our ML experiments (on a single dataset) with a “random seed” whose results match those in our paper.

We first explain the documents folder, then the data folder (which must be downloaded separately), and finally the folders containing the source code.

documents_folder

This folder contains 6 files:

ACSAC_SpacePhish-paper.pdf, which is the main paper.
ACSAC_SpacePhish-supp.pdf, which is a document explaining (at a high level) some implementation details (this document was provided to the reviewers during the submission).
reference_tables.png, which are three images showing the “range” of the results we obtained during our experiments (one for each algorithm). We expect that any future experiment will achieve results within (or close to) such range (we repeated our experiments 50 times).
mlsec_results.xlsx, which is a spreadsheet containing the full results of all our cheap attacks on the detectors of the MLSEC competition.

data_folder

Preliminary Information

Let us provide some essential background information for those who are not experts in the specific problem tackled by our paper.

Sources. Our paper entails experiments carried out on two datasets: DeltaPhish and Zenodo (both of which are publicly available), containing “raw data” of webpages (benign or phishing). For transparency, we include in this repository all such “raw data”, which will be deleted after the review of the artifact (to avoid potential copyright violations). We will, however, maintain the preprocessed version of each webpage.
Attacks. Our paper entails “adversarial attacks against machine learning”, whose basic principle is to (i) take a sample, (ii) manipulate such sample in some way, and (iii) assess whether the “adversarial sample” evades a given ML model or not. Specifically in our case, we consider a total of 12 adversarial attacks, meaning that we artificially create 12 “adversarial variants” of each “original sample” (i.e., a phishing webpage). Some of these variants are created “at runtime”, whereas the others are created “in advance” (we did this by manually manipulating each raw sample).
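To make the three steps above concrete, here is a minimal toy sketch in plain Python. Everything in it is an assumption made for illustration: `toy_detector` is a hypothetical threshold rule standing in for an ML model, `perturb` is an arbitrary manipulation, and the feature vectors are random numbers. It does not reproduce any attack or detector from the paper.

```python
import random


def toy_detector(features, threshold=0.5):
    """Hypothetical stand-in for an ML detector: flag a sample as phishing
    when the mean of its feature values exceeds `threshold`."""
    return sum(features) / len(features) > threshold


def perturb(features, strength=0.3):
    """Step (ii): manipulate a sample, here by lowering every feature value
    by `strength` (clipped at zero)."""
    return [max(0.0, x - strength) for x in features]


def evasion_rate(samples, detector, attack):
    """Step (iii): among the samples the detector flags, return the fraction
    whose adversarial variant is no longer flagged."""
    detected = [s for s in samples if detector(s)]
    if not detected:
        return 0.0
    evaded = sum(1 for s in detected if not detector(attack(s)))
    return evaded / len(detected)


# Step (i): take some samples (random toy vectors, not real webpages).
rng = random.Random(0)
phishing = [[rng.uniform(0.4, 1.0) for _ in range(5)] for _ in range(100)]
print(f"evasion rate: {evasion_rate(phishing, toy_detector, perturb):.2f}")
```

The evasion rate is computed only over samples that were detected before the manipulation, so that samples the detector already missed are not counted as successful attacks.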

Structure

The folder is organized by dataset (Zenodo or DeltaPhish) and by format (raw or preprocessed). Let us explain both formats:

raw: this folder contains the “original” data as well as the adversarial variants of each sample.
- normal. This folder contains information on the “original” webpages. It contains a JSON file with the URLs of each sample; and an “HTML” folder containing the raw HTML of each sample
- wa. This folder refers to the “cheap” attacks considered in our paper. It contains files including the HTML of phishing webpages after applying the “cheap” HTML manipulation.
- wa+. This folder refers to (a subset of) the “advanced” attacks of our paper. It contains files including the HTML of phishing webpages after applying the “advanced” HTML manipulations.
preprocessed: this folder includes data describing the “preprocessed” format of each sample in the “raw” folder, i.e., after the application of our custom-built feature extractor.
- normal. This folder contains a single JSON file describing the feature representation of each “normal” sample (benign and phishing)
- wa and wa+. These folders contain three subfolders ("u, r, c") each referring to a specific variant of our wa/wa+ attacks. Each subfolder has a single JSON file, which contains the feature representation of each (phishing) sample after applying adversarial manipulation.
- phish_sub_test_x_100.pkl. This is a “pickle” file including the 100 samples used as the basis for our adversarial attacks in the preprocessing space.
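The preprocessed data described above can be read with the Python standard library. The sketch below is a hedged example: the README does not state the JSON file names, so `load_single_json` (a helper name of ours) discovers the single JSON file of a subfolder by pattern, and the path used in the usage part is hypothetical and must be adapted to the actual layout of the downloaded data. Depending on how the pickle file was written, loading it may also require the libraries listed in requirements.txt.

```python
import json
import pickle
from pathlib import Path


def load_single_json(folder):
    """Load the only *.json file found in `folder`.

    The file name is not assumed; it is discovered with a glob pattern.
    """
    candidates = sorted(Path(folder).glob("*.json"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one JSON file in {folder}, found {len(candidates)}"
        )
    with candidates[0].open(encoding="utf-8") as fh:
        return json.load(fh)


def load_pickle(path):
    """Load a pickle file. Only unpickle files from a source you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # Hypothetical path: adjust it to where the data was actually unpacked.
    variant_dir = Path("data_folder") / "zenodo" / "preprocessed" / "wa" / "u"
    if variant_dir.is_dir():
        features = load_single_json(variant_dir)
        print(type(features).__name__, len(features))
```

Raising an error when a subfolder holds zero or several JSON files makes a wrong path or an incomplete download visible immediately, instead of silently loading the wrong file.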

preprocessing_folder

This folder contains 4 files:

extractor.py: a Python script that analyzes a sample (a phishing webpage) and extracts its feature representation.
feature_extraction.ipynb: a (small) notebook showcasing the application of extractor.py on a single sample.
PA_PSP.ipynb: a notebook that applies the perturbations related to the preprocessing attacks considered in our paper.

ml_folder

This folder contains 4 files, which refer to the experiments performed on the “DeltaPhish” dataset:

ML.py, containing some custom-defined functions for developing our ML models and printing their results
RF/CN/LR_experiments.ipynb, which are notebooks containing the experiments for each of the 3 main ML algorithms considered in our paper (RF=random forest, LR=logistic regression, CN=convolutional neural network).

mlsec_folder

This folder contains the data and code for the attacks against the detectors of MLSEC. It contains one folder and two files:

data, which is a folder containing the “original” webpages provided by MLSEC; as well as a subfolder “wsp” in which the adversarial variants of such originals will be saved (we already included all the variants generated via our WA attacks)
mlsec_artifact_manipulate.ipynb, which is a notebook containing all the simple manipulations described in our supplementary material, as well as the queries to the MLSEC API
mlsec_artifact_checker.ipynb, which is a notebook that provides a "bulk" checking of all the webpages (original and adversarial ones) created via the previous notebook (UPDATE: Unfortunately, the MLSEC API has not been supported by its developers since December 2022, so this notebook will not work properly)

INSTRUCTIONS

Let us explain how to use our artifact.

Get the data, and install the requirements. This is self-explanatory; we recommend creating a dedicated virtual environment for this purpose (PyCharm works very well). Important: the data_folder should be placed in the root directory!
Test the feature extractor. Simply run the preprocessing_folder/feature_extraction.ipynb notebook once. It should prove that the feature extractor “works”.
Create the adversarial samples corresponding to PA. Simply run the preprocessing_folder/PA_PSP.ipynb once.
Test the attacks. Consider any of the three notebooks (e.g., “ml_folder/experiments_RF.ipynb”) and run all of its cells. The LR and RF do not take long to train, whereas the CN can take several minutes (the runtime on our platform is provided in the documentation). Every cell reports the part in the paper in which the corresponding result is “shown”. Due to randomness, the results can differ from those in the paper (which are provided just as the average and std. dev.): please refer to the “reference_tables.png” file to assess the fidelity of a given result.
Check the MLSEC results. Run the mlsec_folder/mlsec_artifact_checker.ipynb notebook, which automatically queries the MLSEC API and provides the results described in the supplementary material and reported in the documents_folder/mlsec_results.xlsx file. Note that the MLSEC API has not been supported by its developers since December 2022, so the queries will no longer work properly; the results we obtained can still be consulted in the spreadsheet.
Play around with the MLSEC notebook. Run the mlsec_folder/mlsec_artifact_manipulate.ipynb notebook and see its effects on a specific webpage. Feel free to “visually” inspect the adversarial variant of any given webpage, as well as change the number of links added, or the corresponding string. Since the MLSEC API has not been supported since December 2022, the cells that query it will not work properly.
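For the “Test the attacks” step, a result obtained from a new run can be compared against a reference range by hand or with a few lines of code. The sketch below assumes you have copied the lower and upper bounds from the reference tables yourself; the function names and all numbers are made up for illustration and are not results from the paper.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, sample standard deviation) of a metric over repeated runs."""
    return mean(runs), stdev(runs)


def within_range(value, low, high, tolerance=0.0):
    """True when `value` lies in [low - tolerance, high + tolerance]."""
    return (low - tolerance) <= value <= (high + tolerance)


# Made-up numbers for illustration only; they are not results from the paper.
example_runs = [0.93, 0.95, 0.94, 0.96, 0.92]
avg, sd = summarize(example_runs)
print(f"mean={avg:.3f} std={sd:.3f}")
print(within_range(0.945, low=min(example_runs), high=max(example_runs)))
```

The optional `tolerance` argument reflects the expectation that new results fall within, or close to, the reference range.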
