DeepSE-WF: Unified Security Estimation for Website Fingerprinting Defenses

This is the code for evaluating Website Fingerprinting (WF) defenses, associated with the paper "DeepSE-WF: Unified Security Estimation for Website Fingerprinting Defenses."

Citation

If you use this code or want to build on this research, please cite our paper:

@article{deepse-wf,
  title={DeepSE-WF: Unified Security Estimation for Website Fingerprinting Defenses},
  author={Veicht, Alex and Renggli, Cedric and Barradas, Diogo},
  journal={Proceedings on Privacy Enhancing Technologies},
  year={2023},
  volume={2023},
  number={2},
  address={Lausanne, Switzerland}
}

Background

Background: Website fingerprinting (WF) attacks have been a growing concern in the field of network security. These attacks, carried out by an eavesdropper on a network, can accurately identify the websites visited by a user by analyzing their traffic patterns, even when the user is accessing the internet through encrypted channels such as Tor or VPNs. This makes WF attacks a serious threat to user privacy and anonymity online.

To counter this threat, several defenses have been proposed in recent years, including randomized packet padding, traffic morphing, and obfuscation techniques. However, the effectiveness of these defenses is often difficult to assess, as attackers can adapt their strategies to bypass them.

To evaluate the security of these defenses, previous works have proposed feature-dependent theoretical frameworks that estimate the Bayes error or mutual information leaked by manually-crafted features. However, as WF attacks increasingly rely on deep learning and latent feature spaces, these frameworks can no longer provide accurate security estimations.

To address this issue, this work proposes DeepSE-WF, a novel WF security estimation framework that leverages specialized kNN-based estimators to produce Bayes error and mutual information estimates from learned latent feature spaces. This approach bridges the gap between current WF attacks and security estimation methods and produces tighter security estimates than previous frameworks.

Setup

The code works for Python 3.8.5. All examples assume a unix shell. First install the requirements using pip

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

or using conda

conda env create -f environment.yml
conda activate deepse

If none of the above work, you may use docker with the provided Dockerfile. The following commands will build the image and execute into the container. From there, you can run the commands as described below. Note that the performance of the docker container may be worse than the native installation. All data will be deleted when the container is stopped.

docker build -t deepse .
docker run -it deepse

If you are using docker and want to recreat the plots, run the following command:

docker run -p 8888:8888 -v $(pwd)/plots:/home/jovyan/ jupyter/scipy-notebook

This will start a jupyter notebook server. After the server has started, it will print a link to the console. Open this link in your browser. There you can find the notebook to recreate the plots.

Estimating the Security of Website Fingerprinting Defenses

This section describes how to estimate the security of website fingerprinting (WF) defenses using the DeepSE-WF framework.

Dataset Format

We consider the dataset collected to consist of files where each trace is stored in the form $W-$T, where $W is the website index and $T is the trace index (both starting from 0). For example, "1-3" is the fourth page load of the second website.

Each of these files contains, per row:

t_i<tab>s_i

with t_i and s_i indicating respectively time and size of the i-th packet. The sign of s_i indicates the packet's direction (positive means outgoing). Note: because this dataset represents Tor traffic, where packets' sizes are fixed, s_i will effectively only indicate the direction, taking value in {-1, +1}.

Preparing the Dataset

Either generate your own data with your defense of download the AWF dataset following the instructions in Prepare the AWF Dataset and simulate your defense on the dataset.

Preparing the Data for DeepSE-WF

The create_dataset.py script takes all the trace files, brings them into the correct format and saves them into a numpy matrix. The data is stored in a .npz file which contains the arrays traces and labels. For example:

python preprocessing/create_dataset.py \
  --in_path <path-to-folder-containing-processed-traces> \
  --out_path <output-file> \
  --n_websites <number-of-websites-to-use> \
  --n_traces <number-of-traces-to-use-per-website>

Measuring the Security

In order to estimate the security for a specific defense, simply run main.py. The results will stored in the file specified by --log_file. For example, we can estimate the security for the dataset from above as follows:

python main.py \
--dataset_path <path-to-dataset.npz> \
--n_traces <number-of-traces-per-website> \
--log_file <output-file>

This will run 5-fold cross validation and report the Bayes Error Rate and Mutual Information estimation in the log.txt file. If you have a GPU, you can use it by adding --device cuda and if you have multiple GPU's on your machine, you can select one using --gpu_id id.

Example Usage

In order to reproduce the results, you can either download the preprocessed traces or follow the instructions in Prepare the AWFDataset to generate the dataset yourself.

Downloading Preprocessed Dataset

The preprocessed awf dataset (100 websites and 4500 traces each) is available here. It can also be downloaded using the following commands (This will take about 22 GB of disk space):

mkdir -p data/dataset/awf
wget -O data/dataset/awf/NoDef.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FAWF-100-4500\&files\=NoDef.npz
wget -O data/dataset/awf/wtfpad.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FAWF-100-4500\&files\=wtfpad.npz
wget -O data/dataset/awf/Front_T1.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FAWF-100-4500\&files\=Front_T1.npz
wget -O data/dataset/awf/Front_T2.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FAWF-100-4500\&files\=Front_T2.npz
wget -O data/dataset/awf/cs_buflo.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FAWF-100-4500\&files\=cs_buflo.npz
wget -O data/dataset/awf/tamaraw.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FAWF-100-4500\&files\=tamaraw.npz

This will download the preprocessed traces in the data/datasets/awf folder.

The DS19 dataset is available at the same link and can be downloaded using the following commands (This will take about 1 GB of disk space):

mkdir -p data/dataset/ds19
wget -O data/dataset/ds19/NoDef.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FDS19-100-100\&files\=NoDef.npz
wget -O data/dataset/ds19/wtfpad.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FDS19-100-100\&files\=wtfpad.npz
wget -O data/dataset/ds19/Front_T1.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FDS19-100-100\&files\=Front_T1.npz
wget -O data/dataset/ds19/Front_T2.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FDS19-100-100\&files\=Front_T2.npz
wget -O data/dataset/ds19/cs_buflo.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FDS19-100-100\&files\=cs_buflo.npz
wget -O data/dataset/ds19/tamaraw.npz https://polybox.ethz.ch/index.php/s/2hyigdcNv33y33z/download\?path\=%2FDS19-100-100\&files\=tamaraw.npz

Once you have downloaded the dataset, you can directly run the experiments as described in Measuring Security.

Prepare the AWF Dataset

The dataset by Rimmer et al. can be downloaded from here or using the following command (This will take about 90 GB of disk space):

wget https://distrinet.cs.kuleuven.be/software/tor-wf-dl/files/closed_world_csvs.tar.gz

Then, extract the collection of .tar.gz files using the following command (This will take another 90 GB of disk space):

tar -xvzf closed_world_csvs.tar.gz

Currently, tor_run_v1_000.tar.gz seems to be corrupted. You can remove it using the following command:

rm closed_world/tor_run_v1_000.tar.gz

In order to free up some space, you can remove the closed_world_csvs.tar.gz file using the following command:

rm closed_world_csvs.tar.gz

The dataset is provided as a collection of .tar.gz files, each containing a set of websites and traces. The dataset can be extraced and cleaned using extract_awf_tar.py:

python preprocessing/extract_awf_tar.py \
  --in_path <path-to-folder-containing-tar-files> \
  --out_path <output-folder>

e.g.

python preprocessing/extract_awf_tar.py \
  --in_path /Downloads/closed_world/ \
  --out_path data/awf

This will read all .tar.gz files in the input directory and extract them to the output directory. The output directory will contain a subdirectory for each website, containing the traces for that website.

Preprocessing the Traces

In order to create a clean dataset, we need to preprocess the traces. This is done using create_nodef.py:

python preprocessing/create_nodef.py \
  --in_path <path-to-folder-containing-extracted-websites> \
  --out_path <output-folder> \
  --n_websites <number-of-websites-to-use> \
  --n_traces <number-of-traces-to-use-per-website>

e.g.

python preprocessing/create_nodef.py \
  --in_path data/awf \
  --out_path data/traces/NoDef_awf \
  --n_websites 100 \
  --n_traces 100

This will first count the available traces and websites, and then create a new dataset with the specified number of websites and traces if enough websites/traces are available. The output directory will contain all traces in the form $W-$T, where $W is the website index and $T is the trace index (both starting from 0).

Prepare the DS19 Dataset

The dataset by Wang et al. can be downloaded from here or using the following command:

wget https://www.cs.sfu.ca/\~taowang/wf/20000.zip

In order to prepare the dataset, unzip the data:

unzip 20000.zip
rm 20000.zip

Then, move the traces to a new folder data/traces/DS19 and remove the old folder:

mkdir -p data/traces/DS19
mv 20000/*-*.cell data/traces/DS19
rm -rf 20000

Finally remove the .cell extension from all files:

for f in data/traces/DS19/*; do mv "$f" "${f%.cell}"; done

Defending the Dataset

In order to defend the dataset, you can run the simulate_defenses.sh script with the path to the undefended traces as input

bash simulate_defenses.sh <path-to-undefended-traces>

e.g.

bash simulate_defenses.sh ../data/traces/NoDef_awf

Make sure that you are in the defense folder (pwd should end with DeepSE-WF/defenses). This will run all defenses and store the results in the data/defended folder.

Preparing the Data for DeepSE-WF

Preparation of the data for DeepSE-WF is done using create_dataset.py:

python preprocessing/create_dataset.py \
  --in_path data/traces/NoDef_awf \
  --out_path data/dataset/awf/NoDef.npz \
  --n_websites 100 \
  --n_traces 100

loads 100 website and 100 traces per website from from data/traces/NoDef and stores them into data/dataset/NoDef.npz.

Measuring Security

In order to estimate the security for a specific defense, simply run main.py:

python main.py --data_path data/dataset/awf/NoDef.npz \
  --n_traces 100 \
  --log_file log.txt

This will run 5-fold cross validation and report the Bayes Error Rate and Mutual Information estimation in the log.txt file.

Creating the Plots

All the plots can be recreated using the plots/create_plots.ipynb notebook. The plots will be stored in the plots/outputs directory. The results from the paper are stored in the plots/values directory which are copied values from the log files.

In order to recreate the plots, you may adapte the main function to store the relevant information to a .csv with the corresponding columns from the tables in plots/values or update the current values manually. Then you can use the notebook to recreate the plots.

Acknowledgements

We use the codebase by Cherubin et al for the wfes estimations as well as the tamaraw and cs-buflo defense implementations. We also use the codebase by Gong et al. for the Front defense implementation as well as the codebase by Rahman et al. for the Mutual information estimation by wefde.

The df implementation is based on the this repository, the awf implementation is based on the this repository and the tf implementation is based on the this repository. Finally, the var_cnn implementation is based on the this repository.

Commit Hash

The commit hash of the code used for the paper is c2a915f4117531ec7ae80092b4ccbefa51591479.

veichta / DeepSE-WF

DeepSE-WF: Unified Security Estimation for Website Fingerprinting Defenses

Citation

Background

Setup

Estimating the Security of Website Fingerprinting Defenses

Dataset Format

Preparing the Dataset

Preparing the Data for DeepSE-WF

Measuring the Security

Example Usage

Downloading Preprocessed Dataset

Prepare the AWF Dataset

Preprocessing the Traces

Prepare the DS19 Dataset

Defending the Dataset

Preparing the Data for DeepSE-WF

Measuring Security

Creating the Plots

Acknowledgements

Commit Hash

About

Languages