tormalwarefp

Description: This repository contains code and datasets for the ACM CCS 2022 paper:

Title: Exposing the Rat in the Tunnel: Using Traffic Analysis for Tor-based Malware Detection

Authors: Priyanka Dodia, Mashael AlSabah, Omar Alrawi, Tao Wang

Our proposed solution is a machine-learning-based prototype that identifies stealthy Tor-based malware C&C connections by applying traffic analysis to encrypted Tor traffic. The models further infer the type of malware from the Tor traffic by fingerprinting malicious behavior at the connection and host levels.

Note: Conference presentation slides PDF

Files & Control Flow:

Main file: classify_topk.py

USAGE (Python 3.7): python classify_topk.py --options [options_file] --topk [topk] --train|--zeroday

[options_file]: Options file defining parameter inputs for classification

[topk]: Number of most active Tor connections to use (k=1 or k=3)

[train]: Set option to train models for binary/multi-label classification

[zeroday]: Set option to test trained models on provided zeroday data
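The usage line above can be read as the following command-line interface. This is a minimal sketch using Python's standard argparse module, not the parser in classify_topk.py: the function name build_parser is hypothetical, and treating --train and --zeroday as mutually exclusive is an assumption drawn from the "--train/zeroday" notation.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the USAGE line; not the repository's code."""
    parser = argparse.ArgumentParser(prog="classify_topk.py")
    parser.add_argument("--options", required=True,
                        help="options file defining classification parameters")
    parser.add_argument("--topk", type=int, choices=(1, 3), required=True,
                        help="number of most active Tor connections to use")
    # Assumption: exactly one of --train / --zeroday selects the run mode.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--train", action="store_true",
                      help="train models for binary/multi-label classification")
    mode.add_argument("--zeroday", action="store_true",
                      help="test trained models on the zeroday data")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a shell.
    args = build_parser().parse_args(
        ["--options", "options-D5", "--topk", "3", "--train"])
    print(args.options, args.topk, args.train, args.zeroday)
```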

Datasets provided:

train_D5: Data used for training/validation/testing ML models
zerodaytest.zip: Zero day data for testing the trained models on unseen malware Tor traffic

Note: The data consists of cell files representing connections from a PCAP (i.e., Tor traffic obtained from malware/benign binary executions in the Falcon Sandbox). Connection-level features use Tor cell direction, time, and order information. Host-level features use information from all Tor connections in a PCAP and are appended to the end of each cell file.
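To illustrate the idea of "most active connections" and direction/time features, here is a small sketch. It is not the repository's feature extraction: the in-memory cell representation (a timestamp plus a +1/-1 direction), the function names, and the choice of cell count as the measure of activity are all assumptions made for illustration, and the features shown are generic examples rather than the feature set used in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory form of a cell file: (timestamp in seconds, direction),
# with direction +1 for an outgoing cell and -1 for an incoming cell.
Cell = Tuple[float, int]


def top_k_connections(connections: Dict[str, List[Cell]], k: int) -> List[str]:
    """Return the names of the k connections with the most cells.

    Assumption: "activity" is measured by the number of cells.
    """
    ranked = sorted(connections, key=lambda name: len(connections[name]),
                    reverse=True)
    return ranked[:k]


def connection_features(cells: List[Cell]) -> Dict[str, float]:
    """Compute a few illustrative direction/time features for one connection."""
    outgoing = sum(1 for _, direction in cells if direction > 0)
    incoming = sum(1 for _, direction in cells if direction < 0)
    times = [timestamp for timestamp, _ in cells]
    duration = max(times) - min(times) if times else 0.0
    return {
        "total": float(len(cells)),
        "outgoing": float(outgoing),
        "incoming": float(incoming),
        "duration": duration,
    }


if __name__ == "__main__":
    # Toy data standing in for the connections extracted from one PCAP.
    pcap = {
        "conn_a": [(0.0, 1), (0.2, -1), (0.5, -1)],
        "conn_b": [(0.0, 1)],
        "conn_c": [(0.1, 1), (0.4, -1)],
    }
    for name in top_k_connections(pcap, k=2):
        print(name, connection_features(pcap[name]))
```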

Option files provided:

options-D5
options-D5_host
options-zeroday_binary
options-zeroday_multilabel

1. Binary Classification: Classify Tor-based malware and benign connections

Scenarios:

Note(!): 'MULTICLASS' option must be set to 0 in options file

Train models with CONNECTION-LEVEL features only [Hayes et al. 2016], derived from the top 3 most active Tor connections
```
cmd: python classify_topk.py --options options-D5 --topk 3 --train
```
Train models with CONNECTION+HOST-LEVEL features [Dodia et al. 2022], using the top 3 most active Tor connections for connection-level features
```
cmd: python classify_topk.py --options options-D5_host --topk 3 --train
```

2. Multi-label Classification: Infer malware class type

Note(!): 'MULTICLASS' option must be set to 1 in options file

Use the same commands as for binary classification.

3. Zeroday Scenario: Test models using traffic from new, unseen binaries (EternalRocks malware)

Identify zeroday malware connections using the pre-trained binary classifier model

cmd: python classify_topk.py --options options-zeroday_binary --topk 3 --zeroday

Identify the type of malware (class labels) using the pre-trained multi-label classifier models
```
cmd: python classify_topk.py --options options-zeroday_multilabel --topk 3 --zeroday
```

Note:

All experiments can be run with topk=1 or topk=3 (the best results were achieved when the top 3 most active Tor connections are used for training and testing).
Host features can be activated/deactivated by setting HOSTFTS to True/False, or by commenting the option in/out in the options file.
Models trained with HOSTFTS must also be tested with the HOSTFTS option activated (i.e., in the zeroday option files).
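To run every documented command for both topk values, a small driver such as the one below may help. It is a sketch only: the helper names build_commands and run_all are hypothetical, it uses the current Python interpreter rather than a pinned 3.7 one, and it does not edit the options files, so MULTICLASS and HOSTFTS still have to be set by hand as described above. By default it only prints the commands.

```python
import subprocess
import sys
from typing import List

# Option files and flags exactly as listed in this README.
TRAIN_OPTIONS = ["options-D5", "options-D5_host"]
ZERODAY_OPTIONS = ["options-zeroday_binary", "options-zeroday_multilabel"]


def build_commands(topk_values=(1, 3)) -> List[List[str]]:
    """Build one argument list per documented experiment and topk value."""
    commands = []
    for topk in topk_values:
        # Training runs come first so the zeroday runs find trained models.
        for options_file in TRAIN_OPTIONS:
            commands.append([sys.executable, "classify_topk.py",
                             "--options", options_file,
                             "--topk", str(topk), "--train"])
        for options_file in ZERODAY_OPTIONS:
            commands.append([sys.executable, "classify_topk.py",
                             "--options", options_file,
                             "--topk", str(topk), "--zeroday"])
    return commands


def run_all(dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence at the first failing run.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Call run_all(dry_run=False) from the repository root to actually execute the runs.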
