High-resolution Image-based Malware Classification using Multiple Instance Learning

by Tim Peters, supervised by Hikmat Farhat

Overview

PyTorch implementation of our paper "High-resolution Image-based Malware Classification using Multiple Instance Learning":

Peters, T. & Farhat, H. (2023). High-resolution Image-based Malware Classification using Multiple Instance Learning. arXiv preprint arXiv:2311.12760.

Usage

/attention: Code for the attention-based MIL model

/baseline: Code for the baseline CNN model (non-MIL)
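For readers unfamiliar with attention-based MIL, the following is a minimal sketch of generic attention pooling over a bag of instance features. It is not the code in /attention: the class name AttentionMILPooling, the feature and attention dimensions, and the default of 9 classes (the number of families in BIG 2015) are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class AttentionMILPooling(nn.Module):
    """Hypothetical attention pooling head; not the repository's model."""

    def __init__(self, feat_dim: int = 64, attn_dim: int = 32, num_classes: int = 9):
        super().__init__()
        # Small network that assigns one scalar score to each instance.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        # Classifier applied to the pooled bag representation.
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, instances: torch.Tensor):
        # instances: (num_instances, feat_dim), the features of one bag.
        scores = self.attention(instances)        # (num_instances, 1)
        weights = torch.softmax(scores, dim=0)    # sums to 1 across instances
        bag = (weights * instances).sum(dim=0)    # (feat_dim,)
        logits = self.classifier(bag)             # (num_classes,)
        return logits, weights.squeeze(-1)


if __name__ == "__main__":
    model = AttentionMILPooling()
    bag_features = torch.randn(5, 64)  # a bag of 5 instances (random stand-in data)
    logits, weights = model(bag_features)
    print(logits.shape, weights.shape)
```

The attention weights give a per-instance importance score, which is what lets a bag of any size be reduced to a single fixed-length vector before classification.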

In each:

main.py: Trains the model with the Adam optimizer for 20 epochs and evaluates it on the test set. Also sets up Comet ML logging for metrics.

dataloader.py: Loads the malware samples as images and generates the bags. Parameters: lazy controls whether samples are pre-loaded into memory; test indicates that a test dataset is being loaded, in which case small images are also produced for logging to the Comet ML platform; adversarial and adversarial_type control adversarial enlargement.

model.py: The model implementation.

inference.py: Similar to main.py but only for measuring inference speed. Includes GPU warm-up & GPU sync.

inference_dataloader.py: Similar to dataloader.py but only for inference. No pre-loading into memory.

process_BIG2015_dataset.py: Converts the .bytes (hex dump) files from the Microsoft Malware Classification Challenge (BIG 2015) into .bin (raw binary) files, dropping the question-mark placeholders that stand in for unavailable bytes.
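The .bytes to .bin conversion described above can be sketched roughly as follows. This is a minimal illustration, not the contents of process_BIG2015_dataset.py: the function name bytes_text_to_binary is hypothetical, and it assumes each line starts with an address column followed by two-digit hex bytes, with ?? marking bytes to skip.

```python
def bytes_text_to_binary(text: str) -> bytes:
    """Convert the text of a .bytes hex dump into raw bytes (hypothetical helper)."""
    out = bytearray()
    for line in text.splitlines():
        tokens = line.split()
        # Assumption: the first token on each line is an address, not data.
        for token in tokens[1:]:
            if token == "??":
                # Placeholder for an unavailable byte; skip it.
                continue
            out.append(int(token, 16))
    return bytes(out)


if __name__ == "__main__":
    sample = "00401000 56 8D ?? 24\n00401010 ?? FF"
    print(bytes_text_to_binary(sample).hex())
```

In practice the function would be applied to the text of each .bytes file and the result written to a .bin file opened in binary mode.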

About

PyTorch implementation of my Master's thesis - "High-resolution Image-based Malware Classification using Multiple Instance Learning"
