msrocean / continual-learning-malware

This repository contains code and data of the paper **On the Limitations of Continual Learning for Malware Classification**, accepted to be published at the First Conference on Lifelong Learning Agents (CoLLAs).

On the Limitations of Continual Learning for Malware Classification

This repository contains the code and data for the paper On the Limitations of Continual Learning for Malware Classification, accepted at the First Conference on Lifelong Learning Agents (CoLLAs). The code and dataset can be used to reproduce the results presented in the paper.

[Paper] [Video]

Code Credit

We would like to convey our special thanks to Gido Van De Ven. We reused and adapted his implementations of different continual learning techniques from the continual-learning and brain-inspired-replay repositories.

Dependencies & Required Packages

Please make sure you have all the dependencies available and installed.

  • NVIDIA GPU
  • CentOS Linux 7 (Ubuntu should also work)
  • python3-venv or conda
  • CUDA Version: 10.2
  • Python Version: 3.7.X

Install the required packages using the following command: pip install -r requirements.txt

We suggest creating a Python virtual environment or a conda environment and installing the required packages inside it.

conda create -n cl_malware python=3.7
conda activate cl_malware
conda install numpy=1.20.2
conda install pytorch=1.6.0 torchvision=0.7.0 cudatoolkit=10.2 -c pytorch

To use the given conda environment

Using Anaconda 3 on Linux, run the following:

conda env create -f cl_malware.yml
conda activate cl_malware
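
Once the environment is active, a quick sanity check can confirm that it matches the versions listed above. The following is a minimal sketch, not part of the repository: the function name check_environment is hypothetical, and it only reports what it finds rather than enforcing anything.

```python
import sys


def check_environment(expected_python=(3, 7)):
    """Report the Python, PyTorch, and CUDA status of the active environment.

    `expected_python` defaults to the 3.7.X series listed in the dependencies.
    """
    report = {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "python_matches": tuple(sys.version_info[:2]) == tuple(expected_python),
    }
    try:
        import torch  # installed by the conda commands above
    except ImportError:
        # PyTorch is missing, so the environment is probably not activated.
        report["torch"] = None
        report["cuda_available"] = False
    else:
        report["torch"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If torch is reported as None or cuda_available is False, the environment is likely not activated or the CUDA toolkit does not match the installed driver.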

Dataset

Drebin dataset:

We cannot share the data, in order to honor the requirements of the data publisher. However, anyone can request access from the Drebin data publisher. The dataset is available at [Drebin Dataset]. After downloading, please put the data into the drebin_data folder. Afterwards, drebin_data/Drebin_Data_Process-CoLLAs-2022.py will generate the training and testing data for the top 18 families, which are used in the Class-IL and Task-IL experiments.

EMBER 2018 dataset:

The EMBER dataset is public and can be downloaded from the [EMBER repository]. Please note that we use the 2018 version of the data with feature version 2. For Domain-IL experiments, we process the data into month-by-month train-test splits from January 2018 to December 2018. For Task-IL and Class-IL, we process the data by malware family. The EMBER dataset has 2900 malware families; we select the top 100 families, each with at least 400 samples.
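
The family selection rule described above (the most frequent families, each with a minimum number of samples) can be sketched as follows. This is an illustrative sketch only, not the code in the repository's processing script: the function names are hypothetical, and it assumes each sample carries a family label that may be missing (None) for unlabeled samples.

```python
from collections import Counter


def select_top_families(family_labels, top_k=100, min_samples=400):
    """Return up to `top_k` family names, most frequent first,
    keeping only families with at least `min_samples` samples."""
    # Skip samples without a family label (assumed to be None or empty).
    counts = Counter(label for label in family_labels if label)
    eligible = [fam for fam, n in counts.most_common() if n >= min_samples]
    return eligible[:top_k]


def filter_by_family(samples, family_labels, keep):
    """Keep only the (sample, family) pairs whose family is in `keep`."""
    keep = set(keep)
    return [(x, fam) for x, fam in zip(samples, family_labels) if fam in keep]
```

With the defaults, select_top_families applies the thresholds stated above (100 families, 400 samples); the resulting family list then defines the classes used for the Task-IL and Class-IL data.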

  • EMBER data needs two separate processing passes: (i) one for Task-IL and Class-IL experiments, and (ii) another for Domain-IL experiments.

  • Task-IL and Class-IL data: run ember_data/EMBER_2018_TASK_CLASS_IL_FAMILY-CoLLAs-2022.py

  • Domain-IL data: run ember_data/EMBER_2018_DOMAIN_IL_data_process_with_family_labels-CoLLAs-2022.py
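
To illustrate what a month-by-month split for Domain-IL looks like, here is a rough sketch. It is not the repository's processing script. It assumes each record is a dict with an "appeared" field in "YYYY-MM" form, and the test_fraction and seed parameters are hypothetical choices made for illustration only.

```python
import random
from collections import defaultdict

# One task per month, January 2018 through December 2018.
MONTHS_2018 = [f"2018-{month:02d}" for month in range(1, 13)]


def split_by_month(records, months=MONTHS_2018, test_fraction=0.1, seed=0):
    """Group records by month, then split each month into train and test."""
    by_month = defaultdict(list)
    for record in records:
        # Records outside the requested months are ignored.
        if record.get("appeared") in months:
            by_month[record["appeared"]].append(record)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    tasks = []
    for month in months:
        items = list(by_month[month])
        rng.shuffle(items)
        n_test = int(len(items) * test_fraction)
        tasks.append(
            {"month": month, "train": items[n_test:], "test": items[:n_test]}
        )
    return tasks
```

Each element of the returned list corresponds to one monthly task in chronological order, which is the order in which a Domain-IL learner would see them.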

Experiments

Drebin Experiments:

  • Task-IL: run drebin_exps/Drebin_Task_IL.sh

  • Class-IL: run drebin_exps/Drebin_Class_IL.sh

EMBER Experiments:

  • Domain-IL: run ember_domain_exps/EMBER_Domain_IL.sh

  • Class-IL: run ember_class_task_exps/EMBER_Class_IL.sh

  • Task-IL: run ember_class_task_exps/EMBER_Task_IL.sh
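
For convenience, the experiment scripts listed above can be launched from a small driver. The sketch below is not part of the repository. It assumes the scripts are invoked with bash from the repository root, which may not match how they expect to be run; the function names and the dry_run flag are hypothetical.

```python
import subprocess

# Script paths exactly as listed in the Experiments section.
EXPERIMENT_SCRIPTS = {
    ("drebin", "task"): "drebin_exps/Drebin_Task_IL.sh",
    ("drebin", "class"): "drebin_exps/Drebin_Class_IL.sh",
    ("ember", "domain"): "ember_domain_exps/EMBER_Domain_IL.sh",
    ("ember", "class"): "ember_class_task_exps/EMBER_Class_IL.sh",
    ("ember", "task"): "ember_class_task_exps/EMBER_Task_IL.sh",
}


def build_command(dataset, scenario):
    """Return the command for one dataset/scenario pair."""
    return ["bash", EXPERIMENT_SCRIPTS[(dataset, scenario)]]


def run_experiment(dataset, scenario, repo_root=".", dry_run=True):
    """Print the command, and run it from `repo_root` unless dry_run is set."""
    command = build_command(dataset, scenario)
    print(" ".join(command))
    if not dry_run:
        # check=True raises CalledProcessError if the script fails.
        subprocess.run(command, check=True, cwd=repo_root)
    return command
```

For example, run_experiment("ember", "domain", dry_run=False) would start the EMBER Domain-IL experiments; an unknown dataset/scenario pair raises a KeyError.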

Reference Format

@inproceedings{continual-learning-malware,
  title = {{On the Limitations of Continual Learning for Malware Classification}},
  author = {Rahman, Mohammad Saidur and Coull, Scott E. and Wright, Matthew},
  booktitle = {First Conference on Lifelong Learning Agents (CoLLAs)},
  year = {2022},
  publisher = {Proceedings of Machine Learning Research (PMLR)}
}

Acknowledgements

We thank our anonymous reviewers for helpful suggestions and comments. This work was funded by the National Science Foundation under Grant Number 1816851.

License: MIT License

Languages: Python 92.3%, Jupyter Notebook 6.1%, Shell 1.6%