ALFA-group / adv-malware-viz

"On Visual Hallmarks of Robustness to Adversarial Malware" by Alex Huang, Abdullah Al-Dujaili, Erik Hemberg, Una-May O'Reilly

Code repository for the paper On Visual Hallmarks of Robustness to Adversarial Malware

  • A series of related blog posts can be found here.

Installation

If you have conda installed, cd to the main directory and run the following, using osx_environment.yml on macOS or linux_environment.yml on Linux, respectively.

conda install nb_conda
conda config --add channels conda-forge
conda env create --file ymls/(osx|linux)_environment.yml

This will create an environment called nn_mal.

To activate this environment, execute:

source activate nn_mal

PS1: If you're going to use Losswise, you may run into an issue with a single print statement whose argument is not enclosed in parentheses; add the parentheses if this error shows up and you're good to go.

PS2: If you're running the code on macOS with CUDA, note that according to pytorch.org, "macOS binaries don't support CUDA, install from source if CUDA is needed."

Jupyter Notebook Code Walkthrough - Synthetic Data

jupyter_tutorial.ipynb provides a walkthrough of the code and each of the figures using a synthetic dataset in which malicious vectors have bits set with probability 0.2 and benign vectors have bits set with probability 0.8.
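
For intuition, here is a minimal sketch of how such a synthetic dataset could be generated (an illustration with numpy, not the notebook's actual code; the feature dimension and sample counts are arbitrary):

import numpy as np

rng = np.random.default_rng(seed=0)
num_features = 1024          # arbitrary feature dimension for illustration
num_samples = 500            # samples per class

# Malicious vectors: each bit is set independently with probability 0.2
malicious = (rng.random((num_samples, num_features)) < 0.2).astype(np.float32)

# Benign vectors: each bit is set independently with probability 0.8
benign = (rng.random((num_samples, num_features)) < 0.8).astype(np.float32)

X = np.vstack([malicious, benign])
y = np.concatenate([np.ones(num_samples), np.zeros(num_samples)])  # 1 = malicious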

Make sure your Jupyter notebook kernel is set to the nn_mal conda environment. To have nn_mal show up in the notebook under Kernel -> Change kernel, run this command after activating the environment:

python -m ipykernel install --user --name nn_mal --display-name "nn_mal"

Full Process Walkthrough - Sample Dataset

1. Assembling Portable Executable (PE) Dataset

The first step is to gather a dataset of benign and malicious PE files. Each sample is then turned into its corresponding feature vector after examining the entire dataset to create a mapping from imported function to index. We do not include the actual samples in this repo but we provide the generated feature vectors in sample_dataset_saved_feature_vectors and describe the process in (2).
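
As a rough illustration of this idea, the sketch below scans PE files with the pefile library and assigns each imported function an index (an assumption for illustration; the repo's extraction code and naming may differ):

import os
import pefile

def imported_functions(path):
    """Return the set of imported function names for one PE file."""
    pe = pefile.PE(path, fast_load=True)
    pe.parse_data_directories()
    names = set()
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        for imp in entry.imports:
            if imp.name:
                names.add(imp.name.decode(errors="ignore"))
    return names

def build_mapping(directories):
    """Scan the whole dataset once and map each imported function to an index."""
    mapping = {}
    for directory in directories:
        for fname in os.listdir(directory):
            for func in imported_functions(os.path.join(directory, fname)):
                mapping.setdefault(func, len(mapping))
    return mapping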

2. Generating Feature Vector Files

In order to save time during training, we generate the feature vector for each file once and save it as a pickle file, rather than recreating the feature vector every time we load a PE file. To do this, modify the malicious_filepath and benign_filepath parameters in parameters.ini to match the locations of your malicious and benign files, respectively. Change the location of the saved vectors by modifying the saved_vectors_directory parameter. To generate the vectors, run:

python generate_vectors.py
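
Conceptually, the script maps each file's imports onto the global function-to-index mapping and pickles the resulting binary vector; a hedged sketch (the helper names and file layout here are hypothetical):

import os
import pickle
import numpy as np

def vectorize(function_names, mapping):
    """Binary feature vector: 1 at the index of every imported function."""
    vec = np.zeros(len(mapping), dtype=np.float32)
    for func in function_names:
        if func in mapping:
            vec[mapping[func]] = 1.0
    return vec

def save_vector(vec, sample_name, saved_vectors_directory):
    """Pickle one feature vector so it can be reloaded cheaply during training."""
    out_path = os.path.join(saved_vectors_directory, sample_name + ".pkl")
    with open(out_path, "wb") as f:
        pickle.dump(vec, f)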

3A. Train Model

NOTE: This is the step to start at when running this code for the first time.

The file framework.py performs the actual model training and parameters.ini provides the specifications. This design pattern is used throughout the various packages. For the sample dataset, we set use_saved_feature_vectors to True in order to use the feature vectors generated in step 2. To train a model, simply run:

python framework.py parameters.ini

Sections 3B, 3C, and 3D provide an overview of the parameters available.

3B. parameters.ini Dataset Parameters Explanation

  • malicious_filepath - directory containing malicious PE files or saved feature vectors
  • benign_filepath - directory containing benign PE files or saved feature vectors
  • helper_filepath - directory containing index mappings and file lists
  • malicious_files_list - a list of malicious files to use, None uses all in the directory
  • benign_files_list - a list of benign files to use, None uses all in the directory
  • load_mapping_from_pickle - indicates whether or not to load a precreated function-to-index mapping file
  • pickle_mapping_file - path to a function-to-index mapping pickle file
  • generate_feature_vector_files - set to True only when running generate_vectors.py
  • use_saved_feature_vectors - whether to use saved vectors or regenerate each time a PE is loaded
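
Since these values live in an INI file, they are read as strings and converted where needed; a minimal configparser sketch of how such a section might be consumed (the [dataset] section name and fallbacks are assumptions, not the repo's exact schema):

import configparser

config = configparser.ConfigParser()
config.read("parameters.ini")

dataset = config["dataset"]                      # assumed section name
malicious_filepath = dataset.get("malicious_filepath")
benign_filepath = dataset.get("benign_filepath")

# Boolean flags need explicit conversion from their string form
load_mapping = dataset.getboolean("load_mapping_from_pickle", fallback=False)
use_saved = dataset.getboolean("use_saved_feature_vectors", fallback=True)

# "None" in the file is just a string; treat it as "use everything"
files_list = dataset.get("malicious_files_list")
if files_list in (None, "None"):
    files_list = None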

3C. parameters.ini General Parameters Explanation

  • is_synthetic_dataset - Generate feature vectors by randomly setting bits with some probability
  • is_cuda - True if GPU enabled, False otherwise
  • use_seed - Whether to seed (for reproducibility)
  • is_losswise - Losswise integration
  • losswise_api_key - API key for Losswise integration
  • training_method - the inner maximizer method used to create examples for training (natural, dfgsm_k, rfgsm_k, bga_k, and bca_k)
  • evasion_method - the inner maximizer method to use when generating adversarial examples in validation or test phase
  • experiment_suffix - name of experiment
  • train_model_from_scratch - if True, training process will take place
  • load_model_weights - if True, no training, pre-trained model loaded instead
  • model_weights_path - path to saved PyTorch model
  • num_workers - number of workers to use for PyTorch Dataloaders
  • model_output_directory - directory to save models in

3D. parameters.ini Hyperparam Parameters Explanation

  • ff_h1, ff_h2, ff_h3 - sizes of the three hidden layers
  • ff_learning_rate - learning rate
  • ff_num_epochs - number of epochs to train and test on
  • evasion_iterations - number of iterations to perform iterative inner maximizer methods
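
The three hidden-layer sizes describe a small feed-forward classifier; a sketch of what such a model could look like in PyTorch (an assumption for illustration, not necessarily the exact architecture in framework.py; the default sizes below are arbitrary):

import torch.nn as nn

class FeedForwardNet(nn.Module):
    def __init__(self, input_dim, ff_h1=300, ff_h2=300, ff_h3=300, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, ff_h1), nn.ReLU(),
            nn.Linear(ff_h1, ff_h2), nn.ReLU(),
            nn.Linear(ff_h2, ff_h3), nn.ReLU(),
            nn.Linear(ff_h3, num_classes),
        )

    def forward(self, x):
        return self.net(x)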

4. Generating All Training Model Combinations

run_experiments.py is a script that runs framework.py with each combination of training and test-time (evasion) inner maximizer methods.

python run_experiments.py

At this point, there should be 5 saved models in trained_models/, each with a different inner maximizer method used for training.
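
Conceptually, the script boils down to a loop over the 5 x 5 method grid; a simplified sketch (the section and key names written into the temporary INI file are assumptions, and the real run_experiments.py may pass each combination differently):

import itertools
import configparser
import subprocess

METHODS = ["natural", "dfgsm_k", "rfgsm_k", "bga_k", "bca_k"]

base = configparser.ConfigParser()
base.read("parameters.ini")

for training_method, evasion_method in itertools.product(METHODS, METHODS):
    # Section/key names follow the parameter list above; this is an assumption
    base["general"]["training_method"] = training_method
    base["general"]["evasion_method"] = evasion_method
    with open("parameters_tmp.ini", "w") as f:
        base.write(f)
    subprocess.run(["python", "framework.py", "parameters_tmp.ini"], check=True)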

5. Collecting Accuracy and Evasion Results

Run the following script to generate .tex files with the results in result_files/:

python utils/collect_results.py [insert_experiment_name_here]

6. Generating Adversarial Vectors

We can use the naturally trained model in combination with each of our evasion methods to generate a set of adversarial vectors per method. Make sure the experiment_name and saved_model_directory parameters are set properly in generate_adversarial_parameters.ini, as well as output_directory_for_adv_vecs, the output location for the adversarial vectors. To generate them, go to the generate_adversarial/ directory and run:

python generate_adversarial.py
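
For background, the evasion methods are gradient-based inner maximizers that flip feature bits of a malicious sample to raise the model's loss. The sketch below captures the general flavor of a k-step FGSM-style attack in binary feature space; it is a loose illustration only, and the repo's implementations of dfgsm_k, rfgsm_k, bga_k, and bca_k differ in their update and rounding rules:

import torch

def fgsm_k_evade(model, x_malicious, k=50, epsilon=0.02):
    """Loose sketch of a k-step gradient-based evasion in binary feature space.
    Keeping the original bits set is one common functionality constraint."""
    x_adv = x_malicious.clone().detach()
    loss_fn = torch.nn.CrossEntropyLoss()
    # Assumes a 2-class model whose class index 1 is "malicious"
    y_malicious = torch.ones(x_adv.shape[0], dtype=torch.long)

    for _ in range(k):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y_malicious)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Step in the direction that increases the loss, then clamp to [0, 1]
        x_adv = (x_adv + epsilon * grad.sign()).clamp(0, 1).detach()

    # Round to binary and keep every bit that was set in the original sample
    return torch.max(x_adv.round(), x_malicious)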

7. Generating Histograms and Loss Progressions (Figures 3 and 4)

To generate loss progressions and histograms, run the following in the loss_graphs/ directory, taking care to ensure that experiment_name is set properly in figure_generation_parameters.ini:

python run_loss_landscape_experiments.py [insert_experiment_name_here]
python run_histogram_experiments.py [insert_experiment_name_here]

The figures will be output to the directories loss_progressions/ and histograms/.

8. Generating 3D Loss Landscapes (Figure 5A and 5C)

There are two options for generating loss landscapes: calculating the loss using only vectors generated with the same inner maximizer used to train the model (Figure 5, Column A), or calculating the loss using all types of adversarial vectors (Figure 5, Column C). This is controlled by the use_all_attack_variants parameter in loss_visual_params.ini. The plot_size and increment parameters in loss_visual_params.ini cause the alpha and beta values for filter-wise normalization to range over a grid from -plot_size to +plot_size in steps of increment.
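
In other words, the loss surface is evaluated on a square grid of (alpha, beta) coefficients; a small sketch of how such a grid could be built (illustrative values only):

import numpy as np

plot_size = 1.0    # example values; the real ones come from loss_visual_params.ini
increment = 0.1

coords = np.arange(-plot_size, plot_size + increment, increment)
alphas, betas = np.meshgrid(coords, coords)

# Each (alpha, beta) pair scales two filter-wise-normalized direction vectors
# added to the trained weights before the loss is re-evaluated at that point.
print(alphas.shape)   # roughly a 21 x 21 grid of (alpha, beta) evaluation points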

To generate loss landscapes for each model type:

python run_loss_visualization_experiments.py [insert_experiment_name_here]

9. Training Self-Organizing Maps and Plotting Decision Map (Figure 5B and 5D)

There are two steps to generating the decision map plots: training the self-organizing map (SOM) and using it to plot the decision map.

Similar to the loss landscape methods, we can train a SOM using either all the adversarial vectors or a single type of adversarial vector. The latter is used for models trained with the same inner maximizer method. The number of vectors of each type, the number of training epochs, and the dimensionality of the SOM are set in the [hyperparam] section of som_parameters.ini. To train a SOM after setting these parameters:

python train_som.py som_parameters.ini

The SOM is saved as a pickle file in som_pickles/.
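
For reference, here is a minimal SOM training sketch using the minisom package (an illustration only; train_som.py may rely on a different SOM implementation, grid size, and input path than the hypothetical ones below):

import pickle
import numpy as np
from minisom import MiniSom

# Rows are binary adversarial feature vectors, e.g. produced in step 6
adversarial_vectors = np.load("adversarial_vectors.npy")   # hypothetical path

som = MiniSom(20, 20, adversarial_vectors.shape[1], sigma=1.0, learning_rate=0.5)
som.train_random(adversarial_vectors, num_iteration=10000)

with open("som_pickles/som.pkl", "wb") as f:
    pickle.dump(som, f)

# Later, each vector can be mapped to its winning grid cell for the decision map
cell = som.winner(adversarial_vectors[0])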

To plot a decision map, set the variables som_pickle_dir and som_pickle_file in som_parameters.ini according to the previous training run. If plot_all_attack_variants is set to True, all types of adversarial vectors will be shown on the decision map (Figure 5, Column D). If it is set to False, only one type will be plotted (Figure 5, Column B); in this case, 5 SOMs, each trained with a single type of adversarial vector, must be provided in place of the TODO in som_filenames. To generate decision maps:

python som_decision_map.py som_parameters.ini


License

MIT License

