VeryLazyBoy / Malware-Detection

CS5242 Project

Malware-Detection

Kaggle project. The task is to detect malware based on features extracted from API calls. More information is available on the Kaggle website.

Requirements

  • Python 3.7
  • PyTorch 10.1
  • Packages listed in requirements.txt
  • GPU with at least 1 GB of memory available (recommended)
  • Train and test data downloaded from Kaggle

Project Structure

│
├───test                       # Downloaded from Kaggle
│    ├───0.npy
│    ├───...
│    └───6050.npy
├───train                      # Downloaded from Kaggle
│    ├───0.npy
│    ├───...
│    └───18661.npy
├───train_kaggle.csv           # Downloaded from Kaggle
│
├───train.py                   # Epoch training
├───test.py                    # Generates solution.csv for submission
├───model.py                   # Model
├───run.py                     # Starts training
└───dataset.py                 # Provides data in batches
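
As an illustration of how the numbered .npy files in the layout above could be read in batches, here is a minimal sketch using only NumPy. The function name iter_batches is hypothetical and is not taken from dataset.py; the sketch also assumes that every file holds an array of the same shape, which the source does not state.

```python
import os

import numpy as np


def iter_batches(folder, ids, batch_size=50):
    """Yield (chunk_ids, batch) pairs from files named '<id>.npy' in `folder`.

    Assumption: all per-sample arrays share one shape, so they can be stacked.
    """
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        arrays = [np.load(os.path.join(folder, f"{i}.npy")) for i in chunk]
        # New leading axis is the batch dimension.
        yield chunk, np.stack(arrays)
```

With the train folder above, a call might look like iter_batches("train", list(range(18662))), assuming the file ids run contiguously from 0 to 18661.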

Model

This project uses the model described in the paper Dynamic Malware Analysis with Feature Engineering and Feature Learning. The model structure is shown below:

  • Input: N×C×L tensor, where N is the batch size, C is the feature size (102), and L is the maximum sequence length (1000).
    • batchSize: 50
  • Batch Normalization: Speeds up convergence.
  • Gated CNN: Extracts useful features from the raw input.
    • gated_cnn_outputs: 128
    • gated_cnn_stride1: 1
    • gated_cnn_stride2: 1
    • gated_cnn_kernel1: 2
    • gated_cnn_kernel2: 3
  • BiLSTM: The input features have sequential patterns, so a bidirectional LSTM is used to capture both past and future context.
    • lstm_layers: 1
    • lstm_neurons: 100
  • MaxPool1D: Extracts the most important features from the hidden states generated by BiLSTM.
  • Dense: Reduces the dimension of feature space.
    • fc_outputs: 64
  • Dropout: Reduces overfitting.
    • dropout: 0.5
  • Sigmoid: Generates probabilities for binary classification.
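
The layer list above can be sketched in PyTorch roughly as follows. This is not the contents of model.py. The class names GatedConv1d and MalwareNet are hypothetical, and several details are assumptions not stated in the list: the two gated convolutions run in parallel and are concatenated, inputs are right-padded so both branches keep length L, a ReLU follows the dense layer, and a final linear layer maps the 64 features to a single logit before the sigmoid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedConv1d(nn.Module):
    """Conv1d output multiplied by a sigmoid gate computed by a second Conv1d."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, stride)
        self.gate = nn.Conv1d(in_channels, out_channels, kernel_size, stride)

    def forward(self, x):
        # Assumption: right-pad so that, with stride 1, the output length equals L.
        x = F.pad(x, (0, self.kernel_size - 1))
        return self.conv(x) * torch.sigmoid(self.gate(x))


class MalwareNet(nn.Module):
    def __init__(self, features=102, cnn_out=128, lstm_hidden=100,
                 fc_out=64, dropout=0.5):
        super().__init__()
        self.bn = nn.BatchNorm1d(features)
        self.gcnn1 = GatedConv1d(features, cnn_out, kernel_size=2)
        self.gcnn2 = GatedConv1d(features, cnn_out, kernel_size=3)
        self.lstm = nn.LSTM(2 * cnn_out, lstm_hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * lstm_hidden, fc_out)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(fc_out, 1)  # assumption: single-logit output layer

    def forward(self, x):                                      # x: (N, C, L)
        x = self.bn(x)
        x = torch.cat([self.gcnn1(x), self.gcnn2(x)], dim=1)   # (N, 2*cnn_out, L)
        x, _ = self.lstm(x.transpose(1, 2))                    # (N, L, 2*lstm_hidden)
        x = x.max(dim=1)[0]                                    # max over time steps
        x = self.dropout(F.relu(self.fc(x)))                   # (N, fc_out)
        return torch.sigmoid(self.out(x)).squeeze(1)           # (N,) probabilities


# Shape check on a small random batch (L shortened to 50 to keep it fast).
model = MalwareNet().eval()
with torch.no_grad():
    probs = model(torch.randn(4, 102, 50))
```

Here probs has one value per sample, each between 0 and 1, which is the form a binary cross-entropy loss expects.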

Exp logs

Exp          Description
1573179669   seed: 28, 90% train, 10% validation, pc
1573200428   seed: 29, 95% train, 5% validation, pc
1573204629   seed: 29, 95% train, 5% validation, server
1573983562   pc, batch 50
1574035600   server, batch 25
1574035703   server, batch 100
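
The seeded train/validation splits listed above could be reproduced along these lines. This is only a sketch: the function split_ids is hypothetical, the shuffling method actually used in the repository is not documented here, and the sample count of 18662 assumes the training files run contiguously from 0.npy to 18661.npy.

```python
import random


def split_ids(n_samples, val_fraction, seed):
    """Shuffle sample ids with a fixed seed and split off a validation set."""
    ids = list(range(n_samples))
    random.Random(seed).shuffle(ids)  # local RNG, so global state is untouched
    n_val = int(round(n_samples * val_fraction))
    return ids[n_val:], ids[:n_val]   # (train_ids, val_ids)


# Example: 90% train, 10% validation with seed 28.
train_ids, val_ids = split_ids(18662, 0.10, seed=28)
```

Fixing the seed makes the split repeatable across runs on the same Python version, which is what allows experiments on different machines to be compared.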

Train

python run.py   # all hyperparameters can be set inside run.py
