anshu0612 / api_malware_classification

Designed ensembled Seq2Seq models using Keras to detect malware in a sequence of API calls, and achieved a top position on Kaggle

CS5242 Final Project on Dynamic Malware Analysis

Kaggle Competition

The task of this project is to detect malware based on features extracted from sequences of API calls.

The solution achieved an AUC score of 99.18% on Kaggle's private leaderboard.
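For readers unfamiliar with the metric: AUC (area under the ROC curve) can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The sketch below computes it directly from that definition. The function name `auc_score`, the label encoding (1 = malware, 0 = benign), and the sample values are assumptions made for illustration; they are not taken from the project or the competition data.

```python
def auc_score(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(P * N), which is fine
    for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up labels (assumed: 1 = malware, 0 = benign) and model scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.4, 0.6, 0.8]
example_auc = auc_score(labels, scores)  # 5 of 6 pairs are ordered correctly
```

In practice a library routine would normally be used instead of this quadratic loop; the sketch only makes the definition concrete.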

How to train the model and get the test predictions

  1. Download the dataset from Kaggle and keep the extracted data in the project root directory
  2. Install the required Python packages with pip
  3. Run the file kfold_ensemble.py with the command python kfold_ensemble.py
  4. After training and prediction, the output is written to the file output.csv
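The script name suggests a k-fold ensemble: one model is trained per fold and the test predictions are averaged. The following is a minimal sketch of that general idea, not the project's actual code. The `fit` callable, the toy `fit_mean` stand-in for a Keras model, and the CSV column names `Id` and `Predicted` are all hypothetical.

```python
import csv
import random


def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def kfold_ensemble_predict(train_x, train_y, test_x, fit, k=5):
    """Train one model per fold and average their test predictions.

    `fit(x, y)` is a hypothetical callable returning a function that maps
    a list of inputs to a list of scores.
    """
    totals = [0.0] * len(test_x)
    for held_out in kfold_indices(len(train_x), k):
        held = set(held_out)
        # Train on every sample that is not in the held-out fold.
        x = [train_x[i] for i in range(len(train_x)) if i not in held]
        y = [train_y[i] for i in range(len(train_y)) if i not in held]
        predict = fit(x, y)
        for j, score in enumerate(predict(test_x)):
            totals[j] += score
    return [t / k for t in totals]


def fit_mean(x, y):
    """Toy stand-in for a real model: always predicts the training label mean."""
    mean = sum(y) / len(y)
    return lambda inputs: [mean for _ in inputs]


def write_predictions(path, scores):
    """Write averaged scores as CSV; the column names are assumptions."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Predicted"])
        for i, score in enumerate(scores):
            writer.writerow([i, score])
```

A call such as `write_predictions("output.csv", kfold_ensemble_predict(train_x, train_y, test_x, fit_mean))` would mirror steps 3 and 4, with a real model in place of `fit_mean`.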

Downloading dataset from Kaggle

The easiest way to interact with Kaggle's datasets is via the Kaggle command-line tool (CLI). Below are the steps to set up the Kaggle CLI and use it to download the dataset.

The Setup

  1. Install the Kaggle CLI. The Kaggle CLI requires Python. Open a terminal and type the command pip install kaggle
  2. API credentials. Once Kaggle is installed, type kaggle to check the installation. The output will be similar to this:

IMAGE

The output above shows the path (highlighted) where the kaggle.json file should be placed. To get the kaggle.json file, go to: https://www.kaggle.com//account

In the API section, click Create New API Token, then copy the downloaded file to the path mentioned in the terminal output.

IMAGE

Type kaggle once again to check that the credentials are detected. IMAGE

In some cases the credentials will not work even though the file is placed in the correct location, because the file has incorrect permissions. Just type the exact command and it will start working.
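The exact command is not reproduced in this text. One common cause of this symptom on Linux and macOS is a token file that other users can read, and the usual remedy is to restrict it to the owner (the equivalent of `chmod 600`). The sketch below copies the token and applies that permission, under that assumption. The function name `install_kaggle_token` is hypothetical, and the default `~/.kaggle` directory should be replaced with whatever path the terminal output shows if it differs.

```python
import os
import shutil
import stat


def install_kaggle_token(token_path, config_dir=None):
    """Copy kaggle.json into the config directory and restrict its permissions."""
    if config_dir is None:
        # Assumed default; use the path printed by the `kaggle` command if it differs.
        config_dir = os.path.join(os.path.expanduser("~"), ".kaggle")
    os.makedirs(config_dir, exist_ok=True)
    destination = os.path.join(config_dir, "kaggle.json")
    shutil.copyfile(token_path, destination)
    # Owner read/write only, equivalent to `chmod 600` on POSIX systems.
    os.chmod(destination, stat.S_IRUSR | stat.S_IWUSR)
    return destination
```

For example, `install_kaggle_token("Downloads/kaggle.json")` would place the token in the assumed default location. On Windows, `os.chmod` only toggles the read-only flag, so the permission step has little effect there.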

Downloading Dataset via CLI

We can open the Kaggle help via kaggle -h. For help on downloading competition data, we can type kaggle competitions download -h. Whatever the Kaggle CLI command is, add -h to get help.

Download Entire Dataset

To download the dataset, go to the Data subtab on the competition page. In the API section, we will find the exact command that we can copy to the terminal to download the entire dataset.

IMAGE

The syntax is kaggle competitions download <competition name>. Once the dataset is downloaded, extract it and use it.
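The download and extraction can also be scripted from Python. The sketch below wraps the CLI call with `subprocess` and unpacks the archive with `zipfile`. The function names are hypothetical, and it assumes the CLI saves the data as `<competition name>.zip` in the target directory; check the downloaded file name before relying on that.

```python
import os
import subprocess
import zipfile


def build_download_command(competition, dest_dir="."):
    """Build the CLI call; -p sets the download directory."""
    return ["kaggle", "competitions", "download", competition, "-p", dest_dir]


def extract_archive(archive_path, dest_dir="."):
    """Extract a zip archive and return the names of its members."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


def download_and_extract(competition, dest_dir="."):
    """Download the competition data, then unpack it into dest_dir."""
    subprocess.run(build_download_command(competition, dest_dir), check=True)
    # Assumption: the CLI saves the data as <competition>.zip in dest_dir.
    return extract_archive(os.path.join(dest_dir, competition + ".zip"), dest_dir)
```

Calling `download_and_extract("<competition name>")` from the project root would leave the extracted files where the training script expects them, provided the Kaggle CLI is installed and the credentials are in place.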
