
Defeating-Digital-Threats---Microsoft-Dataset-Malware-Classification

A comprehensive case study of the Microsoft malware classification dataset

The notebook is too long to render on GitHub (download the zip and run it locally). Here are the key learnings from the case study:

Problem Description

The goal is to prevent malware attacks on a computer system by identifying whether a given file or piece of software is malware. Identifying malware files is crucial for the security of the system.

Data Source: https://www.kaggle.com/c/malware-classification/data

Objectives

Predict the class of malware (from the 9 labelled classes) for a given file

Constraints

  1. Minimize the multi-class error
  2. Provide multi-class probability estimates
  3. Fast processing and labelling of malware (within minutes)

Performance metrics

  1. Multi-class Log loss
  2. Confusion matrix (a sketch of both metrics follows)
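
As a quick illustration, both metrics are available in scikit-learn. A minimal sketch, where the labels and probabilities below are purely hypothetical:

```python
import numpy as np
from sklearn.metrics import log_loss, confusion_matrix

y_true = np.array([1, 2, 1, 9, 3])                # hypothetical labels (1..9)
y_prob = np.random.dirichlet(np.ones(9), size=5)  # hypothetical class probabilities

# Multi-class log loss heavily penalizes confident wrong predictions.
print(log_loss(y_true, y_prob, labels=list(range(1, 10))))

# The confusion matrix compares hard predictions against the true labels.
y_pred = y_prob.argmax(axis=1) + 1                # column index -> label 1..9
print(confusion_matrix(y_true, y_pred, labels=list(range(1, 10))))
```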

Train Test Split of Data

The dataset is randomly split into training, cross-validation, and test sets with 64%, 16%, and 20% of the data respectively.
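
A minimal sketch of this split using scikit-learn; the placeholder data below is purely illustrative (in the case study, X would hold the extracted file features and y the malware classes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data; y encodes the 9 malware classes as 0..8.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 9, size=1000)

# First hold out 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# 20% of the remaining 80% is 16% of the full data -> cross-validation set,
# which leaves 64% for training.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=42)
```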

EDA

Here are plots and insights from some of the most notable EDA results.

(Plot: number of data points per malware class)

Here are a few observations from the above plot:

  • Labels 1, 2, and 3 are the most recurring malware classes
  • Labels 8 and 9 come next
  • Labels 4, 5, and 7 are the least recurring, with comparatively few data points

(Plot: byte-file size distribution for each malware class)

The above plot suggests that the size of the byte file might be useful in classifying the type of malware.

Using t-SNE for dimensionality reduction, we check whether the classes can be separated in a 2-D scatter plot.

(Plot: 2-D t-SNE visualization of the data, coloured by class)

With a perplexity value of 50 (roughly the number of neighborhood relationships preserved per point), the plot does not show distinct boundaries between the different classes.
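
A minimal sketch of this projection with scikit-learn's TSNE; the feature matrix below is a placeholder, not the actual byte-file features:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X = np.random.rand(500, 50)        # hypothetical features
y = np.random.randint(0, 9, 500)   # hypothetical class labels

# Perplexity controls roughly how many neighbors each point "attends" to.
X_2d = TSNE(n_components=2, perplexity=50, random_state=42).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=8)
plt.colorbar(label='malware class')
plt.title('t-SNE projection (perplexity = 50)')
plt.show()
```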

Machine learning models

The performance of the different machine learning models is measured with the precision matrix. The matrix plotted for each trained model is shown below.

KNN Classification

Trained a KNN classifier and obtained the optimal k = 3 with calibrated CV (CalibratedClassifierCV). The precision matrix for predictions on the test set is shown below:

(Precision matrix: KNN classifier, test set)

Log loss for classification on test set: 0.089
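
A minimal sketch of this step, reusing the split from the earlier sketch; k = 3 matches the reported optimum, while the sigmoid calibration method is an assumption:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

# Wrap KNN in CalibratedClassifierCV so predict_proba returns calibrated
# multi-class probabilities (needed for a meaningful log loss).
knn = KNeighborsClassifier(n_neighbors=3)
clf = CalibratedClassifierCV(knn, method='sigmoid')
clf.fit(X_train, y_train)

print('test log loss:', log_loss(y_test, clf.predict_proba(X_test)))
```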

Logistic Regression

Trained a logistic regression classifier with an L2 penalty for regularization and sigmoid calibration via calibrated CV. The precision matrix for predictions on the test set is shown below.

(Precision matrix: logistic regression, test set)

Log loss for classification on test set: 0.415
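
A corresponding sketch for this model; here "sigmoid" is interpreted as the calibration method of CalibratedClassifierCV rather than an activation function:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

# L2 is scikit-learn's default penalty for LogisticRegression.
lr = LogisticRegression(penalty='l2', max_iter=1000)
clf = CalibratedClassifierCV(lr, method='sigmoid')
clf.fit(X_train, y_train)

print('test log loss:', log_loss(y_test, clf.predict_proba(X_test)))
```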

Random Forest Classification

Trained a random forest model with the number of estimators as the tuned hyperparameter, with calibrated CV. The best value was found at 1000 estimators. The precision matrix for predictions on the test set is shown below.

(Precision matrix: random forest, test set)

Log loss for classification on test set: 0.0503
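
A sketch of the random-forest step with the reported best value of 1000 estimators; the calibration method is again an assumption:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

rf = RandomForestClassifier(n_estimators=1000, random_state=42, n_jobs=-1)
clf = CalibratedClassifierCV(rf, method='sigmoid')
clf.fit(X_train, y_train)

print('test log loss:', log_loss(y_test, clf.predict_proba(X_test)))
```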

XGB Model

Trained an XGBoost classifier and performed RandomizedSearchCV over the parameters learning_rate, n_estimators, max_depth, colsample_bytree, subsample, etc. The precision matrix for predictions on the test set is shown below.

(Precision matrix: XGBoost, test set)

Log loss for classification on test set: 0.032

The best performance is observed for the XGB model with the following hyperparameters (a sketch of the search follows the list):

  • n_estimators = 1000
  • max_depth = 3
  • learning_rate = 0.03
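
A sketch of the randomized search; the parameter grids below are illustrative, chosen so that each one contains the reported best value:

```python
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import log_loss

# Illustrative search space; not the exact grids from the notebook.
param_dist = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.03, 0.1],
    'colsample_bytree': [0.5, 0.8, 1.0],
    'subsample': [0.5, 0.8, 1.0],
}
search = RandomizedSearchCV(XGBClassifier(), param_dist, n_iter=10,
                            scoring='neg_log_loss', cv=3, random_state=42)
search.fit(X_train, y_train)

print('best params:', search.best_params_)
print('test log loss:', log_loss(y_test, search.predict_proba(X_test)))
```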
