dsqmoore / Microsoft-Malware-Classification-Challenge

Beating the benchmark for Microsoft Malware Classification Challenge (BIG 2015)

Microsoft-Malware-Classification-Challenge

Hi Kagglers,

Here is my GitHub repository for the solution that scored 0.1826662 on the leaderboard. The solution is quite simple; the tiresome part is data preparation. It uses only the .byte files to predict the category. It calculates the frequency of each two-character hex code (00 to FF), along with ??, and uses that information for prediction.
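To make the feature idea concrete, here is a minimal sketch of counting hex codes in one file. It is not the code from `data_consolidation.py`. It assumes Python 3 and that each line of a .byte file starts with an address token followed by space-separated hex codes; the names `TOKENS`, `count_tokens` and `features_from_gz` are made up for illustration.

```python
from collections import Counter
import gzip

# All 256 hex codes plus the '??' placeholder, in a fixed column order.
TOKENS = ["{:02X}".format(i) for i in range(256)] + ["??"]


def count_tokens(lines):
    """Return a 257-element list of token counts for an iterable of text lines.

    Assumption: the first token on each line is an address and is skipped.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        counts.update(tok.upper() for tok in parts[1:])
    return [counts[tok] for tok in TOKENS]


def features_from_gz(path):
    """Hypothetical helper: count tokens in one gzipped .byte file."""
    with gzip.open(path, "rt") as handle:
        return count_tokens(handle)
```

For example, `count_tokens(["00401000 56 8D ?? 00"])` gives a vector with a 1 in the columns for `00`, `56`, `8D` and `??`. These are raw counts; dividing by the total would turn them into relative frequencies if a model needs that.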

Before using these files you have to complete these steps:

  1. Extract the .byte files from the train and test 7z archives.
  2. Gzip the .byte files to .byte.gz format and move them to the train_gz / test_gz directories.
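If you would rather script the second step than do it by hand, something like the sketch below might work. It is not part of the repository: `gzip_byte_files` is a hypothetical helper, and the `suffix` default simply follows the naming used above, so adjust it to match the files you actually extracted. Extracting the 7z archives (step 1) is not covered here.

```python
import gzip
import os
import shutil


def gzip_byte_files(src_dir, dst_dir, suffix=".byte"):
    """Gzip every file in src_dir ending in suffix into dst_dir.

    Returns the number of files written. The originals are left in place.
    """
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    written = 0
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith(suffix):
            continue
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name + ".gz")
        # Stream the copy so large files are never held in memory at once.
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written += 1
    return written
```

Calling `gzip_byte_files("train", "train_gz")` and `gzip_byte_files("test", "test_gz")` (the source directory names are assumptions) returns the number of files compressed, which you can compare against the expected file counts.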

I know these two steps take a hell of a lot of time; for me it was 6 hours. :)

Once you have 10868 train files and 10873 test files in gz format, run the following commands:

python data_consolidation.py

python solution.py

Use it, tune it and score as low as you can.

These scripts should run with both Python 2 and Python 3. Let me know if you face any problems.

