README


What is this repository for?

Malware Detection for Executables
Abstract: Malware is one of the most serious security threats on the Internet today. The number of new malware samples has increased explosively: anti-malware vendors are now confronted with millions of potential malware samples per year. Consequently, many studies have reported on using data mining and machine learning techniques to develop intelligent malware detection systems. Many of these works train a classification model using different features and different data sets. Although they show high accuracy on their own test data, most of the models rapidly become outdated as malware continues to evolve, and they do not work well against obfuscation or polymorphism techniques. In this work, we propose an effective malware detection approach using data mining techniques based on opcodes, data structures, and imported libraries. We use several classifiers and conduct experiments to evaluate our approach. In addition, we provide empirical validation that our method is capable of detecting new, unknown malware, including fresh malware collected in 2017. We also apply obfuscation to malware samples to test our model.

How do I get set up?

(Required) Python 3 (Python 3.5.2 is suggested).
https://www.python.org/downloads/
(Required) Python packages:
http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy
NumPy (e.g., numpy-1.13.1+mkl-cp35-cp35m-win_amd64.whl)
SciPy (e.g., scipy-0.19.1-cp35-cp35m-win_amd64.whl)
scikit-learn (e.g., scikit_learn-0.18.2-cp35-cp35m-win_amd64.whl)
Matplotlib (e.g., matplotlib-1.5.3-cp35-cp35m-win_amd64.whl)
(Optional) IDA Pro
http://pan.baidu.com/s/1bp7rOpp
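
Before running anything, it can help to confirm that the interpreter and the packages listed above are actually available. The following is a minimal sketch, not part of this repository; the helper names are hypothetical, and it only checks that the modules can be found, not their versions.

```python
import importlib.util
import sys

# Import names for the packages listed above. Note that scikit-learn is
# imported under the name "sklearn".
REQUIRED_MODULES = ["numpy", "scipy", "sklearn", "matplotlib"]


def missing_modules(names):
    """Return the names in `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info):
    """Check the 'Python 3' requirement stated above."""
    return version_info[0] == 3


if __name__ == "__main__":
    print("Python 3:", python_is_supported())
    print("Missing packages:", missing_modules(REQUIRED_MODULES) or "none")
```

IDA Pro is optional and is not checked here.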

How do I get the datasets?

Microsoft Malware Classification Challenge
https://www.kaggle.com/c/malware-classification
theZoo aka Malware DB
http://ytisf.github.io/theZoo/
DAS MALWERK
http://dasmalwerk.eu/

How do I run the program?

Put the asm files of malware samples into "Malware_ML_Set\test" and the asm files of benign samples into "Benign_ML_set\test". The system labels them automatically.

If you do not know whether a file is benign, that is fine. The two folders only assign the labels that are used to compute the statistics automatically.

(The asm files must be UTF-8 encoded.)
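
The folder-based labeling and the UTF-8 requirement can be illustrated with a short sketch. This is not the repository's own code: the function names, the 1/0 label values, and the ".asm" file extension are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical label encoding; the repository may use different values.
MALWARE_LABEL = 1
BENIGN_LABEL = 0


def is_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def collect_samples(malware_dir, benign_dir):
    """Label each *.asm file by the folder it sits in.

    Returns (labeled, rejected): `labeled` is a list of (file name, label)
    pairs, and `rejected` lists the files that are not valid UTF-8.
    """
    labeled, rejected = [], []
    for folder, label in ((malware_dir, MALWARE_LABEL), (benign_dir, BENIGN_LABEL)):
        for path in sorted(Path(folder).glob("*.asm")):
            if is_utf8(path):
                labeled.append((path.name, label))
            else:
                rejected.append(path.name)
    return labeled, rejected
```

For example, `collect_samples(Path("Malware_ML_Set") / "test", Path("Benign_ML_set") / "test")` would list the files that are ready for detection and the ones that need to be re-encoded first.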

After you have finished the steps above:

(Optional) Creating_BM_trainingData.py is used to do the preprocessing.
(Optional) Creating_BM_trainingModel.py can be used to train the model.
Creating_BM_submission.py is used to detect the malware.
MainFile.py can be used to see the figure.
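
The scripts can also be chained from one small driver. The sketch below is hypothetical: it assumes the scripts are run from the repository root in the order listed above and take no command-line arguments, neither of which is confirmed by this README.

```python
import subprocess
import sys

# Script names come from the list above. The run order and the absence of
# command-line arguments are assumptions.
PREPARATION_STEPS = ["Creating_BM_trainingData.py", "Creating_BM_trainingModel.py"]
DETECTION_STEP = "Creating_BM_submission.py"
FIGURE_STEP = "MainFile.py"


def build_commands(retrain=False, show_figure=False, python=sys.executable):
    """Return the commands to run, skipping the optional steps by default."""
    steps = list(PREPARATION_STEPS) if retrain else []
    steps.append(DETECTION_STEP)
    if show_figure:
        steps.append(FIGURE_STEP)
    return [[python, step] for step in steps]


def run_pipeline(retrain=False, show_figure=False):
    """Run each step in turn and stop at the first one that fails."""
    for command in build_commands(retrain, show_figure):
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` would run only the detection step; `run_pipeline(retrain=True, show_figure=True)` would run all four scripts.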

Introduction

Malware, or malicious software, is a generic term that encompasses viruses, trojans, spyware, and other intrusive code. Malware spreads all over the world through the Internet and is increasing day by day, thus becoming a serious threat. According to a recent report from McAfee [1], one of the world's leading independent cybersecurity companies, more than 650 million malware samples were detected in Q1 2017, of which more than 30 million were new. The detection of malware is therefore of major concern to both the anti-malware industry and researchers.

To protect legitimate users from these threats, anti-malware software products from companies such as Comodo, McAfee, Kaspersky, Kingsoft, and Symantec provide the major defence against malware, and they employ the signature-based method. However, this method can be easily evaded by malware writers through evasion techniques such as packing, variable renaming, and polymorphism [2]. To overcome the limitation of the signature-based method, heuristic-based approaches have been proposed, aiming to identify malicious behaviour patterns through either static analysis or dynamic analysis. But the increasing number of malware samples makes this method no longer effective. Recently, various machine learning approaches such as Support Vector Machine, Decision Tree, and Naive Bayes have been proposed for detecting malware [3]. These techniques rely on data sets that include several characteristic features of both malware and benign software to build classification models that detect (unknown) malware. Although these approaches can achieve high accuracy (on stationary data sets), this is still not enough for malware detection. On the one hand, most of them focus on behaviour features such as binary codes [4-6], opcodes [6-8], and API calls [9-11], leaving the data information out of consideration. While a few of them do consider the data information, they consider only simple features such as strings [12, 13] and file relations [14, 15]. On the other hand, as malware continues to evolve, some new and unseen malware exhibits different behaviours and features. Obfuscation techniques can also make malware difficult to detect. Hence, more data sets and experiments are still needed to keep the detection effective.

In this paper, we propose an effective approach to detecting malware based on machine learning. Unlike most existing work, we take into account not only the behaviour information but also the data information. Generally, the behaviour information reflects how the software intends to behave, while the data information indicates which data the software intends to operate on or how the data are organised. Our approach first learns a classifier from existing executables with known categories, and then uses this classifier to detect new, unseen executables. In detail, we take the opcodes, data types, and system libraries used in executables, which are collected through static analysis, as representative features. As far as we know, our approach is the first to consider data types as features for malware detection. Moreover, in our implementation, we employ various machine learning methods, such as K-Nearest Neighbor, Naive Bayes, Decision Tree, Random Forest, and Support Vector Machine, to train our classifier.
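
To make the opcode feature concrete, here is a minimal sketch of turning an asm listing into a fixed-length frequency vector. It is an illustration under assumptions, not the feature extractor used in this project: the opcode vocabulary is a small hypothetical subset, the first known mnemonic on each line is taken as that line's opcode, and the data type and imported library features are not covered.

```python
import re
from collections import Counter

# A small illustrative opcode vocabulary; a real feature set would be larger.
OPCODES = ("mov", "push", "pop", "call", "jmp", "add", "sub", "xor", "cmp", "ret")


def extract_opcodes(asm_text):
    """Return the opcode sequence found in `asm_text`, at most one per line."""
    sequence = []
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0]  # drop a trailing comment, if any
        for token in re.split(r"[\s,]+", code.strip().lower()):
            if token in OPCODES:  # exact match, so "sub_401000" is not "sub"
                sequence.append(token)
                break
    return sequence


def opcode_frequencies(asm_text, vocabulary=OPCODES):
    """Return the relative frequency of each vocabulary opcode, in order."""
    counts = Counter(extract_opcodes(asm_text))
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[op] / total for op in vocabulary]
```

Vectors of this kind, one per executable, are the sort of input that scikit-learn classifiers such as `RandomForestClassifier` accept in their `fit` method.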

Several experiments were conducted to evaluate our approach. Firstly, we conducted 10-fold cross-validation experiments to see how well the various machine learning methods perform. The experimental results show that the classifier trained by Random Forest performs best, with an accuracy of 0.9788 and an AUC of 0.9959. Secondly, we conducted experiments to illustrate that all the features are effective for malware detection. The results also show that in some cases using type information is better than using the other two. Thirdly, to test our approach's ability to detect genuinely new malware or new malware versions, we ran a time-split experiment: we used our classifier to detect malware samples that are newer than the ones in our data set. Our classifier detects 81% of the fresh samples, which indicates that it is capable of detecting some fresh malware. The results also suggest that malware classifiers should be updated often with new data or new features in order to maintain classification accuracy. Finally, one reason malware detection is difficult is that malware writers can use obfuscation techniques to evade detection. We therefore performed experiments to test our approach's ability to detect new malware samples obtained by obfuscating existing ones with obfuscation tools. All the obfuscated malware samples are detected by our classifier, demonstrating that it is resistant to some obfuscation techniques.
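
For readers unfamiliar with the evaluation terms, the sketch below shows how 10-fold splits, accuracy, and AUC can be computed in plain Python. It is an illustration only and does not reproduce the experiments or the figures reported above; the function names are hypothetical, labels are assumed to be 1 for malware and 0 for benign, and in practice the samples would be shuffled before splitting.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous, nearly equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve: the probability that a random positive
    sample scores higher than a random negative one (ties count as half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

scikit-learn provides the same metrics as `sklearn.metrics.accuracy_score` and `sklearn.metrics.roc_auc_score`, which would be the natural choice in a project that already depends on it.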

Who do I talk to?

wcventure@126.com

wcventure / PC-Malware-Sklearner