Behaviour-based Android Malware Detection

Introduction

We have 100 application samples: half are tagged malicious and the other half benign. For each application we have 1000 data samples, collected at 1-millisecond intervals over a duration of 1 second. In total there are 32 features describing the behaviour of the application, e.g., RAM usage, cache usage, etc.

Data Preprocessing

To prepare for algorithm development, we first run the data through some cleanup routines.

Feature Cleanup

Out of 32 features, 11 have nearly zero variance[fn:1], i.e., less than \(10^{-6}\). These features are removed since they carry no information for our classification.
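The variance filter described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code used in this project; the function name `drop_low_variance` and the toy data are hypothetical, and only the cutoff comes from the text.

```python
from statistics import pvariance

VARIANCE_THRESHOLD = 1e-6  # cutoff quoted in the text


def drop_low_variance(rows, threshold=VARIANCE_THRESHOLD):
    """Drop feature columns whose population variance is below `threshold`.

    `rows` is a list of samples, each a list of feature values.
    Returns (filtered_rows, kept_column_indices).
    Hypothetical helper, written for illustration only.
    """
    n_features = len(rows[0])
    kept = [
        j for j in range(n_features)
        if pvariance([row[j] for row in rows]) >= threshold
    ]
    filtered = [[row[j] for j in kept] for row in rows]
    return filtered, kept


# Toy data (illustrative, not from the real data set):
# column 0 is constant, columns 1 and 2 vary.
rows = [
    [0.5, 1.0, 3.0],
    [0.5, 2.0, 1.0],
    [0.5, 3.0, 2.0],
]
filtered, kept = drop_low_variance(rows)
print(kept)  # indices of the columns that survive the filter
```

Returning the kept column indices alongside the filtered rows makes it possible to apply the same selection to data loaded later.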

Figure fig:tsne_full_feature shows the t-SNE distribution of all our data samples without feature selection.

Different t-SNE parameters produce slightly different distributions. However, across all the parameter combinations I tried, I did not find any clear clusters. Most of the time the two categories are mingled together. Possible reasons for this are

<lst:p1>Sample size is too limited.
Feature set is too large, relative to the sample size.
<lst:p2>Features do not provide enough support for classification.

Problems lst:p1 and lst:p2 are out of my control. However, we may try reducing the feature set to see if any clear patterns can be spotted.

Data Normalization

All features are numerical but span different ranges, so we normalize them to a similar magnitude.
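The text does not say which normalization is used. As one possibility, the sketch below applies min-max scaling, which maps every feature column to the range [0, 1]. The choice of min-max scaling, the function name `min_max_scale`, and the toy data are all assumptions made for illustration.

```python
def min_max_scale(rows):
    """Scale each feature column to [0, 1].

    Min-max scaling is an assumed choice; the text only says the
    features are brought to a similar magnitude.
    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    n_features = len(rows[0])
    lows = [min(row[j] for row in rows) for j in range(n_features)]
    highs = [max(row[j] for row in rows) for j in range(n_features)]
    scaled = []
    for row in rows:
        scaled.append([
            (row[j] - lows[j]) / (highs[j] - lows[j])
            if highs[j] > lows[j] else 0.0
            for j in range(n_features)
        ])
    return scaled


# Toy data (illustrative): two features on very different scales.
rows = [
    [10.0, 0.25],
    [20.0, 0.75],
    [30.0, 0.50],
]
scaled = min_max_scale(rows)
print(scaled)
```

When this is combined with a train/test split, a common practice is to compute the minimum and maximum on the training samples only and reuse them for the test samples.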

Each data object is measured for 1 second at 1-millisecond intervals, so each data object actually contains “time series” data. However, a quick analysis of the “time series” data shows that, within each data object, all the features have almost zero variance over time. An intuitive explanation for this is that the sampling rate is too high. Instead of a “time series” for each data object, we effectively have only one sample per data object. In what follows, we use “data sample” to denote the data for each tagged application.
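One way to check the near-zero variance over time and reduce each data object to a single sample is sketched below. Taking the per-feature mean over the readings is an assumption, since the text does not state how the single sample is obtained; `collapse_time_series` and the toy readings are hypothetical.

```python
from statistics import fmean, pvariance


def collapse_time_series(readings):
    """Collapse one data object's readings into a single feature vector.

    `readings` is a list of time steps, each a list of feature values.
    Returns (per_feature_means, largest_per_feature_variance).
    Averaging over time is an assumed choice, for illustration only.
    """
    n_features = len(readings[0])
    columns = [[step[j] for step in readings] for j in range(n_features)]
    means = [fmean(col) for col in columns]
    max_var = max(pvariance(col) for col in columns)
    return means, max_var


# Toy data (illustrative): four readings of two features that never change.
readings = [[2.0, 8.0] for _ in range(4)]
means, max_var = collapse_time_series(readings)
print(means, max_var)
```

If the largest variance over time is close to zero for every data object, the mean carries essentially the same information as any single reading.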

Randomized Test

After the previous two steps, we have a standard classification problem with 100 data samples in total: half are tagged 0 (benign software) and the other half tagged 1 (malicious software). We use 50% as the training set. For each test, we randomly draw 25 samples from each group and test them on our classifiers.
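The random draw of 25 samples per group can be sketched as a stratified split over sample indices. This is a minimal sketch using the standard library; the function name `stratified_split` and the `seed` parameter are hypothetical, and the function simply returns both halves without deciding which one serves as the training set.

```python
import random


def stratified_split(labels, per_class=25, seed=None):
    """Randomly draw `per_class` indices from each class.

    Returns (drawn_indices, remaining_indices).
    Hypothetical helper; `seed` is only there to make runs repeatable.
    """
    rng = random.Random(seed)
    drawn = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        drawn.extend(rng.sample(idx, per_class))
    drawn_set = set(drawn)
    rest = [i for i in range(len(labels)) if i not in drawn_set]
    return sorted(drawn), rest


# 50 benign (0) and 50 malicious (1) labels, as described in the text.
labels = [0] * 50 + [1] * 50
drawn, rest = stratified_split(labels, seed=0)
print(len(drawn), len(rest))
```

Drawing the same number of samples from each class keeps both halves balanced, so per-class results are directly comparable.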

Experiment

We summarize the performance for each classifier.

SVM

Table tab:svm_all_feature_result summarizes the test results for the classifier based on Support Vector Machine (SVM). These are the averaged results over 100 test runs.

	Benware	Malware

The interpretation for Table tab:svm_all_feature_result is as follows.

Notice that half of the test set are tagged malware, the other half benign.
For both categories, the accuracy is around 80%, i.e., about 20% of the samples in each category are misclassified.
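Per-category accuracy, averaged over several runs, can be computed as sketched below. The classifier itself is left out; the predictions are made-up toy values used only to exercise the functions, and the names `per_class_accuracy` and `average_over_runs` are hypothetical.

```python
def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples within each true class."""
    acc = {}
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        acc[cls] = correct / len(idx)
    return acc


def average_over_runs(run_results):
    """Average per-class accuracies over a list of runs."""
    classes = run_results[0].keys()
    return {
        c: sum(r[c] for r in run_results) / len(run_results)
        for c in classes
    }


# Toy labels and predictions (illustrative, not real results):
# 0 = benign, 1 = malicious.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
run_a = per_class_accuracy(y_true, [0, 0, 0, 1, 1, 1, 1, 0])
run_b = per_class_accuracy(y_true, [0, 0, 0, 0, 1, 1, 0, 0])
averaged = average_over_runs([run_a, run_b])
print(averaged)
```

Reporting accuracy per class rather than overall shows whether errors are concentrated in one category.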

Footnotes

[fn:1] Actually they all have zero variance.
