ChCNN

A convolutional neural network approach to classifying web requests. In Character-level Convolutional Networks for Text Classification (NIPS 2015), Xiang Zhang, Junbo Zhao and Yann LeCun showed that character-level CNNs can be used for text classification. Since HTTP is a text-based protocol and single characters play a significant role in malicious payloads, why not use this approach to identify malicious requests? Moreover, it can be applied to any part of a request that contains text. For example, in Morzeux_HttpParamsDataset only payloads are considered, and in ISCX-URL-2016 classification is applied to URLs.
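To make the character-level idea concrete, here is a minimal sketch of how a request string could be turned into a fixed-length sequence of character indices before it is fed to the network. The alphabet, the lowercasing step and the helper name encode are assumptions made for illustration; the alphabet and preprocessing actually used by this repository may differ.

```python
import string

# Assumed alphabet: lowercase letters, digits, punctuation and space.
# The alphabet actually configured for the model may differ.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "

# Index 0 is reserved for padding and for characters outside the alphabet.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode(text, input_size):
    """Map text to a list of exactly input_size character indices.

    Longer inputs are truncated, shorter ones are right-padded with 0.
    Lowercasing is an assumption; case may carry signal in some payloads.
    """
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in text.lower()[:input_size]]
    return indices + [0] * (input_size - len(indices))


payload = "id=1' OR '1'='1"
encoded = encode(payload, input_size=20)
print(encoded)
```

Each index would then typically be expanded into a one-hot vector or passed through an embedding layer, so the convolutions operate directly on characters such as quotes and angle brackets rather than on words.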

Requirements

TensorFlow, Keras, Jupyter Notebook, Pandas, and an Nvidia GPU with drivers. The easiest way to set these up is to install conda. A good resource to get started

Dataset

ECML/PKDD 2007 and CSIC 2010 Datasets contain whole requests. HttpParamsDataset contains payloads. ISCX-URL-2016 contains only URLs.

ECML/PKDD 2007 Challenge (downloaded from https://gitlab.fing.edu.uy/gsi/web-application-attacks-datasets)
CSIC 2010 Dataset (downloaded from https://gitlab.fing.edu.uy/gsi/web-application-attacks-datasets)
HttpParamsDataset (downloaded from https://github.com/Morzeux/HttpParamsDataset)
ISCX-URL-2016 (downloaded from https://www.unb.ca/cic/datasets/url-2016.html)

Usage

Download the datasets to the corresponding directory.
Run the transform for the chosen dataset to get the train, test and validate samples.
Adjust config.json to reflect the number of classes and the input size in characters, as follows:

ECML/PKDD 2007 (8 classes, approx. 2500 characters)
CSIC 2010 (2 classes, approx. 1400 characters)
HttpParamsDataset (5 classes, approx. 500 characters)
ISCX-URL-2016 (5 classes, approx. 1500 characters)

Run char-cnn
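The per-dataset settings listed above can be kept in one place and merged into the configuration, as in the sketch below. The class counts and input sizes come from the list; the key names "data", "num_of_classes" and "input_size" and the helper apply_dataset_settings are assumptions about the layout of config.json, so check them against the actual file before use.

```python
import json

# Class counts and input sizes from the usage list above.
# The key names are assumed, not confirmed against config.json.
DATASET_SETTINGS = {
    "ECML/PKDD 2007": {"num_of_classes": 8, "input_size": 2500},
    "CSIC 2010": {"num_of_classes": 2, "input_size": 1400},
    "HttpParamsDataset": {"num_of_classes": 5, "input_size": 500},
    "ISCX-URL-2016": {"num_of_classes": 5, "input_size": 1500},
}


def apply_dataset_settings(config, dataset):
    """Return a copy of config whose "data" section matches the dataset."""
    updated = dict(config)
    updated["data"] = {**config.get("data", {}), **DATASET_SETTINGS[dataset]}
    return updated


# Stand-in for the parsed contents of config.json (hypothetical values).
example_config = {"data": {"num_of_classes": 4, "input_size": 1014}}
new_config = apply_dataset_settings(example_config, "CSIC 2010")
print(json.dumps(new_config, indent=2))
```

In practice the dictionary would be loaded with json.load from config.json and written back with json.dump before running char-cnn.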

Results

Data is split into train, validate and test sets. The results below are for the test samples after training and validating on the train and validation samples. Cross-validation is under consideration. Performance could be affected, since the model will not receive large batches of all classes. During training, the loss can suddenly start to increase and explode (especially for Morzeux_HttpParamsDataset). In my tests, lowering the learning rate of the Adam optimizer stabilized the loss and even gave better results (e.g. optimizer=Adam(lr=0.0005)).
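As an illustration of the three-way split, the sketch below divides samples class by class so that rare classes appear in every split. This is only one possible approach; the split fractions, the seed and the function name stratified_split are assumptions, and the transform scripts in this repository may split the data differently.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, val_frac=0.1, test_frac=0.2, seed=0):
    """Split (sample, label) pairs into train/validate/test per class.

    The fractions are illustrative defaults, not the repository's values.
    Every class contributes at least one sample to validate and test.
    """
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    rng = random.Random(seed)
    train, validate, test = [], [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_frac))
        n_val = max(1, round(len(items) * val_frac))
        test += [(s, label) for s in items[:n_test]]
        validate += [(s, label) for s in items[n_test:n_test + n_val]]
        train += [(s, label) for s in items[n_test + n_val:]]
    return train, validate, test


# Toy, imbalanced data: 90 samples of class 0 and 10 of class 1.
samples = [f"req{i}" for i in range(100)]
labels = [0] * 90 + [1] * 10
train, validate, test = stratified_split(samples, labels)
print(len(train), len(validate), len(test))
```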

Results for ECML_PKDD after training for 30 Epochs with input_size of 2500 characters

Labels:

valid = 0
xss = 1
sqlinjection = 2
ldapinjection = 3
xpathinjection = 4
pathtransversal = 5
oscommanding = 6
ssi = 7

       precision    recall  f1-score   support

    0       1.00      1.00      1.00      6959
    1       0.99      1.00      1.00       371
    2       0.99      0.99      0.99       466
    3       1.00      1.00      1.00       445
    4       1.00      0.99      1.00       440
    5       0.99      0.99      0.99       494
    6       0.99      0.98      0.99       477
    7       0.99      0.99      0.99       372

Results for CSIC2010 after training for 30 Epochs with input_size of 1400 characters

Labels:

valid = 0
malicious = 1

        precision    recall  f1-score   support

     0       1.00      1.00      1.00     14404
     1       0.99      0.99      0.99      5009

Results for Morzeux_HttpParamsDataset after training for 10 Epochs with input_size of 500 characters

Labels:

valid = 0
sqli = 1
xss = 2
path-traversal = 3
cmdi = 4

        precision    recall  f1-score   support

     0       1.00      1.00      1.00      3803
     1       1.00      1.00      1.00      2219
     2       1.00      0.93      0.97       105
     3       0.93      0.96      0.95        74
     4       0.44      0.62      0.52        13
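Because the classes in this dataset are heavily imbalanced (13 cmdi samples against 3803 valid ones), a single summary number depends strongly on how the per-class scores are averaged. The short calculation below uses the f1-score and support columns from the table above to compare the unweighted (macro) average with the support-weighted average. The inputs are the rounded values from the table, so the results are approximate.

```python
# Per-class f1-score and support, copied from the table above
# (classes 0..4: valid, sqli, xss, path-traversal, cmdi).
f1 = [1.00, 1.00, 0.97, 0.95, 0.52]
support = [3803, 2219, 105, 74, 13]

# Macro average: every class counts equally.
macro_f1 = sum(f1) / len(f1)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(f"macro F1:    {macro_f1:.3f}")     # prints 0.888
print(f"weighted F1: {weighted_f1:.3f}")  # prints 0.998
```

The weighted average is dominated by the two large classes, while the macro average exposes the weak cmdi class.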

Results for ISCX-URL-2016 after training for 50 Epochs with input_size of 1500 characters

Labels:

benign = 0
defacement = 1
malware = 2
phishing = 3
spam = 4

        precision    recall  f1-score   support

     0       1.00      1.00      1.00      7081
     1       0.99      1.00      1.00     19264
     2       0.99      0.98      0.99      2282
     3       0.97      0.94      0.96      2040
     4       1.00      1.00      1.00      2407

Credits

Credits go to:

https://github.com/chaitjo/character-level-cnn (some code was reused)
https://gitlab.fing.edu.uy/gsi/web-application-attacks-datasets (for the dataset)
https://github.com/Morzeux/HttpParamsDataset (for the dataset)
https://www.unb.ca/cic/datasets/url-2016.html (for the dataset)
