RimTouny / Dynamic-DNS-Traffic-Analysis-for-Data-Exfiltration-Detection-with-Kafka

Crafting static and dynamic models for data exfiltration detection via DNS traffic analysis. Static model trained on batch data, while dynamic model simulates a continuous stream. Rigorous analysis, feature engineering, and model training conducted. Implementation part of AI for Cyber Security Master's assignment at the University of Ottawa, 2023.

Enhanced Data Exfiltration Detection via Dynamic DNS Traffic Analysis using Kafka

This project builds static and dynamic models for detecting data exfiltration through DNS traffic analysis. The static model is trained on batch data (Static_dataset.csv), while the dynamic model is evaluated on Kafka_dataset.csv, which is treated as a data stream served by a local Kafka server. The work covers data analysis, feature engineering, and model training, and was implemented as part of an AI for Cyber Security Master's assignment at the University of Ottawa, 2023.

  • Required libraries: scikit-learn, pandas, matplotlib.
  • Execute cells in a Jupyter Notebook environment.
  • The uploaded code has been executed and tested successfully within the Google Colab environment.

Binary classification problem

The task is to detect data exfiltration through DNS traffic analysis, labelling each record as an attack (1) or not an attack (0).

Independent Variables:

  • 'timestamp': The time at which the data was recorded.
  • 'FQDN_count': The count of fully qualified domain names.
  • 'subdomain_length': The length of the subdomain.
  • 'upper': The count of uppercase characters.
  • 'lower': The count of lowercase characters.
  • 'numeric': The count of numeric characters.
  • 'entropy': Entropy value.
  • 'special': The count of special characters.
  • 'labels': The count of labels.
  • 'labels_max': Maximum count of labels.
  • 'labels_average': Average count of labels.
  • 'longest_word': The longest word in the subdomain.
  • 'sld': Second-level domain.
  • 'len': Length of the subdomain.
  • 'subdomain': The subdomain.
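
As an illustration of how character-level features such as 'upper', 'lower', 'numeric', 'special', 'labels', and 'entropy' could be derived from a query name, here is a minimal sketch using only the standard library. The dataset ships with these columns precomputed, so the exact definitions below (in particular what counts as a special character, and entropy measured in bits per character) are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(fqdn: str) -> dict:
    """Character-level features for one DNS query name (assumed definitions)."""
    labels = [part for part in fqdn.split(".") if part]
    return {
        "upper": sum(c.isupper() for c in fqdn),
        "lower": sum(c.islower() for c in fqdn),
        "numeric": sum(c.isdigit() for c in fqdn),
        # Assumption: "special" is anything that is not a letter or a digit,
        # so dots and hyphens are counted here.
        "special": sum(not c.isalnum() for c in fqdn),
        "labels": len(labels),
        "entropy": shannon_entropy(fqdn),
    }


# Invented query name, for illustration only.
features = lexical_features("aB3x9.example.com")
```

Long, high-entropy subdomains are a common sign of encoded payloads, which is why counts and entropy of this kind are useful inputs for the classifier.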

Target variable:

  • 'Target Attack' : Target Attack label, where 1 indicates an attack and 0 indicates no attack

Key Tasks Undertaken

  • Static Model

    1. Data Analysis:

      • Loaded and explored the "Static_dataset.csv."

      • Utilized various statistical tools and visualizations to understand feature distributions, identify imbalances, and assess the characteristics of numerical and categorical variables.

      • Employed histograms, QQ plots, and boxplots for a comprehensive analysis of numerical features.

      • Examined the count of attack and non-attack cases for categorical features through count plots.

    2. Feature Engineering and Data Cleaning:

      • Analyzed the dataset for string variables and performed necessary transformations.
      • Handled missing values, removed duplicate rows, and dropped unnecessary features.
      • Applied embedding techniques to encode categorical variables, maintaining interpretability.
    3. Feature Filtering/Selection:

      • Employed different feature-scoring techniques, including Mutual Information, ANOVA F-values, Chi-squared scores, and Random Forest-based Recursive Feature Elimination (RFE).
      • Selected relevant features based on the results of the feature selection techniques.
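
To make the Mutual Information ranking concrete, the sketch below computes the discrete mutual information between a feature column and the binary target by hand and sorts the columns by that score. It is an illustration of the quantity being ranked, not the notebook's code: scikit-learn provides `mutual_info_classif` for this purpose, and the column names and toy values here are invented.

```python
import math
from collections import Counter


def mutual_information(feature, target):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    px = Counter(feature)
    py = Counter(target)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Toy data, invented for illustration only.
target = [0, 0, 1, 1]
columns = {
    "informative": [0, 0, 1, 1],  # perfectly tracks the target
    "noise": [0, 1, 0, 1],        # independent of the target
}

# Rank the columns from most to least informative about the target.
ranking = sorted(
    columns,
    key=lambda name: mutual_information(columns[name], target),
    reverse=True,
)
```

A feature that fully determines a balanced binary target scores 1 bit, while an independent feature scores 0, so keeping the top-ranked columns discards the ones that carry no signal.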
    4. Model Selection:

      • Split the data into training and test sets.
      • Scaled the features using StandardScaler.
      • Chose three machine learning models for evaluation: Random Forest, Logistic Regression, and XGBoost.
      • Configured each model with default parameters.

    5. Performance Evaluation:

      • Used the F1-score to identify the best combination of feature selection technique and model.

      Number of best features: 8

      • The best F1-score was obtained using Mutual Information feature selection with the Random Forest model.
        selected_features=['FQDN_count','entropy','labels','labels_average','longest_word','lower','sld','special']
    6. Hyperparameter Tuning & Model Evaluation: using selected_features from Mutual Information.

      Best hyperparameters for Random Forest: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
      Best hyperparameters for Logistic Regression: {'C': 0.1, 'penalty': 'l2', 'solver': 'newton-cg'}
      Best hyperparameters for XGBoost (Extreme Gradient Boosting): {'colsample_bytree': 0.8, 'learning_rate': 0.3, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 200, 'subsample': 1.0}

    7. Champion Static Model:

    8. Save the Champion Model for the Dynamic phase.
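
Handing the champion model from the static phase to the dynamic phase can be sketched as a pickle round trip. The example below is a minimal sketch with the standard library: the dictionary is only a stand-in for the fitted estimator, and the file name is hypothetical because the README does not state the one used in the notebook.

```python
import os
import pickle
import tempfile

# Stand-in for the tuned estimator; any picklable Python object,
# including a fitted scikit-learn model, is saved the same way.
champion = {"model": "RandomForest", "params": {"n_estimators": 200, "max_depth": None}}

# Hypothetical file name and location, chosen for this example only.
path = os.path.join(tempfile.gettempdir(), "champion_model.pkl")

# Static phase: write the model to disk.
with open(path, "wb") as fh:
    pickle.dump(champion, fh)

# Dynamic phase: read it back before consuming the stream.
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```

Pickle files can execute arbitrary code when loaded, so only files produced by a trusted run should be unpickled.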

  • Dynamic Model

    1. Kafka Consumer Setup:

      • Created a Kafka consumer instance for 'ml-raw-dns' topic, connecting to a Kafka broker on 'localhost:9092'.
      • Configured the consumer to start from the earliest offset and use manual offset committing.
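
The consumer settings described above could look like the following, assuming the kafka-python client (the README does not name the client library). The topic, broker address, offset policy, and manual commit setting come from the description; the function name `build_consumer` is hypothetical.

```python
# Settings taken from the description above; keyword names follow kafka-python.
TOPIC = "ml-raw-dns"
CONSUMER_CONFIG = {
    "bootstrap_servers": "localhost:9092",
    "auto_offset_reset": "earliest",  # start from the earliest offset
    "enable_auto_commit": False,      # offsets are committed manually
}


def build_consumer():
    """Create the consumer; needs kafka-python installed and a broker running."""
    # Imported lazily so this module can be loaded without Kafka available.
    from kafka import KafkaConsumer

    return KafkaConsumer(TOPIC, **CONSUMER_CONFIG)
```

In kafka-python, committing offsets requires the consumer to belong to a consumer group, so a `group_id` would also be needed before calling `commit()`; the README does not specify one.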
    2. Data Retrieval and Adjustment:

      • Implemented a function to retrieve 1000 records from the Kafka consumer.
      • Utilized the retrieved data to create a DataFrame with predefined columns.
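
Pulling a fixed-size batch from the stream can be sketched with `itertools.islice`. The sketch assumes each message is one UTF-8 encoded, comma-separated record and that the columns arrive in the order of the variable list in this README; both are assumptions, as is the helper name `take_batch`.

```python
from itertools import islice

# Column order is an assumption based on the variable list in this README.
COLUMNS = [
    "timestamp", "FQDN_count", "subdomain_length", "upper", "lower",
    "numeric", "entropy", "special", "labels", "labels_max",
    "labels_average", "longest_word", "sld", "len", "subdomain",
    "Target Attack",
]


def take_batch(messages, size=1000):
    """Pull up to `size` messages and split each CSV payload into a row dict."""
    rows = []
    for raw in islice(messages, size):
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        rows.append(dict(zip(COLUMNS, text.strip().split(","))))
    return rows


# Stand-in for the Kafka stream: three identical, invented records.
fake_stream = iter([b"t0,27,10,0,10,11,2.5,6,6,7,3.6,2,192,14,1,0"] * 3)
batch = take_batch(fake_stream, size=2)
remaining = len(list(fake_stream))
```

With a real consumer, `messages` would be an iterator over the message payloads, for example `(record.value for record in consumer)`, and `pandas.DataFrame(batch)` would turn the rows into a DataFrame with the predefined columns. All values are still strings at this point and are converted during cleaning.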
    3. Data Cleaning: as done in Static Model.

      • Defined functions for adjusting and cleaning data, including converting categorical values to numerical indices.
      • Dropped unnecessary columns and converted the DataFrame to a consistent data type.
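
Converting categorical values to numerical indices can be sketched as a mapping that is fitted once and then reused on every batch, so the same category always receives the same index. This is a minimal sketch, not the notebook's code: the helper names, the sample values, and the choice of -1 for unseen categories are all assumptions.

```python
def fit_index(values):
    """Assign each distinct category an integer index in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping


def encode(values, mapping, unknown=-1):
    """Replace categories with their indices; unseen categories get `unknown`."""
    return [mapping.get(value, unknown) for value in values]


# Invented sample values for a string column such as 'sld'.
train_sld = ["google", "example", "google", "cdn"]
mapping = fit_index(train_sld)

# A later batch from the stream, including a category not seen before.
encoded_stream = encode(["cdn", "google", "never-seen"], mapping)
```

Reusing the mapping from the static phase matters because a model trained on one set of indices cannot interpret a batch that was encoded with a different one.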
    4. Model Loading and Retraining:

      • Loaded a pre-trained Random Forest model from a pickle file.
      • Initialized both static and dynamic models with the loaded model.
    5. Dynamic Model Evaluation and Retraining:

      • Simulated continuous data processing over 199 iterations.
      • Evaluated the dynamic model's F1 score without retraining for each iteration.
      • Retrained the dynamic model if its F1 score fell below 0.80 and updated it with new training data.
    6. Static Model Evaluation:

      • Evaluated the F1 score of the static model for each iteration without retraining.
    7. Performance Comparison Visualization:

      • Plotted F1 scores of the dynamic model across iterations to observe its performance over time.

      • Plotted F1 scores of the static model across iterations for comparison.

      • Plotted F1 scores of both models on the same plot for a comprehensive comparison.
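
The evaluate-then-retrain loop from steps 5 to 7 can be sketched as follows. The 0.80 threshold comes from the description above; everything else is an assumption made for illustration: `ThresholdModel` is a toy stand-in for the Random Forest, the three batches are invented, and retraining on the current batch alone is a guess, since the README does not say which data the retraining uses.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


class ThresholdModel:
    """Toy stand-in for the classifier: predicts 1 when x exceeds a cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, X):
        return [int(x > self.cutoff) for x in X]

    def fit(self, X, y):
        # Place the cutoff midway between the largest benign value and the
        # smallest attack value (assumes both classes are present).
        benign = [x for x, label in zip(X, y) if label == 0]
        attack = [x for x, label in zip(X, y) if label == 1]
        self.cutoff = (max(benign) + min(attack)) / 2
        return self


def run_stream(batches, static_model, dynamic_model, threshold=0.80):
    """Score both models on every batch; retrain only the dynamic one when it degrades."""
    static_scores, dynamic_scores, retrained_at = [], [], []
    for i, (X, y) in enumerate(batches):
        static_scores.append(f1_score(y, static_model.predict(X)))
        score = f1_score(y, dynamic_model.predict(X))  # scored before any retraining
        dynamic_scores.append(score)
        if score < threshold:
            dynamic_model.fit(X, y)
            retrained_at.append(i)
    return static_scores, dynamic_scores, retrained_at


# Invented batches: the feature distribution shifts upward after the first one.
batches = [
    ([1, 2, 8, 9], [0, 0, 1, 1]),
    ([6, 7, 12, 13], [0, 0, 1, 1]),
    ([6, 8, 11, 14], [0, 0, 1, 1]),
]
static_scores, dynamic_scores, retrained_at = run_stream(
    batches, ThresholdModel(5.0), ThresholdModel(5.0)
)
```

In this toy run both models start from the same cutoff, the drift in the second batch pushes the dynamic model below the threshold and triggers one retraining, and only the dynamic model recovers on the third batch. Plotting `static_scores` and `dynamic_scores` against the iteration index gives the kind of comparison described in step 7.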

License:MIT License


Languages

Language:Jupyter Notebook 100.0%