Create a static machine learning model based on batch data. The dataset that is used is from top secret files obtained from our allies Ring Canada (RC) and the Cyber Threat Intelligence (CTI). The dataset provided to you has DNS traffic generated by exfiltrating various filetypes ranging from small to large sizes.
The aim of the task is to implement a binary classifier aiming at predicting data exfiltration via DNS.
- Using the file called “static_dataset.csv”
- checked using plots and statistical tools the distribution of each feature and the target variable
- checked any type of data skewed pattern.
- Validated if your dataset is imbalanced
Analyzed the data inside the .csv file (static_dataset.csv) and transform the variables that contain string values, so that all of them can be used in the model. Check for missing values and categorical values.
Applied PCA dimensionality reduction on the dataset and found that best component is 13
Logistic Regression Model is used for binary classification problem Splited the data using a method you find suitable and justify it. Normalized your data and train the selected model.