gangeshbaskerr / Phishing-Website-Detection

A project that predicts whether a URL is a phishing URL by extracting 17 features in 3 categories, then training and testing machine learning models on a dataset from PhishTank.



Inspiration

Phishing attacks have emerged as a significant and persistent threat in the digital landscape, targeting individuals, organizations, and even governments. These deceptive techniques employed by cybercriminals aim to trick unsuspecting users into divulging sensitive information, such as login credentials, financial details, or personal data.

Research shows that over 48% of emails sent in 2022 were spam, with an estimated 3.4 billion spam emails sent every day. Globally, 323,972 internet users fell victim to phishing attacks in 2021. With an average of $136 lost per phishing attack, this amounts to roughly $44.2 million stolen by cybercriminals through phishing attacks in 2021.


Problem Statement

Phishing attacks pose a significant threat to online users, compromising their privacy, financial security, and trust in online interactions. Detecting and mitigating phishing sites remains challenging, requiring effective techniques to identify and differentiate between legitimate and malicious websites accurately.

Existing phishing detection methods often struggle to keep pace with the evolving tactics employed by cybercriminals, necessitating the development of an enhanced approach for phishing site detection.

Therefore, a critical need is to develop an improved system combining advanced machine learning techniques, feature engineering, and behavioural analysis to detect phishing sites accurately and efficiently. By addressing these challenges, the proposed methodology aims to improve the security of online users, protect their sensitive information, and foster a safer digital environment.


Introduction

The aim is to contribute to developing a more secure digital environment by offering an advanced approach to phishing site detection. By accurately identifying and mitigating phishing threats, the proposed model will enhance the safety and trustworthiness of online interactions, protecting users from falling victim to phishing attacks.

In the following sections, we will discuss the related literature, present the methodology, describe the experiments and results, and conclude with the implications and future directions of the research.


Approach

• Datasets containing phishing and legitimate websites are collected from the open-source platform PhishTank.

• Write code to extract the required features from the URL database.

• Analyze and preprocess the dataset using EDA techniques.

• Divide the dataset into training and testing sets.

• Run selected machine learning and deep neural network algorithms on the dataset: Decision Tree, Random Forest, Multilayer Perceptrons, XGBoost, Autoencoder Neural Network, and Support Vector Machines.

• Write code to display the evaluation results using accuracy metrics.

• Compare the obtained results for trained models and specify which is better.


Procedure

1️⃣ Pre-install all the required libraries

   1) TensorFlow
   2) NumPy
   3) pandas
   4) scikit-learn

2️⃣ Understand the dataset

Datasets containing phishing and legitimate websites are collected from the open-source platform PhishTank.

This service provides a set of phishing URLs in multiple formats, such as CSV and JSON, that is updated hourly. From this dataset, 5000 random phishing URLs are collected to train the machine learning models.

The legitimate URLs are obtained from the open datasets of the University of New Brunswick. This dataset has a collection of benign, spam, phishing, malware, and defacement URLs. Of these types, only the benign URL dataset is used for this project. From this dataset, 5000 random legitimate URLs are collected to train the ML models.
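
The sampling step described above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual notebook code: the function name `build_balanced_dataset`, the seed, and the stand-in URL lists are assumptions made for the example. The label convention (1 = phishing, 0 = legitimate) follows the one used later in this README.

```python
import random


def build_balanced_dataset(phishing_urls, legitimate_urls, n_per_class=5000, seed=42):
    """Sample n_per_class URLs from each source and attach a class label.

    Labels: 1 = phishing, 0 = legitimate.
    Raises ValueError if a source has fewer than n_per_class URLs.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    phishing = rng.sample(list(phishing_urls), n_per_class)
    legitimate = rng.sample(list(legitimate_urls), n_per_class)
    rows = [(url, 1) for url in phishing] + [(url, 0) for url in legitimate]
    rng.shuffle(rows)  # mix the two classes before any later split
    return rows


# Tiny stand-in pools (hypothetical URLs); the real pools would come from
# the PhishTank and University of New Brunswick downloads.
phishing_pool = [f"http://phish-{i}.example/login" for i in range(20)]
legitimate_pool = [f"https://site-{i}.example/" for i in range(20)]

dataset = build_balanced_dataset(phishing_pool, legitimate_pool, n_per_class=5)
```

With the default `n_per_class=5000`, the same function would yield the 10,000-row balanced dataset described above.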

3️⃣ Feature Extraction

The following categories of features are extracted from the URL data:

  1. Address Bar-based Features

    • In this category, 9 features are extracted.

  2. Domain-based Features

    • In this category, 4 features are extracted.

  3. HTML & JavaScript-based Features

    • In this category, 4 features are extracted.

Altogether, 17 features are extracted from the 10,000-URL dataset and stored in the '5.urldata.csv' file in the Data Files folder.
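
To give a feel for what address bar-based features can look like, here is a minimal sketch using only the standard library. This README does not list the 9 features the project extracts, so the five features below, their names, and the length threshold of 54 characters are assumptions chosen for illustration; they are not necessarily the project's own.

```python
import ipaddress
from urllib.parse import urlparse


def has_ip_host(url):
    """1 if the host part of the URL is a literal IP address, else 0."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0


def has_at_symbol(url):
    """1 if the URL contains '@', which can hide the real destination."""
    return 1 if "@" in url else 0


def is_long_url(url, threshold=54):
    """1 if the URL is at least `threshold` characters long (assumed cutoff)."""
    return 1 if len(url) >= threshold else 0


def path_depth(url):
    """Number of non-empty segments in the URL path."""
    return sum(1 for part in urlparse(url).path.split("/") if part)


def has_redirect_slashes(url):
    """1 if '//' appears anywhere after the scheme separator."""
    rest = url.split("://", 1)[-1]
    return 1 if "//" in rest else 0


def extract_features(url):
    """Collect the illustrative features for one URL into a dict."""
    return {
        "has_ip_host": has_ip_host(url),
        "has_at_symbol": has_at_symbol(url),
        "is_long_url": is_long_url(url),
        "path_depth": path_depth(url),
        "has_redirect_slashes": has_redirect_slashes(url),
    }


suspicious = extract_features("http://192.168.0.1/a/b//c@x")
ordinary = extract_features("https://www.example.com/docs/index.html")
```

Applying `extract_features` to every URL in the dataset would produce one row of numeric features per URL, which is the shape a file such as '5.urldata.csv' would hold.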

4️⃣ Build and train the model

Before starting the ML model training, the data is split 80-20, i.e., 8000 training samples and 2000 testing samples. Because every URL in the dataset carries a known label, this is a supervised machine learning task.

Specifically, it is a classification problem, as each input URL is classified as phishing (1) or legitimate (0).
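
The 80-20 split can be sketched as below. This is a standard-library illustration only; in a scikit-learn workflow the usual tool would be `train_test_split` from `sklearn.model_selection`. The function name `split_rows`, the seed, and the placeholder rows are assumptions made for the example.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(rows)      # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder rows standing in for the 10,000 labelled URLs
# (label 1 = phishing, 0 = legitimate).
rows = [(f"url-{i}", i % 2) for i in range(10_000)]

train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 8000 2000
```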

The supervised machine learning models (classification) considered to train the dataset in this project are:

• Decision Tree

• Random Forest

• Multilayer Perceptrons

• XGBoost

• Autoencoder Neural Network

• Support Vector Machines

5️⃣ Save the model

   Save the model and calculate the training and testing accuracy.

Testing and training accuracy

After 50 epochs, the XGBoost model reached a training accuracy of 86.7% and a testing accuracy of 85.8%.
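
For reference, accuracy is the fraction of predictions that match the true labels. The sketch below shows that calculation on a small made-up example; the function name `accuracy` and the sample label lists are assumptions for illustration, and scikit-learn's `accuracy_score` performs the same calculation.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 1 = phishing, 0 = legitimate.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))  # 0.75 (6 of 8 predictions are correct)
```

On a 2000-sample test set, a testing accuracy of 85.8% would correspond to 1716 correct predictions.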


Result

Of the models above, the XGBoost Classifier performs best, with a training accuracy of 86.7% and a testing accuracy of 85.8%. The model is therefore saved to the file 'XGBoostClassifier.pickle.dat'.
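
Saving and reloading a model with `pickle` can be sketched as follows. This is an illustration under stated assumptions: the helper names `save_model` and `load_model` are hypothetical, and a plain dictionary stands in for the trained classifier so the example runs without any third-party library. Only the file name comes from this README.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a model object to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pickled model. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a trained classifier (hypothetical contents).
stand_in_model = {"name": "XGBoostClassifier", "n_features": 17}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "XGBoostClassifier.pickle.dat")
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)
```

The same two helpers would work with a fitted classifier object in place of the dictionary, provided the library that defines it is installed when the file is loaded.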


Learnings

  1. Building various machine learning models:

    How to build, train, and fine-tune Decision Tree, Random Forest, Multilayer Perceptron, XGBoost, Autoencoder Neural Network, and Support Vector Machine models.

  2. Machine learning:

    How to use machine learning to identify phishing sites.

  3. URLs and HTTP:

    I studied how the URLs and HTTP characteristics of a phishing website are identified.

  4. How to extract features from a dataset:

    How to extract features from the dataset so that the models can learn more effectively and efficiently.

  5. Teamwork:

    Collaborating and communicating effectively in a team to deliver a project.

  6. Understanding the need for phishing website detection:

    These are just a few examples of the knowledge and skills that I gained while building this project. Overall, building a phishing site detection model is a challenging and rewarding experience that requires a combination of technical expertise and domain knowledge.


Project Deployment

We have built an app using Flutter, a cross-platform app development framework by Google for building, testing, and deploying mobile, web, desktop, and embedded apps from a single codebase. The app works hand in hand with the model to help keep the user safe.


One more thing

  1. Browser extension: This project can be taken further by creating a browser extension with a GUI.

  2. The machine learning models shown here can be served as REST API endpoints, which can then be used with add-ons to detect phishing websites in real time.

  3. As this is a software solution, it can be integrated into various platforms with minimal issues and effort. Furthermore, as we encounter new links, we can keep improving the accuracy by collecting real-time feedback from users.


Conclusion

Through this project, we expect a 40-50% decrease in the number of phishing attacks. If the project is used effectively, it may lead to a large decrease in the overall percentage of phishing attacks.




Languages

Jupyter Notebook 95.5%, Python 3.9%, Dart 0.6%