A Comparison of Supervised Machine Learning Models to Detect Phishing Websites

Objective

Phishing, a fraudulent practice, has been an area of concern ever since the emergence of the Internet. This project addresses the problem with machine learning. We compare several supervised machine learning algorithms on a publicly available dataset, balanced to contain equal numbers of phishing and legitimate URLs, and identify a model that effectively classifies whether a given URL is a phishing site.

Data Collection

We acquired the data from Kaggle, a public data source. The dataset has 71677 unique URLs with some of the required features. Because the dataset is imbalanced, we balanced it with random sampling.
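The balancing step can be sketched as follows. This is a minimal illustration, not the project's exact code: it assumes random undersampling of the larger class (the text above only says "random sampling"), and the column name `label`, the helper `balance_by_undersampling`, and the toy data are all hypothetical.

```python
import pandas as pd


def balance_by_undersampling(df, label_col="label", seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    smallest = df[label_col].value_counts().min()
    # Draw the same number of rows from each class.
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    balanced = pd.concat(parts)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy frame standing in for Main_dataset.csv (hypothetical columns).
toy = pd.DataFrame({
    "url": [f"http://example{i}.com" for i in range(10)],
    "label": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})
balanced = balance_by_undersampling(toy)
print(balanced["label"].value_counts().to_dict())
```

Undersampling discards rows from the majority class; oversampling the minority class is the usual alternative when the dataset is too small to lose data.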

Data source : https://www.kaggle.com/aman9d/phishing-data

This base dataset is available in 'Main_dataset.csv' of this repository.

Feature Engineering

We extracted a few domain-based features and address-bar features for the URLs in the base dataset. A decision tree was fitted on this data to obtain the feature importances, and the unnecessary features were removed from the dataset. The data was then split into training and test sets.
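The feature-selection and splitting steps described above might look like the sketch below. It uses synthetic data in place of the real features, and the importance threshold of 0.01 and the 80/20 split are assumptions; the notebook may use different values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
# Synthetic stand-in for the extracted features (hypothetical column names).
X = pd.DataFrame({
    "informative": rng.integers(0, 2, n),
    "noise_a": rng.integers(0, 2, n),
    "noise_b": rng.integers(0, 2, n),
})
y = X["informative"]  # the label depends on a single feature in this toy setup

# Fit a decision tree and read the importance it assigns to each feature.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importance = pd.Series(tree.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))

# Keep only features above an assumed importance threshold.
threshold = 0.01
kept = importance[importance > threshold].index.tolist()
X_selected = X[kept]

# Split the reduced dataset for training and testing (assumed 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=0, stratify=y
)
```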

Based on the document 'Phishing Websites Features.docx' in this repository, the value of each feature was converted to 0 for a legitimate site and 1 for a phishing site. The feature extraction process is in the 'feature_extraction.py' file of this repository.
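As an illustration of the 0/1 encoding, the sketch below derives a few address-bar features from a URL using only the standard library. The feature names, the rules, and the length threshold of 75 characters are hypothetical examples; the features actually used are defined in 'Phishing Websites Features.docx' and 'feature_extraction.py'.

```python
import re
from urllib.parse import urlparse

# Matches a host written as a dotted IPv4 address, e.g. 192.168.0.1
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def address_bar_features(url):
    """Return illustrative binary features: 0 = legitimate-looking, 1 = suspicious."""
    host = urlparse(url).hostname or ""
    return {
        "has_ip_address": int(bool(IPV4_HOST.match(host))),
        "has_at_symbol": int("@" in url),
        "long_url": int(len(url) >= 75),  # assumed threshold
        "has_dash_in_domain": int("-" in host),
    }


print(address_bar_features("http://192.168.0.1/login"))
print(address_bar_features("https://www.example.com/"))
```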

This new dataset is available in 'phishing_feature_engg.csv' of this repository.

To understand the relationships and correlations in the data, we created visualisations with the Lux package in Python. These visualisations are available in the 'Visualization_Lux_Phishing_Sites_Detection.ipynb' file of this repository.

Model Development

The supervised machine learning algorithms used for this analysis are Logistic Regression, Naive Bayes Classifier, Support Vector Machines, Decision Tree Classifier, Random Forest Classifier, XGBoost Classifier and Neural Network.

These models were trained and tested on the feature-extracted dataset and evaluated to identify the best-performing model. XGBoost had good accuracy and a fast testing time compared with the other algorithms. A grid search was then run on XGBoost for hyperparameter tuning.
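The comparison and tuning workflow can be sketched as below. This is an assumption-laden outline rather than the notebook's code: it uses synthetic data, covers only a subset of the algorithms, and runs the grid search on a Random Forest as a stand-in so that the example depends on scikit-learn alone. `xgboost.XGBClassifier` exposes the same fit/predict interface and could be added to the `models` dictionary or passed to `GridSearchCV` in the same way. The parameter grid is hypothetical.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for phishing_feature_engg.csv.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Train each model, then record test accuracy and prediction time.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    start = time.perf_counter()
    predictions = model.predict(X_test)
    elapsed = time.perf_counter() - start
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "test_seconds": elapsed,
    }

for name, r in sorted(results.items(), key=lambda kv: -kv[1]["accuracy"]):
    print(f"{name}: accuracy={r['accuracy']:.3f}, test time={r['test_seconds']:.4f}s")

# Grid search over an assumed parameter grid (Random Forest as a stand-in).
param_grid = {"max_depth": [3, 5], "n_estimators": [25, 50]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

Timing only the `predict` call mirrors the "testing time" criterion mentioned above; training time would need a separate timer around `fit`.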

The entire code for this project is available in 'Detecting_phishing_websites.ipynb' file of this repository.

Results

After fine-tuning, the XGBoost classifier was chosen as the final model, with an accuracy of 82.4%. The model was saved with Python's pickle module and is available as 'phishing_classifier.pkl' in this repository.
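Saving and reloading a model with pickle follows the pattern sketched below. A small scikit-learn Logistic Regression stands in for the tuned XGBoost model, and the demo file name and temporary directory are assumptions chosen so that the example does not overwrite 'phishing_classifier.pkl'.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.linear_model import LogisticRegression

# Stand-in model trained on four toy rows (not the project's classifier).
model = LogisticRegression().fit([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 1, 1])

# Write to a temporary location instead of the repository's real pickle file.
path = Path(tempfile.mkdtemp()) / "phishing_classifier_demo.pkl"
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the model back and check that it predicts the same labels.
with open(path, "rb") as f:
    restored = pickle.load(f)

sample = [[1, 0], [0, 1]]
print(list(restored.predict(sample)) == list(model.predict(sample)))
```

Pickle files can execute arbitrary code when loaded, so a saved model should only be unpickled if it comes from a trusted source.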

Future Work

The saved model can be extended into a browser extension or added as a plugin to internet security products, so that phishing sites are identified efficiently and users are warned to avoid them.

Required Installations

Software

Jupyter Notebook, Python 3 or later

Python packages

sklearn, numpy, pandas, pickle

lux, seaborn, matplotlib, xgboost

BeautifulSoup, whois, urllib, tldextract
