STAT-576-Phishing-Final-Project

Introduction

A final project authored by Cory Suzuki, Nathaniel Talampas, and Richard Diaz DeLeon for Dr. Seungjoon Lee's Unsupervised Learning class. Here, we apply dimensionality reduction techniques for feature extraction and use clustering methods to uncover trends relevant to the classification of phishing URLs.

Motivation

In today's digital age, phishing attacks have become a major threat to individuals and organizations alike. These attacks exploit users' trust, resulting in significant financial losses and breaches of sensitive information. To address this widespread issue, it is crucial to develop effective methods for identifying and mitigating phishing attempts.

About-The-Data

The Phishing URL Dataset consists of URLs labeled as either phishing or legitimate. Phishing URLs are designed to deceive users into providing sensitive information, such as usernames and passwords, often mimicking legitimate sites. Data can be found at the following link: https://archive.ics.uci.edu/dataset/967/phiusiil+phishing+url+dataset

Methods

FEATURE SELECTION BY CORRELATION
PRINCIPAL COMPONENT ANALYSIS (PCA)
MULTIDIMENSIONAL SCALING (MDS)
ISOMAP
LOCALLY LINEAR EMBEDDING (LLE)
T-DISTRIBUTED STOCHASTIC NEIGHBOR EMBEDDING (T-SNE)
K-MEANS CLUSTERING
MINIBATCH K-MEANS CLUSTERING
AGGLOMERATIVE (SINGLE LINKAGE) CLUSTERING
DBSCAN

Data-Preprocessing

The data was retrieved directly from the UCI ML Repository using Python. Preprocessing consisted of one-hot encoding binary features, label encoding the target feature, standardizing the data, removing duplicates and missing values, checking the class balance of the target feature, and ensuring that all features were either numerical or categorical.
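The preprocessing steps listed above might look roughly like the sketch below. It is not the code from the project notebooks: the `preprocess` helper, the toy table, and its column names (`url_length`, `is_https`, `label`) are hypothetical stand-ins for the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler


def preprocess(df: pd.DataFrame, target: str = "label"):
    """Clean a labeled feature table; return (scaled X, encoded y, class balance)."""
    # Remove exact duplicate rows and rows with missing values.
    df = df.drop_duplicates().dropna().reset_index(drop=True)

    # Label-encode the target so the classes become integers 0..k-1.
    y = pd.Series(LabelEncoder().fit_transform(df[target]), name=target)

    # One-hot encode non-numeric feature columns; numeric ones pass through.
    X = pd.get_dummies(df.drop(columns=[target]), dtype=float)

    # Standardize every feature to zero mean and unit variance.
    X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

    # Class balance of the target, as proportions per class.
    balance = y.value_counts(normalize=True).sort_index()
    return X_scaled, y, balance


# Hypothetical toy data: one duplicated row and one row with a missing value.
toy = pd.DataFrame({
    "url_length": [20, 35, 35, 80, 15, None],
    "is_https": ["yes", "no", "no", "no", "yes", "yes"],
    "label": ["legit", "phish", "phish", "phish", "legit", "legit"],
})
X, y, balance = preprocess(toy)
```

Checking `balance` before clustering shows whether one class dominates, which affects how the later ARI scores should be read.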

Feature-Selection

We then applied feature selection to remove redundant features. We chose feature selection by correlation because the EDA performed in our work showed that many features were highly correlated with one another. We used a threshold of 0.85, eliminating features whose correlations exceeded it and were therefore too close to 1.00.
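One common way to implement a correlation filter with a 0.85 threshold is sketched below. The `drop_correlated` helper and the synthetic columns are assumptions for illustration, and the use of the absolute Pearson correlation is our choice here; the project notebooks may handle ties or signs differently.

```python
import numpy as np
import pandas as pd


def drop_correlated(X: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """Drop the later feature of every pair whose |Pearson r| exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it is too correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


# Synthetic example: "a_copy" is a noisy duplicate of "a"; "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({"a": a, "a_copy": a + 0.01 * rng.normal(size=200), "b": b})
X_reduced = drop_correlated(X, threshold=0.85)
```

This greedy rule keeps the first feature of each correlated pair, so the result depends on column order.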

Feature Extraction

This step was necessary to preserve the latent information in the remaining features. We applied the linear and nonlinear dimensionality reduction techniques and manifold learning algorithms listed above to extract the most informative dimensions. Based on our investigations, we concluded that t-SNE embeddings in two and three dimensions produced promising results, as they best captured the nonlinear nature of the data.
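As a rough illustration of the two- and three-dimensional t-SNE embeddings described above, the sketch below uses scikit-learn's `TSNE` on synthetic data. The blob data, the perplexity of 30, and the PCA initialization are assumptions made for the example, not settings taken from the project.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected features: 150 samples, 10 features.
X_raw, _ = make_blobs(n_samples=150, n_features=10, centers=2, random_state=0)
X = StandardScaler().fit_transform(X_raw)

# Embed the same data into 2 and 3 dimensions.
embeddings = {}
for n_components in (2, 3):
    tsne = TSNE(
        n_components=n_components,
        perplexity=30,      # must be smaller than the number of samples
        init="pca",
        random_state=0,
    )
    embeddings[n_components] = tsne.fit_transform(X)
```

Each entry of `embeddings` can then be passed to a clustering method or plotted to inspect how well the two classes separate.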

Clustering

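An assortment of clustering methods was employed to partition the data into two clusters, which were compared against the known classes, with 1=phishing url and 0=not phishing url. Out of the above methods, K-Means and MiniBatch K-Means achieved the most promising Adjusted Rand Index (ARI) scores. Note that ARI measures agreement between the cluster assignments and the true labels, corrected for chance, rather than classification accuracy.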
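A minimal sketch of how K-Means and MiniBatch K-Means can be scored with ARI follows. It runs on synthetic, well-separated blobs rather than the phishing data, and the hyperparameters (`n_init=10`, `batch_size=256`) are assumptions, so the scores say nothing about the project's actual results.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic two-class data standing in for a low-dimensional embedding.
X, y_true = make_blobs(
    n_samples=500,
    centers=[[-5, -5, -5], [5, 5, 5]],
    cluster_std=1.0,
    random_state=0,
)

models = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "MiniBatchKMeans": MiniBatchKMeans(
        n_clusters=2, n_init=10, batch_size=256, random_state=0
    ),
}

# ARI is invariant to how the cluster ids are numbered, so no relabeling
# of the predicted clusters is needed before comparing with y_true.
scores = {
    name: adjusted_rand_score(y_true, model.fit_predict(X))
    for name, model in models.items()
}
```

An ARI near 1 means the clusters match the labels almost exactly, while a value near 0 means the agreement is no better than chance.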

Future Work

To improve this project and provide a more robust analysis of the data, we consider the following recommendations:

Consider deep-learning methods and semi-supervised learning algorithms.
Incorporate more recent or real-time data to analyze current trends.
Treat the distance metric as a possible hyperparameter to tune, as this project only assumed the Euclidean metric (e.g., try Mahalanobis, Manhattan, etc.).
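The last recommendation could be explored along the lines of the sketch below, which treats the `metric` argument of scikit-learn's `DBSCAN` as a hyperparameter and compares ARI across metrics. The synthetic data, `eps=1.0`, and `min_samples=5` are assumptions for illustration. Mahalanobis is left out because it needs an additional covariance estimate.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic two-class data; the real project would use its own embedding.
X, y_true = make_blobs(
    n_samples=200,
    centers=[[-5, -5], [5, 5]],
    cluster_std=0.5,
    random_state=0,
)

# Candidate distance metrics to compare.
metrics = ["euclidean", "manhattan", "chebyshev"]

scores = {}
for metric in metrics:
    # DBSCAN labels noise points as -1; ARI treats them as one more group.
    labels = DBSCAN(eps=1.0, min_samples=5, metric=metric).fit_predict(X)
    scores[metric] = adjusted_rand_score(y_true, labels)

# Pick the metric with the highest ARI on this data.
best_metric = max(scores, key=scores.get)
```

In practice `eps` would need to be tuned jointly with the metric, since the same radius covers a different neighborhood under each distance.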

For further analysis and information about this project, refer to the report, Python notebooks, and PowerPoint slides.

Updated as of 12/2/24 by Cory Suzuki

