SMS Spam Detection Project

Part of my data science portfolio: a machine learning system that classifies SMS messages as spam or ham (non-spam).

Project Overview

This project develops a spam detection system using machine learning techniques. The current focus is on establishing strong baseline models and evaluation metrics.

Motivation

This project builds on my prior work in text analysis (e.g., Word Cloud Visualization, Travel Blog Analysis) and classification (e.g., SME Closure Prediction). Before moving on to more sophisticated techniques, it starts with strong baseline models and robust evaluation metrics, to build a solid understanding of the core challenges in classification.

Tech Stack

Data Processing & Analysis

Pandas: Data preprocessing and manipulation
NumPy: Numerical computing for feature engineering

Machine Learning & NLP

Scikit-learn: Classification algorithms and model evaluation
NLTK: Text preprocessing and tokenization

Data Visualization

WordCloud: Spam/ham text visualization
Matplotlib: Model performance visualization
Seaborn: Statistical analysis and confusion matrix plots

Project Structure

/sms-spam-classifier
├── README.md                        # Project overview and documentation
├── LICENSE                          # Project license file
├── requirements.txt                 # Python dependencies
├── notebooks/                       # Jupyter notebooks for analysis
├── data/                            # Dataset
├── tests/                           # Unit tests
├── assets/                          # Images
└── docs/                            # Project documentation

Current Progress

Implemented initial baseline models using different approaches:
- Count Vectorizer + Logistic Regression
- TF-IDF + Random Forest
Enhanced exploratory data analysis (EDA), focusing on:
- Message length distribution analysis
- Text feature analysis (word count, special characters, capitals ratio, etc.)
- Word frequency visualization and word clouds
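The text features listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the notebooks: the function name `text_features`, the feature names, and the choice to treat ASCII punctuation as "special characters" are all assumptions made for the example.

```python
import string


def text_features(message: str) -> dict:
    """Compute simple surface features for one SMS message (hypothetical helper)."""
    letters = [ch for ch in message if ch.isalpha()]
    capitals = [ch for ch in letters if ch.isupper()]
    return {
        "char_count": len(message),
        "word_count": len(message.split()),
        # Assumption: ASCII punctuation stands in for "special characters".
        "special_char_count": sum(ch in string.punctuation for ch in message),
        # Share of alphabetic characters that are uppercase; 0.0 if there are no letters.
        "capitals_ratio": len(capitals) / len(letters) if letters else 0.0,
    }


# Made-up message, used only to show the shape of the output.
example = text_features("WIN a FREE prize!!! Call now")
print(example)
```

Applied to a message column with something like `df["message"].apply(text_features)`, features of this kind can then be compared between spam and ham during EDA.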
Completed basic text preprocessing and model evaluation
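As a rough sketch of how the two baseline combinations named above might be wired together with scikit-learn pipelines: the toy messages, the pipeline names, the split size, and all hyperparameters below are illustrative assumptions, not the settings or data used in this project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; these are not rows from the real dataset.
texts = [
    "Free entry! Text WIN to claim your prize",
    "URGENT: you have won a cash reward, call now",
    "Claim your free ringtone today",
    "Winner! Reply YES to get your bonus",
    "Are we still meeting for lunch?",
    "I'll call you when I get home",
    "Can you send me the notes from class?",
    "Running late, see you in ten minutes",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Stratified split so both classes appear in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The two baseline combinations: vectorizer followed by classifier.
baselines = {
    "count_logreg": make_pipeline(
        CountVectorizer(), LogisticRegression(max_iter=1000)
    ),
    "tfidf_rf": make_pipeline(
        TfidfVectorizer(), RandomForestClassifier(n_estimators=100, random_state=42)
    ),
}

predictions = {}
reports = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    # Per-class precision, recall, and F1 as a printable text report.
    reports[name] = classification_report(
        y_test, predictions[name], zero_division=0
    )
    print(name)
    print(reports[name])
```

Wrapping the vectorizer and classifier in a single pipeline means the vocabulary is learned from the training split only, which keeps the evaluation free of leakage from the test messages.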

Next Steps

Model Performance Improvement
Code Structure Enhancement
Further EDA and Feature Engineering

Dataset

The project uses the UCI SMS Spam Collection Dataset, obtained from Kaggle.
The task is binary classification: spam vs. ham (non-spam) messages.
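A minimal sketch of loading labelled messages with Pandas and encoding the two classes as 0/1. The file layout here is an assumption (two tab-separated columns, label then message, with no header row); the copy in `data/` may use a different delimiter or column names, so the `read_csv` arguments would need adjusting. The in-memory sample and the column names `label`, `message`, and `is_spam` are made up for the example.

```python
import io

import pandas as pd

# Stand-in for the real file: label and message separated by a tab (assumed layout).
sample = io.StringIO(
    "ham\tOk, see you at lunch\n"
    "spam\tFree entry! Text WIN to claim\n"
    "ham\tRunning late, sorry\n"
)

df = pd.read_csv(sample, sep="\t", header=None, names=["label", "message"])

# Encode the binary target: ham -> 0, spam -> 1.
df["is_spam"] = df["label"].map({"ham": 0, "spam": 1})

# Class balance is worth checking before choosing evaluation metrics.
class_share = df["label"].value_counts(normalize=True)
print(class_share)
```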

This project is part of my journey to become a data scientist who solves real-world problems through data-driven solutions.

KwonNayeon/sms-spam-classifier