KwonNayeon/sms-spam-classifier

SMS Spam Detection Project

Part of my data science portfolio: a machine learning system that classifies SMS messages as spam or ham (non-spam).

Project Overview

This project develops a spam detection system using machine learning techniques. The current focus is on establishing strong baseline models and evaluation metrics.

Motivation

This project builds on prior work in text analysis (e.g., Word Cloud Visualization, Travel Blog Analysis) and classification (e.g., SME Closure Prediction). The aim is to lay a solid foundation before moving on to more sophisticated techniques: start with strong baseline models and robust evaluation metrics, and use them to understand the core challenges of classification.

Tech Stack

Python, Scikit-learn, Pandas, NLTK

Data Processing & Analysis

  • Pandas: Data preprocessing and manipulation
  • NumPy: Numerical computing for feature engineering

Machine Learning & NLP

  • Scikit-learn: Classification algorithms and model evaluation
  • NLTK: Text preprocessing and tokenization

Data Visualization

  • WordCloud: Spam/ham text visualization
  • Matplotlib: Model performance visualization
  • Seaborn: Statistical analysis and confusion matrix plots

Project Structure

/sms-spam-classifier
├── README.md                        # Project overview and documentation
├── LICENSE                          # Project license file
├── requirements.txt                 # Python dependencies
├── notebooks/                       # Jupyter notebooks for analysis
├── data/                            # Dataset
├── tests/                           # Unit tests
├── assets/                          # Images
└── docs/                            # Project documentation

Current Progress

  • Implemented initial baseline models using different approaches:
    • Count Vectorizer + Logistic Regression
    • TF-IDF + Random Forest
  • Enhanced Exploratory Data Analysis (EDA) focusing on:
    • Message length distribution analysis
    • Text feature analysis (word count, special characters, capitals ratio, etc.)
    • Word frequency visualization and word clouds
  • Basic text preprocessing and model evaluation completed
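
The two baselines listed above could be expressed as Scikit-learn pipelines along the following lines. This is a minimal sketch, not the code from the notebooks: the helper name `build_baselines`, the step names, the hyperparameters (`max_iter=1000`, `n_estimators=100`, `random_state=42`), and the toy messages are all assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_baselines(random_state=42):
    """Return the two baseline pipelines keyed by a short name (hypothetical helper)."""
    return {
        # Baseline 1: raw token counts fed into logistic regression.
        "count_logreg": Pipeline([
            ("vectorizer", CountVectorizer()),
            ("classifier", LogisticRegression(max_iter=1000)),
        ]),
        # Baseline 2: TF-IDF weighted tokens fed into a random forest.
        "tfidf_rf": Pipeline([
            ("vectorizer", TfidfVectorizer()),
            ("classifier", RandomForestClassifier(
                n_estimators=100, random_state=random_state)),
        ]),
    }


# Toy messages for illustration only; they are not taken from the dataset.
messages = [
    "WIN a FREE prize now, call this number",
    "Free entry in a weekly prize draw, text WIN",
    "Urgent! You have won a cash prize, claim now",
    "Are we still meeting for lunch today?",
    "I will call you when I get home",
    "Can you send me the notes from class?",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

for name, model in build_baselines().items():
    model.fit(messages, labels)
    print(name, model.predict(["free prize, call now"]))
```

Wrapping the vectorizer and the classifier in one `Pipeline` keeps the vocabulary fitted on training data only, which matters once the models are evaluated on a held-out split or with cross-validation.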

Next Steps

  1. Model Performance Improvement
  2. Code Structure Enhancement
  3. Further EDA and Feature Engineering
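
As a starting point for the feature engineering step, the text features named under Current Progress (word count, special characters, capitals ratio) could be computed per message roughly as follows. This is a sketch using only the standard library. The function name `text_features` and the exact definitions are assumptions: "special characters" is taken to mean ASCII punctuation, and "capitals ratio" is taken to mean uppercase letters divided by all letters.

```python
import string


def text_features(message):
    """Compute simple per-message features (hypothetical helper)."""
    words = message.split()
    letters = [ch for ch in message if ch.isalpha()]
    capitals = sum(1 for ch in letters if ch.isupper())
    return {
        "char_count": len(message),
        "word_count": len(words),
        # Assumption: "special characters" means ASCII punctuation.
        "special_char_count": sum(1 for ch in message if ch in string.punctuation),
        # Assumption: ratio of uppercase letters to all letters; 0.0 if no letters.
        "capitals_ratio": capitals / len(letters) if letters else 0.0,
    }


# Toy message for illustration only; not taken from the dataset.
print(text_features("WIN a FREE prize!!!"))
```

Features like these can be inspected per class during EDA or appended to the vectorized text as extra numeric columns.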

Dataset

  • Source: the UCI SMS Spam Collection Dataset, obtained from Kaggle
  • Binary classification: spam vs ham (non-spam) messages
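
Loading the data and encoding the spam/ham labels as a binary target might look like the sketch below. The function name `load_sms` and the column names `label` and `text` are assumptions; the header and encoding of the actual file in `data/` may differ, so check them and pass the right column names before reusing this. The in-memory sample stands in for the real file.

```python
import io

import pandas as pd


def load_sms(source, label_col="label", text_col="text"):
    """Read a CSV of messages and add a binary is_spam column (hypothetical helper)."""
    df = pd.read_csv(source, usecols=[label_col, text_col])
    df = df.rename(columns={label_col: "label", text_col: "text"})
    # Map the two classes to integers: ham -> 0, spam -> 1.
    df["is_spam"] = df["label"].map({"ham": 0, "spam": 1})
    if df["is_spam"].isna().any():
        raise ValueError("unexpected label value; expected 'ham' or 'spam'")
    df["is_spam"] = df["is_spam"].astype(int)
    return df


# Toy rows for illustration only; not taken from the dataset.
sample = io.StringIO(
    "label,text\n"
    "ham,See you at lunch\n"
    "spam,WIN a FREE prize now\n"
)
df = load_sms(sample)
print(df["is_spam"].value_counts())
```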

This project is part of my journey to become a data scientist who solves real-world problems through data-driven solutions.
