Email Spam Detection Project

Course: Keamanan Data (Data Security)
Lecturer: Puguh Hiskiawan

Contributors:

  • Jason Jesse Joel Polii
  • Christofer Agatha Ho
  • Paskalis Peter

This project implements a machine learning pipeline for detecting spam emails. It covers data preprocessing, model training with SMOTE to handle class imbalance, model evaluation, and a Streamlit web application for real-time inference.

🚀 Features

  • Data Preprocessing: Text cleaning, tokenization, stopword removal, and stemming.
  • Model Training: Trains Naive Bayes, Logistic Regression, and Linear SVM models (see Models Implemented below for the full list).
  • Imbalanced Data Handling: Uses SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset.
  • Evaluation: Comprehensive evaluation metrics (Accuracy, Precision, Recall, F1-Score) saved to JSON.
  • Web App: Interactive Streamlit dashboard for spam detection, model performance visualization, and confusion matrix/ROC curve analysis.

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/Jsznn/project-keamananData.git
    cd project-keamananData

  2. Create and activate a virtual environment (optional but recommended):

    python -m venv venv
    # Windows
    .\venv\Scripts\activate
    # macOS/Linux
    source venv/bin/activate

  3. Install dependencies:

    pip install -r requirements.txt

⚙️ Usage

1. Data Preprocessing

Clean and transform the raw dataset (data/spam.csv). This will generate data/processed_spam.csv.

python src/preprocessing.py
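
The actual steps live in src/preprocessing.py; as a rough sketch of what such a pipeline typically looks like with NLTK (the column names and exact steps below are illustrative assumptions, not the script's real API):

    # Illustrative sketch only; see src/preprocessing.py for the real steps.
    # Assumes NLTK data is available: nltk.download("punkt"), nltk.download("stopwords")
    import re

    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    def clean_text(text: str) -> str:
        text = re.sub(r"[^a-z\s]", " ", text.lower())         # lowercase, strip punctuation/digits
        tokens = word_tokenize(text)                          # tokenize
        tokens = [t for t in tokens if t not in stop_words]   # remove stopwords
        return " ".join(stemmer.stem(t) for t in tokens)      # stem and rejoin

    df = pd.read_csv("data/spam.csv", encoding="latin-1")     # "text"/"label" columns assumed
    df["clean_text"] = df["text"].astype(str).apply(clean_text)
    df.to_csv("data/processed_spam.csv", index=False)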

2. Model Training

Train the models and save them to the models/ directory. This script also saves the TF-IDF vectorizer and Tokenizer.

python entrypoint/train.py
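
Under stated assumptions (TF-IDF features, a "label" column, scikit-learn plus imbalanced-learn), the core train-with-SMOTE flow looks roughly like this; the project's actual file names and hyperparameters come from config/config.yml:

    # Minimal sketch of TF-IDF + SMOTE + training; artifact names are assumptions.
    import joblib
    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    df = pd.read_csv("data/processed_spam.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df["clean_text"], df["label"],
        test_size=0.2, stratify=df["label"], random_state=42,
    )

    vectorizer = TfidfVectorizer(max_features=5000)
    X_train_tfidf = vectorizer.fit_transform(X_train)    # fit on training data only

    # SMOTE synthesizes minority-class (spam) samples so both classes are balanced;
    # it is applied to the training split only, never to the test set.
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train_tfidf, y_train)

    model = MultinomialNB().fit(X_bal, y_bal)
    joblib.dump(model, "models/naive_bayes.pkl")          # hypothetical artifact names
    joblib.dump(vectorizer, "models/tfidf_vectorizer.pkl")

Applying SMOTE after the train/test split (and after vectorization) keeps synthetic samples out of the evaluation data.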

3. Model Evaluation

Evaluate the trained models on the test set and generate a performance report. Metrics are saved to models/metrics.json and plots to models/plots/.

python entrypoint/evaluate.py
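
A hedged sketch of how such a report can be produced and written to models/metrics.json (artifact names, label values, and the JSON layout are assumptions):

    # Sketch only; assumes string labels ("spam"/"ham") and the artifact
    # names used in the training sketch above.
    import json

    import joblib
    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("data/processed_spam.csv")
    _, X_test, _, y_test = train_test_split(              # same split as in training
        df["clean_text"], df["label"],
        test_size=0.2, stratify=df["label"], random_state=42,
    )

    model = joblib.load("models/naive_bayes.pkl")
    vectorizer = joblib.load("models/tfidf_vectorizer.pkl")
    y_pred = model.predict(vectorizer.transform(X_test))

    metrics = {"naive_bayes": {
        "accuracy": float(accuracy_score(y_test, y_pred)),
        "precision": float(precision_score(y_test, y_pred, pos_label="spam")),
        "recall": float(recall_score(y_test, y_pred, pos_label="spam")),
        "f1": float(f1_score(y_test, y_pred, pos_label="spam")),
    }}

    with open("models/metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)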

4. Run Streamlit App

Launch the web application to test the models with your own text and view evaluation metrics.

streamlit run app.py
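
At its core, the inference path of such an app is small. A minimal sketch, assuming the pickled artifacts from the training step (app.py itself also renders metrics, confusion matrices, and ROC curves):

    # Minimal inference sketch; the real app.py does much more.
    import joblib
    import streamlit as st

    model = joblib.load("models/naive_bayes.pkl")            # hypothetical names
    vectorizer = joblib.load("models/tfidf_vectorizer.pkl")

    st.title("Email Spam Detector")
    text = st.text_area("Paste an email body:")
    if st.button("Classify") and text:
        # A real app should apply the same preprocessing as training first.
        prediction = model.predict(vectorizer.transform([text]))[0]
        st.write(f"Prediction: **{prediction}**")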

📂 Project Structure

  • config/: Configuration files (e.g., config.yml for file paths and model hyperparameters; a loading sketch follows this list).
  • data/: Storage for raw (spam.csv) and processed (processed_spam.csv) datasets.
  • entrypoint/: Scripts for training (train.py) and evaluation (evaluate.py).
  • models/: Saved model artifacts (.pkl, .h5), evaluation metrics (metrics.json), and plots (plots/).
  • notebooks/: Jupyter notebooks for initial data exploration and experimentation.
  • src/: Source code for core functionality (e.g., preprocessing.py).
  • app.py: Streamlit application entry point.
  • requirements.txt: List of Python dependencies.
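
The scripts read their settings from config/config.yml rather than hard-coding them. A hypothetical sketch of how that file might be loaded (the key names below are placeholders, not the project's actual schema):

    # Hypothetical config loading; key names are placeholders.
    import yaml

    with open("config/config.yml") as f:
        config = yaml.safe_load(f)

    raw_path = config["data"]["raw_path"]        # e.g. data/spam.csv (assumed key)
    models_dir = config["paths"]["models_dir"]   # e.g. models/ (assumed key)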

📊 Models Implemented

  • Naive Bayes (MultinomialNB): A baseline probabilistic classifier suitable for text data.
  • Logistic Regression: A robust linear model for binary classification.
  • Linear SVM: A Support Vector Machine optimized for high-dimensional sparse data (like text).
  • Isolation Forest: An unsupervised learning algorithm for anomaly detection.
  • CNN (Convolutional Neural Network): A deep learning model that captures local patterns in text data.
  • RNN (Recurrent Neural Network): A deep learning model (LSTM) capable of capturing sequential dependencies in text.
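
For orientation, the classical models above are typically instantiated in scikit-learn as follows; the project's actual hyperparameters live in config/config.yml, so the values here are library defaults, not the configured ones. The CNN and LSTM models are built with a deep learning framework and are omitted from this sketch.

    # scikit-learn constructors for the classical models (defaults shown).
    from sklearn.ensemble import IsolationForest
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    models = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "linear_svm": LinearSVC(),
        # IsolationForest is unsupervised: fit on normal (ham) traffic,
        # it flags statistical outliers as potential spam.
        "isolation_forest": IsolationForest(random_state=42),
    }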
