Course: Keamanan Data (Data Security)
Lecturer: Puguh Hiskiawan
Contributors:
- Jason Jesse Joel Polii
- Christofer Agatha Ho
- Paskalis Peter
This project implements a machine learning pipeline for detecting spam emails. It includes data preprocessing, model training with handling for imbalanced data (SMOTE), model evaluation, and a Streamlit web application for real-time inference.
## Features

- Data Preprocessing: Text cleaning, tokenization, stopword removal, and stemming (see the first sketch after this list).
- Model Training: Trains Naive Bayes, Logistic Regression, and Linear SVM models (see Models below for the full list, including the deep learning variants).
- Imbalanced Data Handling: Uses SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset (see the second sketch after this list).
- Evaluation: Comprehensive evaluation metrics (Accuracy, Precision, Recall, F1-Score) saved to JSON.
- Web App: Interactive Streamlit dashboard for spam detection, model performance visualization, and confusion matrix/ROC curve analysis.
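The preprocessing step referenced above might look roughly like the following. This is a minimal sketch using NLTK's English stopword list and the Porter stemmer, not the exact logic in `src/preprocessing.py`:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # one-time corpus download

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    # Clean: lowercase and strip everything except letters and whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenize: simple whitespace split (the project may use a real tokenizer)
    tokens = text.split()
    # Remove stopwords, then stem each remaining token
    tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(tokens)

print(preprocess("FREE entry!! Win a prize now, click here"))
# -> "free entri win prize click"
```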
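The SMOTE step, in turn, is essentially a one-liner with the `imbalanced-learn` package; the toy dataset here is a stand-in for the project's real TF-IDF features:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced dataset (90% ham / 10% spam) standing in for the real features
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))  # e.g. Counter({0: 179, 1: 21})

# Oversample the minority class by synthesizing new points between neighbors
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))  # classes are now balanced
```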
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/Jsznn/project-keamananData.git
  cd project-keamananData
  ```

- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv venv

  # Windows
  .\venv\Scripts\activate

  # macOS/Linux
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Usage

Clean and transform the raw dataset (`data/spam.csv`); this generates `data/processed_spam.csv`:

```bash
python src/preprocessing.py
```

Train the models and save them to the `models/` directory; this script also saves the TF-IDF vectorizer and tokenizer:

```bash
python entrypoint/train.py
```
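Persisting the vectorizer together with the model is what lets `evaluate.py` and `app.py` transform new text identically. A minimal sketch of that pattern with joblib follows; the artifact filenames here are assumptions, not necessarily what `train.py` uses:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the processed dataset
texts = ["win a free prize now", "meeting at 10am tomorrow"]
labels = [1, 0]  # assumed encoding: 1 = spam, 0 = ham

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

# Persist both artifacts; filenames here are hypothetical
joblib.dump(vectorizer, "models/tfidf_vectorizer.pkl")
joblib.dump(model, "models/naive_bayes.pkl")
```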
Evaluate the trained models on the test set and generate a performance report; metrics are saved to `models/metrics.json` and plots to `models/plots/`:

```bash
python entrypoint/evaluate.py
```
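A minimal sketch of how such a report can be produced with scikit-learn; the actual keys in `models/metrics.json` may differ:

```python
import json
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder predictions standing in for the real test-set output
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1_score": f1_score(y_true, y_pred),
}

with open("models/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```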
Launch the web application to test the models with your own text and view the evaluation metrics:

```bash
streamlit run app.py
```
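The core of such an app can be as small as the sketch below; the artifact paths and label encoding are assumptions carried over from the training sketch above, not necessarily what `app.py` does:

```python
import joblib
import streamlit as st

# Hypothetical artifact paths, assuming they were produced by train.py
model = joblib.load("models/naive_bayes.pkl")
vectorizer = joblib.load("models/tfidf_vectorizer.pkl")

st.title("Spam Detector")
text = st.text_area("Paste an email message:")

if st.button("Classify") and text:
    features = vectorizer.transform([text])
    prediction = model.predict(features)[0]
    st.write("Spam" if prediction == 1 else "Ham")  # assumed encoding
```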
## Project Structure

- `config/`: Configuration files (e.g., `config.yml` for file paths and model hyperparameters).
- `data/`: Storage for the raw (`spam.csv`) and processed (`processed_spam.csv`) datasets.
- `entrypoint/`: Scripts for training (`train.py`) and evaluation (`evaluate.py`).
- `models/`: Saved model artifacts (`.pkl`, `.h5`), evaluation metrics (`metrics.json`), and plots (`plots/`).
- `notebooks/`: Jupyter notebooks for initial data exploration and experimentation.
- `src/`: Source code for core functionality (e.g., `preprocessing.py`).
- `app.py`: Streamlit application entry point.
- `requirements.txt`: List of Python dependencies.
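Config values can be read with PyYAML; the keys below are purely hypothetical stand-ins for whatever `config/config.yml` actually defines:

```python
import yaml

with open("config/config.yml") as f:
    config = yaml.safe_load(f)

# Hypothetical keys; inspect the real file for the actual structure
raw_path = config["data"]["raw_path"]              # e.g. "data/spam.csv"
processed_path = config["data"]["processed_path"]  # e.g. "data/processed_spam.csv"
```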
## Models

- Naive Bayes (MultinomialNB): A baseline probabilistic classifier suitable for text data.
- Logistic Regression: A robust linear model for binary classification.
- Linear SVM: A Support Vector Machine optimized for high-dimensional sparse data (like text).
- Isolation Forest: An unsupervised learning algorithm for anomaly detection.
- CNN (Convolutional Neural Network): A deep learning model that captures local patterns in text data.
- RNN (Recurrent Neural Network): A deep learning model (LSTM) capable of capturing sequential dependencies in text.
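For orientation, the three classical models above map directly onto scikit-learn estimators; the hyperparameters here are illustrative assumptions rather than the project's tuned settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Tiny toy corpus so the snippet runs standalone
texts = ["win cash now", "free prize waiting", "see you at lunch", "meeting notes attached"]
y_train = [1, 1, 0, 0]  # assumed encoding: 1 = spam, 0 = ham
X_train = TfidfVectorizer().fit_transform(texts)

models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "trained")
```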