Course: Keamanan Data (Data Security)
Lecturer: Puguh Hiskiawan
Contributors:
- Jason Jesse Joel Polii
- Christofer Agatha Ho
- Paskalis Peter
This project implements a machine learning pipeline for detecting spam emails. It includes data preprocessing, model training with handling for imbalanced data (SMOTE), model evaluation, and a Streamlit web application for real-time inference.
## Features

- Data Preprocessing: Text cleaning, tokenization, stopword removal, and stemming (see the first sketch after this list).
- Model Training: Trains Naive Bayes, Logistic Regression, and Linear SVM models (see Models below for the full list, including the deep learning variants).
- Imbalanced Data Handling: Uses SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset (see the second sketch after this list).
- Evaluation: Comprehensive evaluation metrics (Accuracy, Precision, Recall, F1-Score) saved to JSON.
- Web App: Interactive Streamlit dashboard for spam detection, model performance visualization, and confusion matrix/ROC curve analysis.
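The preprocessing step referenced above might look roughly like the following. This is a minimal sketch using NLTK's English stopword list and the Porter stemmer, not the exact logic in `src/preprocessing.py`:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # one-time corpus download

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    # Clean: lowercase and strip everything except letters and whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenize: simple whitespace split (the project may use a real tokenizer)
    tokens = text.split()
    # Remove stopwords, then stem each remaining token
    tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(tokens)

print(preprocess("FREE entry!! Win a prize now, click here"))
# -> "free entri win prize click"
```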
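The SMOTE step, in turn, is essentially a one-liner with the `imbalanced-learn` package; the toy dataset here is a stand-in for the project's real TF-IDF features:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced dataset (90% ham / 10% spam) standing in for the real features
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))  # e.g. Counter({0: 179, 1: 21})

# Oversample the minority class by synthesizing new points between neighbors
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))  # classes are now balanced
```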
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/Jsznn/project-keamananData.git
  cd project-keamananData
  ```

- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv venv

  # Windows
  .\venv\Scripts\activate

  # macOS/Linux
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Usage

Clean and transform the raw dataset (`data/spam.csv`); this generates `data/processed_spam.csv`:

```bash
python src/preprocessing.py
```

Train the models and save them to the `models/` directory; this script also saves the TF-IDF vectorizer and tokenizer:

```bash
python entrypoint/train.py
```
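Persisting the vectorizer together with the model is what lets `evaluate.py` and `app.py` transform new text identically. A minimal sketch of that pattern with joblib follows; the artifact filenames here are assumptions, not necessarily what `train.py` uses:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the processed dataset
texts = ["win a free prize now", "meeting at 10am tomorrow"]
labels = [1, 0]  # assumed encoding: 1 = spam, 0 = ham

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

# Persist both artifacts; filenames here are hypothetical
joblib.dump(vectorizer, "models/tfidf_vectorizer.pkl")
joblib.dump(model, "models/naive_bayes.pkl")
```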
Evaluate the trained models on the test set and generate a performance report; metrics are saved to `models/metrics.json` and plots to `models/plots/`:

```bash
python entrypoint/evaluate.py
```
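A minimal sketch of how such a report can be produced with scikit-learn; the actual keys in `models/metrics.json` may differ:

```python
import json
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder predictions standing in for the real test-set output
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1_score": f1_score(y_true, y_pred),
}

with open("models/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```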
Launch the web application to test the models with your own text and view the evaluation metrics:

```bash
streamlit run app.py
```
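The core of such an app can be as small as the sketch below; the artifact paths and label encoding are assumptions carried over from the training sketch above, not necessarily what `app.py` does:

```python
import joblib
import streamlit as st

# Hypothetical artifact paths, assuming they were produced by train.py
model = joblib.load("models/naive_bayes.pkl")
vectorizer = joblib.load("models/tfidf_vectorizer.pkl")

st.title("Spam Detector")
text = st.text_area("Paste an email message:")

if st.button("Classify") and text:
    features = vectorizer.transform([text])
    prediction = model.predict(features)[0]
    st.write("Spam" if prediction == 1 else "Ham")  # assumed encoding
```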
## Project Structure

- `config/`: Configuration files (e.g., `config.yml` for file paths and model hyperparameters).
- `data/`: Storage for the raw (`spam.csv`) and processed (`processed_spam.csv`) datasets.
- `entrypoint/`: Scripts for training (`train.py`) and evaluation (`evaluate.py`).
- `models/`: Saved model artifacts (`.pkl`, `.h5`), evaluation metrics (`metrics.json`), and plots (`plots/`).
- `notebooks/`: Jupyter notebooks for initial data exploration and experimentation.
- `src/`: Source code for core functionality (e.g., `preprocessing.py`).
- `app.py`: Streamlit application entry point.
- `requirements.txt`: List of Python dependencies.
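Config values can be read with PyYAML; the keys below are purely hypothetical stand-ins for whatever `config/config.yml` actually defines:

```python
import yaml

with open("config/config.yml") as f:
    config = yaml.safe_load(f)

# Hypothetical keys; inspect the real file for the actual structure
raw_path = config["data"]["raw_path"]              # e.g. "data/spam.csv"
processed_path = config["data"]["processed_path"]  # e.g. "data/processed_spam.csv"
```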
## Models

- Naive Bayes (MultinomialNB): A baseline probabilistic classifier suitable for text data.
- Logistic Regression: A robust linear model for binary classification.
- Linear SVM: A Support Vector Machine optimized for high-dimensional sparse data (like text).
- Isolation Forest: An unsupervised learning algorithm for anomaly detection.
- CNN (Convolutional Neural Network): A deep learning model that captures local patterns in text data.
- RNN (Recurrent Neural Network): A deep learning model (LSTM) capable of capturing sequential dependencies in text.
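For orientation, the three classical models above map directly onto scikit-learn estimators; the hyperparameters here are illustrative assumptions rather than the project's tuned settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Tiny toy corpus so the snippet runs standalone
texts = ["win cash now", "free prize waiting", "see you at lunch", "meeting notes attached"]
y_train = [1, 1, 0, 0]  # assumed encoding: 1 = spam, 0 = ham
X_train = TfidfVectorizer().fit_transform(texts)

models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "trained")
```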