Natural Language Processing for Spam Detection and Text Generation

1. Project Overview

This repository applies traditional machine learning (SVM) and deep learning (LSTM) techniques to classify SMS messages as spam or ham. The project also includes a text generation model that produces spam-like messages, which are subsequently used to evaluate the performance of the classifiers.

Key Objectives:

Spam Detection: Using different feature extraction techniques (TF-IDF and Word2Vec) to classify messages into spam or ham.
Text Generation: Implementing an LSTM-based model to generate spam-like messages and test their classification.

2. Files in this Repository

Natural Language Analysis Report.pdf: The report contains detailed explanations and results for both SVM and LSTM models, along with the text generation task.
NLP_code.ipynb: The Jupyter notebook containing the code used to preprocess the data, train the models, and generate spam text.
SpamVsHam.tsv: The original dataset consists of SMS messages labeled spam or ham.

3. Instructions for Running the Code

To run the notebook (NLP_code.ipynb), ensure you have the following installed:

Python 3.13.0
Libraries: pandas, matplotlib, seaborn, sklearn, tensorflow, keras, nltk

Steps:

Install the necessary libraries using:

pip install pandas matplotlib seaborn scikit-learn tensorflow keras nltk

Open the Jupyter notebook in your environment and run the cells sequentially to replicate the results.
The notebook includes:
- Data preprocessing (handling missing values, tokenization, and vectorization).
- SVM model training with different feature extraction techniques (TF-IDF and Word2Vec).
- LSTM model training using both TF-IDF and Word2Vec features.
- Text generation using a Recurrent Neural Network (LSTM) to generate spam-like messages.
- Testing the classifiers using both real and generated data.

Expected Outputs:

Performance Metrics: Accuracy, precision, recall, and F1-scores for both SVM and LSTM models.
Text Generation: Generated spam-like messages used to evaluate the classifiers.

4. Analysis Overview

4.1 Feature Extraction Techniques

TF-IDF: Captures the importance of words based on their frequency across the corpus.
Word2Vec: Creates dense word embeddings that capture semantic relationships between words.

4.2 Handling Data Imbalance

The dataset is imbalanced, with 86.6% of the messages labeled as ham and only 13.4% as spam. This imbalance can lead to biased models that perform well for ham messages but poorly for spam.
Class Weighting: To address this, class weights were applied during model training. This technique assigns higher weights to the minority class (spam) to encourage the model to pay more attention to it.
SMOTE (Synthetic Minority Over-sampling Technique): I also considered an alternative to balance the classes by generating synthetic spam samples, but the best results were achieved using class weighting.

4.3 SVM Model Results

The SVM model was trained on both TF-IDF and Word2Vec features.
Best results: The SVM with TF-IDF features achieved an accuracy of 98%, significantly outperforming the Word2Vec-based model, especially in identifying spam messages.
- With class weighting, the recall for spam improved, reducing the number of missed spam messages.

4.4 LSTM Model Results

The LSTM model was trained using both TF-IDF and Word2Vec features.
Best results: The LSTM model with TF-IDF achieved 98% accuracy, but training with Word2Vec required more computational resources and slightly decreased accuracy.
- Class weights were also applied to the LSTM model, improving the model’s ability to identify spam messages, though at the cost of slightly increased false positives for ham messages.

4.5 Text Generation Results

The text generation model produced spam-like messages using the LSTM architecture.
These generated messages were used to evaluate the SVM and LSTM classifiers. After applying class weights, the SVM model correctly classified 95% of the generated spam, improving performance in handling spam detection.

5. Conclusion

This project successfully demonstrates the application of traditional machine learning (SVM) and deep learning (LSTM) techniques for spam detection. Addressing the data imbalance through class weighting significantly improved the recall for spam messages. Additionally, the text generation model highlights the ability of neural networks to generate coherent spam-like content, which can be used to evaluate classifier performance.

6. License

This project is for academic purposes only as part of the Natural Language Analysis module.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
NLP_code.ipynb		NLP_code.ipynb
Natural Language Analysis Report.pdf		Natural Language Analysis Report.pdf
README.md		README.md
SpamVsHam.tsv		SpamVsHam.tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Natural Language Processing for Spam Detection and Text Generation

1. Project Overview

Key Objectives:

2. Files in this Repository

3. Instructions for Running the Code

Steps:

Expected Outputs:

4. Analysis Overview

4.1 Feature Extraction Techniques

4.2 Handling Data Imbalance

4.3 SVM Model Results

4.4 LSTM Model Results

4.5 Text Generation Results

5. Conclusion

6. License

About

Uh oh!

Releases

Packages

Languages

rubyyy1118/NLP_SPAM_Detection_and_Text_Generation

Folders and files

Latest commit

History

Repository files navigation

Natural Language Processing for Spam Detection and Text Generation

1. Project Overview

Key Objectives:

2. Files in this Repository

3. Instructions for Running the Code

Steps:

Expected Outputs:

4. Analysis Overview

4.1 Feature Extraction Techniques

4.2 Handling Data Imbalance

4.3 SVM Model Results

4.4 LSTM Model Results

4.5 Text Generation Results

5. Conclusion

6. License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages