The Spam Email Detection project uses machine learning to identify and filter spam emails from legitimate ones (ham). By analyzing the content and structure of emails, the system learns patterns and features commonly found in spam, such as specific keywords, links, or sender information.
The dataset is obtained from Kaggle. It contains a total of 5,572 sample emails, of which 4,825 are ham (not spam) and 747 are spam.
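A minimal sketch of loading the dataset and checking the class balance; the file name spam.csv, the latin-1 encoding, and the column names v1/v2 are assumptions based on the commonly used Kaggle release of this dataset.

```python
# Assumed layout of the Kaggle CSV: column "v1" holds the label, "v2" the text.
import pandas as pd

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]

print(df.shape)                    # expected: (5572, 2)
print(df["label"].value_counts())  # expected: ham 4825, spam 747
```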
Exploratory Data Analysis (EDA) : The initial examination of the data to identify patterns, trends, anomalies, and potential insights.
Data Cleaning : The process of handling missing values, duplicates, and errors in the data.
1) Distribution of Labels.
2) Average Length of Emails for Spam and Ham.
3) Average Word Count of Emails for Spam and Ham.
4) Relationship between Length and Spam.
5) Relationship between Features.
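A rough EDA sketch of the five points above; the file name and column layout are the same assumptions as in the loading sketch.

```python
import pandas as pd

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]
df = df.drop_duplicates().dropna()                            # basic data cleaning

df["length"] = df["text"].str.len()                           # characters per email
df["num_words"] = df["text"].str.split().str.len()            # words per email

print(df["label"].value_counts())                             # 1) distribution of labels
print(df.groupby("label")["length"].mean())                   # 2) average length per class
print(df.groupby("label")["num_words"].mean())                # 3) average word count per class
print(df["length"].corr(df["label"].eq("spam").astype(int)))  # 4) relationship between length and spam
print(df[["length", "num_words"]].corr())                     # 5) relationship between features
```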
The basic data pre-processing steps for spam detection are listed below, followed by a short code sketch:
1) Converting the input to lowercase.
2) Tokenising the text into individual words.
3) Removing all special characters.
4) Removing stop words and punctuation.
5) Computing the word frequency of all words.
6) Stemming words to reduce them to their root forms.
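A minimal sketch of these preprocessing steps using NLTK; the choice of tokenizer and stemmer (word_tokenize and PorterStemmer) is an illustrative assumption.

```python
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer model
nltk.download("stopwords", quiet=True)   # stop word list

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()                                          # 1) lowercase
    tokens = word_tokenize(text)                                 # 2) tokenisation
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]       # 3) remove special characters
    tokens = [t for t in tokens if t and t not in stop_words]    # 4) stop words and punctuation
    return [stemmer.stem(t) for t in tokens]                     # 6) stemming

tokens = preprocess("Congratulations!! You have WON a FREE prize, call now.")
print(Counter(tokens))                                           # 5) word frequency
```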
Label Encoding : A technique used to transform categorical labels (text labels like "ham" and "spam") into numeric labels for machine learning algorithms, which often work better with numerical data.
Vectorization : A process to convert textual data into numerical features so that machine learning models can process it. Textual data is unstructured, and vectorization makes it machine-readable.
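A minimal sketch of both steps, assuming a simple whitespace-based vocabulary; the vocabulary construction and padding scheme are illustrative, not the project's exact approach.

```python
from collections import Counter
from sklearn.preprocessing import LabelEncoder

texts = ["win a free prize now", "are we still meeting for lunch"]
labels = ["spam", "ham"]

# Label encoding: "ham" -> 0, "spam" -> 1
y = LabelEncoder().fit_transform(labels)

# Vectorization: map each word to an integer index (0 is reserved for padding
# and unknown words), then pad every sequence to the same length.
counter = Counter(word for text in texts for word in text.split())
vocab = {word: i + 1 for i, (word, _) in enumerate(counter.most_common())}
max_seq_len = 10

def vectorize(text):
    ids = [vocab.get(word, 0) for word in text.split()][:max_seq_len]
    return ids + [0] * (max_seq_len - len(ids))

X = [vectorize(text) for text in texts]
print(y)      # [1 0]
print(X[0])   # padded index sequence for the first text
```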
Word and Positional Embedding Layer : Projects the input indices (i.e., the tokens) into a vector space that gives each token a unique and informative representation, and adds a positional embedding so the model can make use of token order.
Transformer Layer : The Transformer network processes input text as word embeddings through several Transformer blocks. It generates a vector representation of the text, which is used for classification with a softmax layer. Using self-attention and multi-head attention, the network effectively captures relationships between words, enabling accurate predictions.
Encoder Layer : The encoder consists of multiple layers of attention mechanisms and feed-forward neural networks.
Self-Attention : Each word or token in the email attends to every other token in the email to capture contextual relationships.
Multi-Head Attention : The model uses multiple attention heads to capture different aspects of the relationships between words.
Feed-Forward Layer : A feed-forward neural network processes the information. The output of this layer is then passed through a normalization layer.
Layer Normalization : Normalization is used to stabilize the training process by scaling and shifting the outputs of each layer.
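A minimal PyTorch sketch of the components described above: word + positional embeddings followed by a Transformer encoder block with multi-head self-attention, a feed-forward network, residual connections, and layer normalization. The module names and the use of nn.MultiheadAttention are illustrative choices, not necessarily the project's exact implementation.

```python
import torch
import torch.nn as nn

class WordPositionEmbedding(nn.Module):
    """Word embedding plus a learned positional embedding."""
    def __init__(self, vocab_size, max_seq_len, embed_dim):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, token_ids):                      # token_ids: (batch, seq)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word_emb(token_ids) + self.pos_emb(positions)

class TransformerBlock(nn.Module):
    """Self-attention and feed-forward sub-layers, each followed by layer norm."""
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):                              # x: (batch, seq, embed_dim)
        attn_out, _ = self.attn(x, x, x)               # every token attends to every other token
        x = self.norm1(x + attn_out)                   # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))                # feed-forward + residual + layer norm
        return x
```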
num_heads: Number of parallel attention mechanisms used in a multi-head attention layer.
vocab_size: The total number of unique tokens in the model's vocabulary.
embed_dim: The size of the vector representation for each token in the input sequence.
ff_dim: The dimensionality of the inner layer in the feedforward neural network used in the Transformer model.
max_seq_len: The maximum length of input sequences the model can process.
Number of Labels: 2 (Spam vs. Ham)
Input Sequence Length: Up to 512 tokens.
Number of Transformer Layers: 6 layers.
Number of Attention Heads per Layer: 12 heads.
Intermediate Feedforward Layer Size: 3072.
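Collected as a configuration, these values might look as follows; vocab_size and embed_dim are not stated in the report and are assumptions (embed_dim=768 is chosen so it divides evenly across 12 heads and matches the 3072 feed-forward size).

```python
# Assumed values are marked; the rest follow the configuration listed above.
config = dict(
    vocab_size=20000,   # assumption: depends on the tokenizer/vocabulary used
    embed_dim=768,      # assumption: divisible by 12 heads, 4 * 768 = 3072
    num_heads=12,
    ff_dim=3072,
    max_seq_len=512,
    num_layers=6,
    num_labels=2,
)
```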
Classification Head : This layer takes the output from the Transformer and produces a single probability score representing the likelihood that the email is spam (as opposed to ham).
Softmax Layer : Converts scores into probabilities, showing how likely the email is to belong to each class.
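A sketch of the full classifier, reusing the TransformerBlock and WordPositionEmbedding modules and the config dictionary from the earlier sketches. It uses mean pooling and a single output logit (a sigmoid over this logit pairs with the BCEWithLogitsLoss used for training); a two-unit softmax head is the equivalent alternative described above.

```python
import torch.nn as nn

class SpamTransformer(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.embed = WordPositionEmbedding(
            cfg["vocab_size"], cfg["max_seq_len"], cfg["embed_dim"]
        )
        self.blocks = nn.ModuleList(
            [TransformerBlock(cfg["embed_dim"], cfg["num_heads"], cfg["ff_dim"])
             for _ in range(cfg["num_layers"])]
        )
        self.classifier = nn.Linear(cfg["embed_dim"], 1)   # classification head

    def forward(self, token_ids):
        x = self.embed(token_ids)               # word + positional embeddings
        for block in self.blocks:
            x = block(x)                        # stacked Transformer encoder blocks
        x = x.mean(dim=1)                       # pool token representations
        return self.classifier(x).squeeze(-1)   # one spam logit per email

model = SpamTransformer(config)
```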
Number of Training Epochs: Specifies how many complete passes through the training dataset the model will perform.
Training Batch Size: Determines the number of samples processed together during a forward and backward pass.
Evaluation Batch Size: The number of samples processed together during evaluation; it can be larger than the training batch size because no gradients are stored, allowing faster inference while maintaining memory efficiency.
Learning Rate: Defines the step size at which the optimizer updates the model weights.
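The report does not state the exact values for these settings; the following are placeholder values for illustration only.

```python
# Hypothetical values, not the project's actual settings.
training_args = dict(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,   # larger than the training batch size for faster inference
    learning_rate=2e-5,
)
```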
- Input Preparation : The input consists of tokenized sequences (texts) with their binary labels (spam: 1, ham: 0).
- Forward Pass : The tokenized input is passed through the Transformer model, which outputs logits for binary classification.
- Loss Calculation : The Binary Cross-Entropy Loss with Logits is calculated from the logits and the true labels.
- Backpropagation : Gradients of the loss with respect to the model parameters are calculated.
- Optimization : The optimizer updates the model's parameters using these gradients.
- Validation : The model is evaluated on the validation set after each epoch (see the training-loop sketch below).
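A minimal PyTorch training-loop sketch of these steps; model, train_loader, and val_loader are assumed to be defined as in the earlier sketches, and the epoch count is a placeholder.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
num_epochs = 3                                        # placeholder value

for epoch in range(num_epochs):
    model.train()
    for token_ids, labels in train_loader:            # input preparation
        logits = model(token_ids)                     # forward pass
        loss = criterion(logits, labels.float())      # loss calculation
        optimizer.zero_grad()
        loss.backward()                               # backpropagation
        optimizer.step()                              # optimization

    model.eval()                                      # validation
    with torch.no_grad():
        val_loss = sum(
            criterion(model(x), y.float()).item() for x, y in val_loader
        ) / len(val_loader)
    print(f"epoch {epoch + 1}: validation loss = {val_loss:.4f}")
```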
Loss Function : torch.nn.BCEWithLogitsLoss(), which combines a sigmoid layer with binary cross-entropy loss for better numerical stability.
Optimizer : torch.optim.AdamW with weight_decay=0.01 for regularization.
Learning Rate Scheduler : Linear decay scheduler with warmup (e.g., get_scheduler from HuggingFace).
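A setup sketch for the loss, optimizer, and scheduler named above; the 10% warmup ratio is an assumption, and num_epochs / train_loader come from the training sketch.

```python
import torch
from transformers import get_scheduler

criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_scheduler(
    "linear",                                          # linear decay after warmup
    optimizer=optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),    # assumed 10% warmup
    num_training_steps=num_training_steps,
)
# lr_scheduler.step() is called after each optimizer.step() in the training loop.
```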
Loss : The average cross-entropy loss measures the difference between the predicted probabilities and the actual class labels.
Accuracy : Percentage of correctly classified examples.
Precision : Measures the proportion of positive predictions (spam) that were actually correct.
Recall : Measures the proportion of actual positives (spam) that were correctly identified.
F1-Score : The harmonic mean of precision and recall, providing a balance between the two.
F1-Score (weighted) : The average of the per-class F1-scores, weighted by the number of samples in each class.
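A sketch of computing these metrics with scikit-learn; y_true and y_prob are illustrative placeholders for the validation labels and the model's predicted spam probabilities.

```python
from sklearn.metrics import (
    accuracy_score, f1_score, log_loss, precision_score, recall_score,
)

y_true = [0, 0, 1, 1, 0, 1]                    # actual labels (spam = 1, ham = 0)
y_prob = [0.1, 0.4, 0.8, 0.9, 0.2, 0.3]        # predicted spam probabilities
y_pred = [int(p >= 0.5) for p in y_prob]       # thresholded class predictions

print("Loss         :", log_loss(y_true, y_prob))                 # cross-entropy loss
print("Accuracy     :", accuracy_score(y_true, y_pred))
print("Precision    :", precision_score(y_true, y_pred))          # spam precision
print("Recall       :", recall_score(y_true, y_pred))             # spam recall
print("F1-Score     :", f1_score(y_true, y_pred))
print("F1 (weighted):", f1_score(y_true, y_pred, average="weighted"))
```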