This document describes the Machine Learning models used in the chat application for:
- Spam Detection
- Toxic Message Detection

Spam detection model:
- File: `spam.csv`
- Columns used:
  - `v1` → label (ham/spam)
  - `v2` → message text
- Renamed columns to `label` and `message`
- Converted labels:
  - `ham` → 0
  - `spam` → 1
- Removed unnecessary columns
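
The preprocessing steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual script: the rows are invented, the `Unnamed: 2` column is a hypothetical example of an unnecessary column, and the real data would presumably be loaded from `spam.csv` with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical in-memory stand-in for spam.csv (rows invented for illustration).
raw = pd.DataFrame(
    {
        "v1": ["ham", "spam", "ham"],
        "v2": ["Let's meet tomorrow", "Free money click now!!!", "See you at lunch"],
        "Unnamed: 2": [None, None, None],  # hypothetical unnecessary column
    }
)

# Keep only the two columns used and give them descriptive names.
df = raw[["v1", "v2"]].rename(columns={"v1": "label", "v2": "message"})

# Convert the text labels to integers: ham -> 0, spam -> 1.
df["label"] = df["label"].map({"ham": 0, "spam": 1})

print(df)
```

Selecting `v1` and `v2` explicitly is one way to drop the unnecessary columns; dropping them by name would work equally well.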
- Technique: TF-IDF Vectorization
- Parameters:
  - `stop_words = 'english'`
  - `max_features = 5000`
- Algorithm: Multinomial Naive Bayes
- Train/Test Split:
  - 80% training
  - 20% testing
- Random State: 42
- Evaluation metrics:
  - Accuracy Score
  - Classification Report (Precision, Recall, F1-score)
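
The training and evaluation steps listed above might look roughly like this with scikit-learn. The toy messages are invented, and fitting the vectorizer on the training split only is an assumption; the document does not say whether vectorization happens before or after the split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data invented for illustration; 1 = spam, 0 = ham.
messages = [
    "Free money click now", "Win a free prize today", "Claim your cash reward now",
    "Free entry click to win", "Cheap loans click here",
    "Let's meet tomorrow", "See you at lunch", "Can you call me later",
    "Meeting moved to Friday", "Thanks for your help",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80% training, 20% testing, random_state=42.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=42
)

# Assumption: fit the vectorizer on the training messages only,
# then reuse it unchanged for the test messages.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
```

On a dataset this small the scores are meaningless; the sketch only shows how the split, the model, and the two metrics fit together.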
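
Prediction function:

    def predict_spam(text):
        # Vectorize the single message with the fitted TF-IDF vectorizer
        text_vec = vectorizer.transform([text])
        # Column 1 of predict_proba is the probability of label 1 (spam)
        prob = model.predict_proba(text_vec)[0][1]
        return prob

Examples:
- "Free money click now!!!" → High spam probability
- "Let's meet tomorrow" → Low spam probability

Model files:
- `spam_model.pkl`
- `spam_vectorizer.pkl`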
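
The `.pkl` extension suggests the model and vectorizer are serialized with Python's `pickle` module, although the document does not say so (`joblib` is another common choice). Under that assumption, saving and reloading the two files could look like the sketch below, which uses a temporary directory and a tiny invented training set.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny fitted model (invented data), only so there is something to save.
texts = ["Free money click now", "Let's meet tomorrow"]
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
model = MultinomialNB().fit(vectorizer.fit_transform(texts), [1, 0])

# Assumption: a temporary directory stands in for the application's own path.
out_dir = Path(tempfile.mkdtemp())
for name, obj in [("spam_model.pkl", model), ("spam_vectorizer.pkl", vectorizer)]:
    with open(out_dir / name, "wb") as f:
        pickle.dump(obj, f)

# Load both objects back, as an application might do at startup.
with open(out_dir / "spam_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
with open(out_dir / "spam_vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)

prob = loaded_model.predict_proba(loaded_vectorizer.transform(["Free money"]))[0][1]
print(prob)
```

The vectorizer has to be saved alongside the model because new messages must be transformed with the same vocabulary the model was trained on.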

Toxic message detection model:
- File: `data.csv`
- Columns used:
  - `comment_text`
  - `target`
- Created binary label:
  - `toxic = 1` if `target > 0.5`
  - else `0`
- Selected relevant columns: `comment_text`, `toxic`
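
A minimal pandas sketch of the labelling rule above, assuming `target` is a numeric score. The rows and the extra `id` column are invented for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for data.csv.
raw = pd.DataFrame(
    {
        "comment_text": ["You are stupid", "Have a great day", "Borderline remark"],
        "target": [0.9, 0.0, 0.5],
        "id": [1, 2, 3],  # hypothetical extra column
    }
)

# toxic = 1 if target > 0.5, else 0 (a score of exactly 0.5 maps to 0).
raw["toxic"] = (raw["target"] > 0.5).astype(int)

# Keep only the columns used for training.
df = raw[["comment_text", "toxic"]]
print(df)
```

Because the comparison is strict, a comment with `target` equal to 0.5 is labelled non-toxic.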
- Technique: TF-IDF Vectorization
- Parameters:
  - `stop_words = 'english'`
  - `max_features = 10000`
- Algorithm: Logistic Regression
- Parameters:
  - `max_iter = 1000`
- Train/Test Split:
  - 80% training
  - 20% testing
- Random State: 42
- Evaluation metrics:
  - Accuracy Score
  - Classification Report
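
For the toxicity model, the same steps with Logistic Regression could be sketched as below. The comments are invented, and fitting the vectorizer after the split is an assumption rather than something this document states.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Toy comments invented for illustration; 1 = toxic, 0 = non-toxic.
comments = [
    "You are stupid", "You are an idiot", "Shut up you fool",
    "What a stupid idea, idiot", "Nobody likes you, loser",
    "Have a great day", "Thanks for sharing this", "Nice work on the project",
    "See you at the meeting", "Great idea, well done",
]
toxic = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    comments, toxic, test_size=0.2, random_state=42
)

tfidf = TfidfVectorizer(stop_words="english", max_features=10000)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

# max_iter=1000 gives the solver more iterations to converge than the default.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)
print("Classes:", model.classes_)  # column order used by predict_proba
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
```

`model.classes_` shows the column order of `predict_proba`; with labels 0 and 1, index 1 is the toxic class.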
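
Prediction function:

    def predict_toxicity(text):
        # Vectorize the single message with the fitted TF-IDF vectorizer
        text_vec = tfidf.transform([text])
        # Column 1 of predict_proba is the probability of label 1 (toxic)
        prob = model.predict_proba(text_vec)[0][1]
        return prob

Examples:
- "You are stupid" → High toxicity
- "Have a great day" → Low toxicity

Model files:
- `toxic_model.pkl`
- `toxic_vectorizer.pkl`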

Both models are used for real-time message analysis:
- User sends message
- Message is vectorized
- Passed to ML model
- Probability score generated
- If threshold exceeded:
  - Flag as spam/toxic
  - Take action (warn, block, or filter)
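
The flow above can be sketched as a single function. The threshold values and the mapping from flags to actions are hypothetical, since the document does not specify them, and the scorers are passed in as plain callables so the example runs without trained models.

```python
from typing import Callable

# Hypothetical thresholds; the actual values are not given in this document.
SPAM_THRESHOLD = 0.8
TOXIC_THRESHOLD = 0.7


def moderate(
    text: str,
    spam_score: Callable[[str], float],
    toxic_score: Callable[[str], float],
) -> dict:
    """Score a message with both models and decide what to do with it."""
    spam_prob = spam_score(text)
    toxic_prob = toxic_score(text)

    flags = []
    if spam_prob > SPAM_THRESHOLD:
        flags.append("spam")
    if toxic_prob > TOXIC_THRESHOLD:
        flags.append("toxic")

    # Hypothetical policy: block spam, warn on toxicity, otherwise allow.
    if "spam" in flags:
        action = "block"
    elif "toxic" in flags:
        action = "warn"
    else:
        action = "allow"
    return {"flags": flags, "action": action}


# Stand-in scorers with fixed probabilities, used only to exercise the logic.
result = moderate("Free money click now!!!", lambda t: 0.95, lambda t: 0.10)
print(result)  # {'flags': ['spam'], 'action': 'block'}
```

In the application, `predict_spam` and `predict_toxicity` would presumably be passed as the two scorers.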

Possible improvements:
- Use deep learning models (LSTM or Transformers)
- Improve dataset quality
- Add multilingual support
- Real-time model retraining
- Context-aware toxicity detection

All model files:
- `spam_model.pkl`
- `spam_vectorizer.pkl`
- `toxic_model.pkl`
- `tfidf_vectorizer.pkl`

| Feature | Model | Technique |
|---|---|---|
| Spam Detection | Naive Bayes | TF-IDF |
| Toxic Detection | Logistic Regression | TF-IDF |