
Advanced news topic classification system using DistilBERT transformer achieving 92.5% accuracy with overfitting prevention techniques. Classifies news articles into World, Sports, Business, and Technology categories.


Advanced News Topic Classification with DistilBERT

News Classification Pipeline

A comprehensive deep learning project for news topic classification built on the DistilBERT transformer model. The system automatically categorizes news articles into four distinct topics with 92.5% validation accuracy while applying advanced overfitting-prevention techniques.


✨ Features

📦 Dataset & Model

  • AG News Corpus: 120,000 articles, 4 categories
  • DistilBERT-base-uncased (66M parameters)
  • Sequence classification head for multi-class prediction
  • Tokenization capped at 256 tokens
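
The 256-token cap can be sketched with the stock Hugging Face tokenizer; the checkpoint name `distilbert-base-uncased` is the standard one for this model and is assumed here:

```python
from transformers import AutoTokenizer

# Standard DistilBERT tokenizer; downloads from the Hugging Face Hub on first use.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encoded = tokenizer(
    "Wall St. rebounds as tech shares rally on upbeat earnings.",
    truncation=True,       # drop anything past the cap
    padding="max_length",  # pad short articles up to the cap
    max_length=256,        # the 256-token limit noted above
)
print(len(encoded["input_ids"]))  # 256
```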

🔧 ML Techniques

  • Early stopping with validation monitoring
  • Cosine learning rate scheduling
  • Weight decay (L2 regularization)
  • Dropout regularization (0.3)
  • Best model checkpointing

📊 Performance

  • Peak Accuracy: 92.92%
  • Final Accuracy: 92.46%
  • F1-Score: 92.44%
  • Training stopped early at 4,250 steps

⚙️ Production Features

  • Complete inference pipeline
  • Model serialization
  • Tested on real-world inputs

🧰 Tech Stack

  • PyTorch, Hugging Face Transformers
  • scikit-learn, pandas, matplotlib

🗂️ Project Structure

├── .gitignore
├── LICENSE
├── README.md
├── image.png
├── requirements.txt
└── News_Classification.ipynb


📊 Dataset Overview

The AG News Dataset includes categorized news articles across four domains:

| Category | Description | Training | Test |
|---|---|---|---|
| 🌍 World | Global news and international affairs | 30,000 | 1,900 |
| 🏈 Sports | Games, tournaments, and athlete updates | 30,000 | 1,900 |
| 💼 Business | Market, finance, and economic reports | 30,000 | 1,900 |
| 💻 Tech | Tech innovations, gadgets, and launches | 30,000 | 1,900 |

Details:

  • Total: 120,000 train + 7,600 test
  • Avg. Length: 150 words/article
  • Preprocessed via DistilBERT tokenizer (max_length=256)
  • Perfectly balanced dataset
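
AG News ships its labels as integer ids; the id-to-name mapping below follows the standard ordering distributed with the dataset (World=0, Sports=1, Business=2, Sci/Tech=3) and is the shape an `id2label` config for the classification head would take:

```python
# Standard AG News label ordering (ids 0-3).
ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def label_name(label_id: int) -> str:
    """Map a raw AG News label id to its human-readable category."""
    return ID2LABEL[label_id]

print(label_name(2))  # Business
```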

🛠️ Installation

# Step 1: Clone the repository
git clone https://github.com/X-XENDROME-X/News-Classification-Transformer.git

# Step 2: Set up virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Step 3: Install dependencies
pip install -r requirements.txt

# Step 4: Launch Jupyter
jupyter notebook News_Classification.ipynb

▶️ Usage

🔬 Full Training Pipeline

Inside the notebook:

  1. Environment setup
  2. Dataset loading & exploration
  3. Tokenization
  4. Model setup & training
  5. Evaluation
  6. Inference with real data

🎯 Quick Prediction

from transformers import pipeline
import torch

classifier = pipeline(
    "text-classification",
    model="./models/best_news_classifier",
    device=0 if torch.cuda.is_available() else -1
)

def classify_news(text):
    result = classifier(text)[0]
    return result['label'], result['score']

news_text = "Apple reports record quarterly earnings with strong iPhone sales driving revenue growth"
category, confidence = classify_news(news_text)

print(f"Category: {category}")
print(f"Confidence: {confidence:.3f}")

📈 Results & Visualizations

🧪 Training Performance

| Metric | Step 3,750 | Step 4,250 | Status |
|---|---|---|---|
| Validation Accuracy | 92.92% | 92.46% | ✅ Excellent |
| Validation Loss | 0.2199 | 0.2346 | ✅ Controlled |
| F1-Score | 92.93% | 92.44% | ✅ Balanced |
| Overfitting Gap | - | 6.7% | ✅ Minimal |

πŸ† Highlights

  • 🎯 92.92% peak accuracy
  • 🛡️ Early stopping effective
  • ⚡ Efficient: only 4,250 steps
  • 🎭 Balanced across all classes
  • 🚀 Production-ready pipeline

🔬 Technical Implementation

🧠 Overfitting Control

  • Early stopping (patience=2)
  • Dropout (0.3)
  • Weight decay
  • Cosine learning rate scheduler
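
Raising dropout to 0.3 is typically done through the model config. Which of DistilBERT's several dropout fields the notebook changes is an assumption here, so this sketch sets both the hidden-layer and classifier-head dropouts:

```python
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Bump dropout from the defaults (0.1 / 0.2) to 0.3, per the README.
config = DistilBertConfig(
    num_labels=4,              # World, Sports, Business, Tech
    dropout=0.3,               # dropout inside the transformer layers
    seq_classif_dropout=0.3,   # dropout before the classification head
)

# Randomly initialized here; the notebook would instead load pretrained weights
# with DistilBertForSequenceClassification.from_pretrained(..., config=config).
model = DistilBertForSequenceClassification(config)
```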

⚡ Optimization Techniques

  • Mixed precision (FP16)
  • Gradient accumulation
  • Dynamic padding
  • Best checkpoint saving

📈 Evaluation Metrics

  • Accuracy, Precision, Recall, F1
  • Confusion matrix
  • Per-class analysis
  • Real-world input validation
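
These metrics are usually computed in a `compute_metrics` hook of the shape the Hugging Face `Trainer` expects; a sketch using scikit-learn (the `weighted` averaging choice is an assumption):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Return accuracy/precision/recall/F1 from an (logits, labels) pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Tiny worked example: predictions are the argmax rows 0, 1, 2, 0,
# so 3 of the 4 labels match -> accuracy 0.75.
logits = np.array([[2.0, 0.1, 0.0, 0.0],
                   [0.0, 3.0, 0.0, 0.0],
                   [0.0, 0.0, 1.5, 0.2],
                   [0.9, 0.0, 0.0, 0.1]])
labels = np.array([0, 1, 2, 3])
print(compute_metrics((logits, labels))["accuracy"])  # 0.75
```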

📊 Model Comparison

| Model | Accuracy | Params | Training Time | Overfitting |
|---|---|---|---|---|
| DistilBERT (ours) | 92.46% | 66M | 4,250 steps | ✅ Low |
| BERT-base | ~94% | 110M | ~8,000 steps | Medium |
| Traditional ML | ~85% | <1M | Fast | High |
| Simple CNN | ~88% | ~10M | Medium | High |

🤝 Contributing

🧩 How to Contribute

  1. Fork the repo
  2. Create a feature branch
  3. Commit changes
  4. Push to your fork
  5. Open a PR

💡 Contribution Ideas

  • RoBERTa/ELECTRA integration
  • Multilingual support
  • Real-time API
  • Quantization/pruning
  • Advanced metrics
  • Deployment scripts (Docker, GCP, etc.)

📄 License

MIT License. See LICENSE for details.
