A comprehensive deep learning project for news topic classification built on the DistilBERT transformer model. The system automatically categorizes news articles into four distinct topics at roughly 92.5% test accuracy while applying a suite of overfitting-prevention techniques.
- Features
- Project Structure
- Dataset Overview
- Installation
- Usage
- Results & Visualizations
- Technical Implementation
- Contributing
- License
- AG News Corpus: 120,000 articles, 4 categories
- DistilBERT-base-uncased (66M parameters)
- Sequence classification head for multi-class prediction (model-setup sketch after this list)
- Tokenization capped at 256 tokens
- Early stopping with validation monitoring
- Cosine learning rate scheduling
- Weight decay (L2 regularization)
- Dropout regularization (0.3)
- Best model checkpointing
- Peak Accuracy: 92.92%
- Final Accuracy: 92.46%
- F1-Score: 92.44%
- Training stopped early at 4,250 steps
- Complete inference pipeline
- Model serialization
- Tested on real-world inputs
- PyTorch, Hugging Face Transformers
- scikit-learn, pandas, matplotlib
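The model setup above takes only a few lines with Hugging Face Transformers. A minimal sketch, assuming the stock `distilbert-base-uncased` checkpoint and the 0.3 dropout listed in the features; the exact notebook code may differ:

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "distilbert-base-uncased"

# Raise DistilBERT's dropout to 0.3 for stronger regularization
config = AutoConfig.from_pretrained(
    MODEL_NAME,
    num_labels=4,             # World, Sports, Business, Tech
    dropout=0.3,              # dropout inside the transformer layers
    seq_classif_dropout=0.3,  # dropout before the classification head
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, config=config)
```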
```
├── .gitignore
├── LICENSE
├── README.md
├── image.png
├── requirements.txt
└── News_Classification.ipynb
```
The AG News Dataset includes categorized news articles across four domains:
| Category | Description | Training | Test |
|---|---|---|---|
| World | Global news and international affairs | 30,000 | 1,900 |
| Sports | Games, tournaments, and athlete updates | 30,000 | 1,900 |
| Business | Market, finance, and economic reports | 30,000 | 1,900 |
| Tech | Tech innovations, gadgets, and launches | 30,000 | 1,900 |
Details:
- Total: 120,000 train + 7,600 test
- Avg. Length: 150 words/article
- Preprocessed via the DistilBERT tokenizer (`max_length=256`); a loading sketch follows this list
- Perfectly balanced dataset
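Loading and tokenizing the corpus mirrors the details above. A minimal sketch using the `datasets` library, assuming the standard `ag_news` dataset on the Hugging Face Hub:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("ag_news")  # 120,000 train / 7,600 test examples
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate to 256 tokens; padding is deferred to a dynamic-padding collator
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)
```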
```bash
# Step 1: Clone the repository
git clone https://github.com/X-XENDROME-X/News-Classification-Transformer.git

# Step 2: Set up a virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Step 3: Install dependencies
pip install -r requirements.txt

# Step 4: Launch Jupyter
jupyter notebook News_Classification.ipynb
```

Inside the notebook:
- Environment setup
- Dataset loading & exploration
- Tokenization
- Model setup & training
- Evaluation
- Inference with real data
```python
from transformers import pipeline
import torch

# Load the fine-tuned model as a text-classification pipeline
classifier = pipeline(
    "text-classification",
    model="./models/best_news_classifier",
    device=0 if torch.cuda.is_available() else -1,
)

def classify_news(text):
    """Return the predicted category and its confidence score."""
    result = classifier(text)[0]
    return result["label"], result["score"]

news_text = "Apple reports record quarterly earnings with strong iPhone sales driving revenue growth"
category, confidence = classify_news(news_text)
print(f"Category: {category}")
print(f"Confidence: {confidence:.3f}")
```

| Metric | Step 3750 | Step 4250 | Status |
|---|---|---|---|
| Validation Accuracy | 92.92% | 92.46% | Excellent |
| Validation Loss | 0.2199 | 0.2346 | Controlled |
| F1-Score | 92.93% | 92.44% | Balanced |
| Overfitting Gap | - | 6.7% | Minimal |
- 92.92% peak accuracy
- Early stopping effective
- Efficient: only 4,250 training steps
- Balanced across all classes
- Production-ready pipeline
- Early stopping (patience=2)
- Dropout (0.3)
- Weight decay
- Cosine learning rate scheduler
- Mixed precision (FP16)
- Gradient accumulation
- Dynamic padding
- Best checkpoint saving (all combined in the training-configuration sketch after this list)
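These settings map directly onto the Hugging Face `Trainer` API. A minimal sketch, assuming the model, tokenizer, and `tokenized` dataset from the earlier snippets; the learning rate, batch size, and evaluation interval are illustrative, and argument names follow the Transformers 4.x API:

```python
from transformers import (
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
    DataCollatorWithPadding,
)

training_args = TrainingArguments(
    output_dir="./models/news_classifier",
    evaluation_strategy="steps",    # evaluate on a fixed step interval
    eval_steps=250,
    save_strategy="steps",
    save_steps=250,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # effective batch size of 32
    weight_decay=0.01,              # L2 regularization
    lr_scheduler_type="cosine",     # cosine learning-rate schedule
    fp16=True,                      # mixed-precision training
    load_best_model_at_end=True,    # restore the best checkpoint
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    compute_metrics=compute_metrics,  # defined in the evaluation sketch below
)
trainer.train()
```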
- Accuracy, Precision, Recall, F1 (see the `compute_metrics` sketch after this list)
- Confusion matrix
- Per-class analysis
- Real-world input validation
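The headline metrics are standard scikit-learn calls wrapped in a `compute_metrics` function for the `Trainer`; the confusion matrix and per-class analysis come from `sklearn.metrics.confusion_matrix` and `classification_report` on the same predictions. A minimal sketch, with weighted averaging as an assumption:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Accuracy, precision, recall, and F1 over the evaluation set."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```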
| Model | Accuracy | Params | Training Time | Overfitting |
|---|---|---|---|---|
| DistilBERT (ours) | 92.46% | 66M | 4,250 steps | Low |
| BERT-base | ~94% | 110M | ~8,000 steps | Medium |
| Traditional ML | ~85% | <1M | Fast | High |
| Simple CNN | ~88% | ~10M | Medium | High |
- Fork the repo
- Create a feature branch
- Commit changes
- Push to your fork
- Open a PR
- RoBERTa/ELECTRA integration
- Multilingual support
- Real-time API
- Quantization/pruning (a possible starting point is sketched below)
- Advanced metrics
- Deployment scripts (Docker, GCP, etc.)
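On the quantization item, PyTorch's dynamic quantization is one possible starting point: it converts the model's linear layers to int8 at load time with no retraining. A minimal sketch of the idea, not an implemented feature of this repo:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./models/best_news_classifier")

# Replace nn.Linear weights with int8 versions; activations stay in float
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```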
MIT License. See LICENSE for details.
