A machine learning project that detects whether a news article is fake or real based on its content.
Built in Google Colab using scikit-learn and custom text preprocessing.
This project applies Natural Language Processing (NLP) and Machine Learning to classify news articles as FAKE or REAL.
It includes:
- Regex-based text cleaning
- Vectorization with TF-IDF
- Models: Logistic Regression, Random Forest, Gradient Boosting
- Evaluation using accuracy, precision, recall, F1-score
The dataset used was the Fake and Real News Dataset available on Kaggle.
Columns used: `text` (main content) and `label` (REAL or FAKE).
Other columns such as `title`, `subject`, and `date` were dropped.
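The column selection described above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the rows below are invented stand-ins for the Kaggle data, and only the column names come from this README.

```python
import pandas as pd

# Toy stand-in for the dataset; the rows are invented for illustration.
df = pd.DataFrame({
    "title": ["Headline A", "Headline B"],
    "text": ["Body of article A.", "Body of article B."],
    "subject": ["politics", "world"],
    "date": ["2017-01-01", "2017-01-02"],
    "label": ["FAKE", "REAL"],
})

# Keep only the article body and its label; drop the unused columns.
df = df.drop(columns=["title", "subject", "date"])

print(list(df.columns))
```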
Component | Description |
---|---|
Text Cleaning | Removed URLs, HTML tags, punctuation, digits using regex |
Vectorization | TF-IDF with `TfidfVectorizer` from sklearn |
Model Training | Trained 3 classifiers (LR, RFC, GBC) on the processed data |
Manual Testing | Custom input function to test any text against trained models |
Evaluation | `classification_report` used to show performance metrics |
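The components in the table could fit together roughly as follows. This is a sketch under assumptions: `clean_text`, the exact regex patterns, the hyperparameters, and the six-sentence toy corpus are all hypothetical, and the models are scored on their own training data only to keep the example short (the real project would use a held-out split).

```python
import re

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


def clean_text(text: str) -> str:
    """Hypothetical cleaner: strip URLs, HTML tags, punctuation, digits."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"[^\w\s]|_", " ", text)              # punctuation
    text = re.sub(r"\d+", " ", text)                    # digits
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace


# Invented toy corpus, for illustration only.
texts = [
    "Senate passes the annual budget bill after a long debate.",
    "Central bank holds interest rates steady this quarter.",
    "City council approves funding for new public library.",
    "SHOCKING: aliens secretly control the weather, click http://example.com",
    "Miracle pill cures every disease overnight, doctors furious!!!",
    "You won't believe what this celebrity did to become immortal",
]
labels = ["REAL", "REAL", "REAL", "FAKE", "FAKE", "FAKE"]

# Fit the vectorizer once on the cleaned training text.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform([clean_text(t) for t in texts])

models = {
    "LR": LogisticRegression(),
    "RFC": RandomForestClassifier(n_estimators=20, random_state=0),
    "GBC": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

for name, model in models.items():
    model.fit(features, labels)
    predictions = model.predict(features)
    print(name)
    print(classification_report(labels, predictions, zero_division=0))
```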
- Open the project in Google Colab
- Run all cells in order (use "Runtime > Run all")
- Use the `manual_testing()` function to test your own headlines or articles
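
The notebook's `manual_testing()` is not reproduced in this README, so the following is only a guess at its general shape. The signature, the single Logistic Regression model, and the toy training sentences are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy training data, standing in for the processed dataset.
train_texts = [
    "parliament votes on the new trade agreement",
    "government publishes quarterly employment figures",
    "secret cure hidden by doctors will shock you",
    "aliens built the pyramids overnight says insider",
]
train_labels = ["REAL", "REAL", "FAKE", "FAKE"]

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_texts), train_labels)


def manual_testing(news: str) -> str:
    """Hypothetical helper: classify one piece of text as FAKE or REAL."""
    # Use transform (not fit_transform) so the features match training.
    features = vectorizer.transform([news])
    return str(model.predict(features)[0])


print(manual_testing("government votes on employment figures"))
```

In the project itself, the same function would presumably loop over all three trained classifiers and print each prediction.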
- Python
- pandas
- scikit-learn
- re (Regex for text cleaning)
- Google Colab
- A small core of the pipeline (cleaning, vectorization, and model) accounts for most of the classification performance
- Fitting the vectorizer once and reusing it for every later transform avoids feature-dimension mismatch errors between training and prediction
- Restarting the runtime and rerunning the cells in order resolves most runtime issues
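
The vectorizer point can be made concrete with a small shape check. This is an illustrative sketch with invented sentences: a vectorizer fitted once produces the same number of columns for new text, while refitting on the new text produces a different vocabulary size that a trained model would reject.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example sentences.
train_texts = [
    "the senate passed the budget bill",
    "aliens built the pyramids overnight",
]
new_texts = ["the budget vote was delayed"]

# Fit once on the training text, then only transform afterwards.
fitted = TfidfVectorizer().fit(train_texts)
X_train = fitted.transform(train_texts)
X_new = fitted.transform(new_texts)

# Refitting on new text builds a different vocabulary (the mistake to avoid).
X_refit = TfidfVectorizer().fit_transform(new_texts)

print("train columns:", X_train.shape[1])
print("reused vectorizer columns:", X_new.shape[1])
print("refit vectorizer columns:", X_refit.shape[1])
```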