This project demonstrates fine-tuning a pretrained Transformer model (distilbert-base-uncased) on the IMDB movie reviews dataset for binary sentiment classification (positive / negative).
The goal of this repository is learning-oriented but industry-aligned:
- Use a real-world dataset (IMDB)
- Use Hugging Face Transformers + Datasets
- Follow the same fine-tuning workflow used in production ML teams
It covers:
- Loading a large NLP dataset using Hugging Face Datasets
- Tokenizing text with a pretrained BERT-style tokenizer
- Fine-tuning a pretrained DistilBERT model
- Training with PyTorch DataLoaders
- Evaluating accuracy on a held-out test set
- Running inference on custom sentences
- Base model: distilbert-base-uncased
- Task: Sequence Classification
- Labels:
  - 0 → Negative review
  - 1 → Positive review
The classification head is randomly initialized and then fine-tuned on IMDB.
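A minimal sketch of loading this tokenizer and model with the Transformers API (the `num_labels` value and label mapping below follow the description above; everything else is standard boilerplate):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# distilbert-base-uncased with a fresh 2-class head; the head weights are
# randomly initialized and learned during fine-tuning on IMDB.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)
```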
- Dataset: IMDB Movie Reviews
- Source: Hugging Face Datasets
- Size:
  - Train: 25,000 reviews
  - Test: 25,000 reviews
For faster experimentation, a subset of the dataset is used during training.
Each sample contains:
{
  "text": "movie review text",
  "label": 0 or 1
}
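For reference, a sketch of loading the dataset and carving out a smaller subset with Hugging Face Datasets (the subset sizes below are illustrative placeholders, not the exact values used in the notebook):

```python
from datasets import load_dataset

dataset = load_dataset("imdb")  # 25k train / 25k test reviews

# Shuffle and take a small, reproducible subset for faster experimentation.
small_train = dataset["train"].shuffle(seed=42).select(range(2000))
small_test = dataset["test"].shuffle(seed=42).select(range(500))

print(small_train[0])  # {'text': '...', 'label': 0 or 1}
```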
├── imdb_finetuning.ipynb # Main training notebook
├── README.md # Project documentation
The notebook-first approach allows easy debugging and experimentation. The logic can later be migrated to a standalone `.py` training script.
The notebook walks through the following steps:
- Load IMDB dataset
- Tokenize text (padding + truncation)
- Convert dataset to PyTorch tensors
- Create DataLoaders
- Fine-tune DistilBERT using AdamW
- Evaluate accuracy on test data
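A sketch of the tokenization and DataLoader steps, continuing from the `tokenizer` and `small_train` names used in the sketches above (batch size and max length are illustrative):

```python
from torch.utils.data import DataLoader

def tokenize_fn(batch):
    # Pad/truncate every review to a fixed length so batches stack cleanly.
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=256)

tokenized_train = small_train.map(tokenize_fn, batched=True)
tokenized_train = tokenized_train.remove_columns(["text"])   # keep only tensor-friendly columns
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_train.set_format("torch")                          # return PyTorch tensors

train_loader = DataLoader(tokenized_train, batch_size=16, shuffle=True)
```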
Loss is computed automatically by Hugging Face when labels are provided.
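Because each batch contains a `labels` key, `outputs.loss` is populated automatically; a bare-bones fine-tuning loop might look like this (learning rate and epoch count are illustrative):

```python
import torch
from torch.optim import AdamW

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(2):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # labels are in the batch, so outputs.loss is set
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```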
Install the dependencies and launch the notebook:
- `pip install torch transformers datasets tqdm`
- `jupyter notebook imdb_finetuning.ipynb`

If CUDA is available, the notebook automatically runs on GPU.
After training, the model can be tested on custom sentences:
texts = [
"This movie was absolutely amazing",
"I regret watching this film"
]

The model outputs a predicted sentiment label for each sentence.
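A sketch of running those sentences through the fine-tuned model, reusing the `tokenizer`, `model`, and `device` names from the sketches above:

```python
model.eval()
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

predictions = logits.argmax(dim=-1)
for text, pred in zip(texts, predictions):
    print(f"{text!r} -> {model.config.id2label[pred.item()]}")
```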
- Accuracy improves significantly after fine-tuning
- The model learns sentiment even though the base model was never trained on sentiment labels
(Exact accuracy depends on dataset subset size and number of epochs.)
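For reference, a minimal accuracy check over the test split (assuming a hypothetical `test_loader` built the same way as `train_loader` above):

```python
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        preds = model(**batch).logits.argmax(dim=-1)
        correct += (preds == batch["labels"]).sum().item()
        total += batch["labels"].size(0)

print(f"Test accuracy: {correct / total:.3f}")
```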
Possible extensions:
- Convert the notebook to a production-ready `.py` script
- Add a learning rate scheduler
- Freeze base model layers
- Save and reload the fine-tuned model
- Dockerize the training environment
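One of the extensions above, saving and reloading the fine-tuned model, is a short exercise with the standard Transformers API (the output directory name is illustrative):

```python
# Save the fine-tuned weights and tokenizer to a local directory.
model.save_pretrained("distilbert-imdb-finetuned")
tokenizer.save_pretrained("distilbert-imdb-finetuned")

# Later, reload them for inference or further training.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-imdb-finetuned")
tokenizer = AutoTokenizer.from_pretrained("distilbert-imdb-finetuned")
```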
References:
- Hugging Face Transformers
- Hugging Face Datasets
- DistilBERT: Smaller, Faster, Cheaper BERT
This project is meant to bridge the gap between theory (Transformers) and real-world ML workflows, using clean, minimal, and reproducible code.
