This project demonstrates fine-tuning a pretrained Transformer model (distilbert-base-uncased) on the IMDB movie reviews dataset for binary sentiment classification (positive / negative).
The goal of this repository is learning-oriented but industry-aligned:
- Use a real-world dataset (IMDB)
- Use Hugging Face Transformers + Datasets
- Follow the same fine-tuning workflow used in production ML teams
It covers:
- Loading a large NLP dataset using Hugging Face Datasets
- Tokenizing text with a pretrained BERT-style tokenizer
- Fine-tuning a pretrained DistilBERT model
- Training with PyTorch DataLoaders
- Evaluating accuracy on a held-out test set
- Running inference on custom sentences
- Base model: distilbert-base-uncased
- Task: Sequence Classification
- Labels:
  - 0 → Negative review
  - 1 → Positive review
The classification head is randomly initialized and then fine-tuned on IMDB.
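A minimal sketch of loading this tokenizer and model with the Transformers API (the `num_labels` value and label mapping below follow the description above; everything else is standard boilerplate):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# distilbert-base-uncased with a fresh 2-class head; the head weights are
# randomly initialized and learned during fine-tuning on IMDB.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)
```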
- Dataset: IMDB Movie Reviews
- Source: Hugging Face Datasets
- Size:
  - Train: 25,000 reviews
  - Test: 25,000 reviews
For faster experimentation, a subset of the dataset is used during training.
Each sample contains:
{
  "text": "movie review text",
  "label": 0 or 1
}
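For reference, a sketch of loading the dataset and carving out a smaller subset with Hugging Face Datasets (the subset sizes below are illustrative placeholders, not the exact values used in the notebook):

```python
from datasets import load_dataset

dataset = load_dataset("imdb")  # 25k train / 25k test reviews

# Shuffle and take a small, reproducible subset for faster experimentation.
small_train = dataset["train"].shuffle(seed=42).select(range(2000))
small_test = dataset["test"].shuffle(seed=42).select(range(500))

print(small_train[0])  # {'text': '...', 'label': 0 or 1}
```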
├── imdb_finetuning.ipynb # Main training notebook
├── README.md # Project documentation
The notebook-first approach allows easy debugging and experimentation. The logic can later be migrated to a standalone `.py` training script.
The notebook walks through the following steps:
- Load IMDB dataset
- Tokenize text (padding + truncation)
- Convert dataset to PyTorch tensors
- Create DataLoaders
- Fine-tune DistilBERT using AdamW
- Evaluate accuracy on test data
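A sketch of the tokenization and DataLoader steps, continuing from the `tokenizer` and `small_train` names used in the sketches above (batch size and max length are illustrative):

```python
from torch.utils.data import DataLoader

def tokenize_fn(batch):
    # Pad/truncate every review to a fixed length so batches stack cleanly.
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=256)

tokenized_train = small_train.map(tokenize_fn, batched=True)
tokenized_train = tokenized_train.remove_columns(["text"])   # keep only tensor-friendly columns
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_train.set_format("torch")                          # return PyTorch tensors

train_loader = DataLoader(tokenized_train, batch_size=16, shuffle=True)
```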
Loss is computed automatically by Hugging Face when labels are provided.
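Because each batch contains a `labels` key, `outputs.loss` is populated automatically; a bare-bones fine-tuning loop might look like this (learning rate and epoch count are illustrative):

```python
import torch
from torch.optim import AdamW

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(2):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # labels are in the batch, so outputs.loss is set
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```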
Install the dependencies and launch the notebook:
- `pip install torch transformers datasets tqdm`
- `jupyter notebook imdb_finetuning.ipynb`

If CUDA is available, the notebook automatically runs on GPU.
After training, the model can be tested on custom sentences:
texts = [
"This movie was absolutely amazing",
"I regret watching this film"
]

The model outputs a predicted sentiment label for each sentence.
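A sketch of running those sentences through the fine-tuned model, reusing the `tokenizer`, `model`, and `device` names from the sketches above:

```python
model.eval()
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

predictions = logits.argmax(dim=-1)
for text, pred in zip(texts, predictions):
    print(f"{text!r} -> {model.config.id2label[pred.item()]}")
```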
- Accuracy improves significantly after fine-tuning
- The model learns sentiment even though the base model was never trained on sentiment labels
(Exact accuracy depends on dataset subset size and number of epochs.)
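For reference, a minimal accuracy check over the test split (assuming a hypothetical `test_loader` built the same way as `train_loader` above):

```python
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        preds = model(**batch).logits.argmax(dim=-1)
        correct += (preds == batch["labels"]).sum().item()
        total += batch["labels"].size(0)

print(f"Test accuracy: {correct / total:.3f}")
```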
Possible extensions:
- Convert the notebook to a production-ready `.py` script
- Add a learning rate scheduler
- Freeze base model layers
- Save and reload the fine-tuned model
- Dockerize the training environment
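One of the extensions above, saving and reloading the fine-tuned model, is a short exercise with the standard Transformers API (the output directory name is illustrative):

```python
# Save the fine-tuned weights and tokenizer to a local directory.
model.save_pretrained("distilbert-imdb-finetuned")
tokenizer.save_pretrained("distilbert-imdb-finetuned")

# Later, reload them for inference or further training.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-imdb-finetuned")
tokenizer = AutoTokenizer.from_pretrained("distilbert-imdb-finetuned")
```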
References:
- Hugging Face Transformers
- Hugging Face Datasets
- DistilBERT: Smaller, Faster, Cheaper BERT
This project is meant to bridge the gap between theory (Transformers) and real-world ML workflows, using clean, minimal, and reproducible code.
