This repository contains a PyTorch implementation of smaller versions of the Vision Transformer (ViT) model introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". While the original paper focuses on the Base, Large, and Huge architectures, this implementation explores lightweight variants (tiny and small) suited to smaller datasets and more limited computational resources.
The main objectives of this project are to:
- Implement a clean, understandable Vision Transformer from scratch
- Create lightweight variants of the original architecture
- Train and evaluate these models on CIFAR-10 and MNIST
- Provide a learning resource for understanding transformer architectures in computer vision
This implementation includes two compact variants of the original ViT:

Tiny variant:

```python
{
    'patch_size': 4,
    'hidden_dim': 192,
    'fc_dim': 768,
    'num_heads': 3,
    'num_blocks': 12,
    'num_classes': 10
}
```
Small variant:

```python
{
    'patch_size': 4,
    'hidden_dim': 384,
    'fc_dim': 1536,
    'num_heads': 6,
    'num_blocks': 7,
    'num_classes': 10
}
```
For comparison, the original ViT-Base configuration:

```python
{
    'patch_size': 16,
    'hidden_dim': 768,
    'fc_dim': 3072,
    'num_heads': 12,
    'num_blocks': 12,
    'num_classes': 1000
}
```
```bash
# Clone the repository
git clone https://github.com/ilyasoulk/mini-vit.git
cd mini-vit

# Install requirements
pip install -r requirements.txt
```
Train the model:

```bash
python src/train.py
```
Performance on MNIST:
- Test Accuracy: 98%
- Number of Parameters: 11 million
- Training Time: 2 epochs on an M2 Mac
Performance on CIFAR-10:
- Test Accuracy: 85%
- Number of Parameters: 11 million
- Training Time: 50 epochs on an NVIDIA T4
- Clean Implementation: Each component of the Vision Transformer is implemented with clear, documented code:
  - Patch Embedding
  - Multi-Head Self-Attention
  - MLP Block
  - Position Embeddings
- Modifications for Smaller Scale:
  - Smaller patch size (4x4 instead of 16x16)
  - Reduced model dimensions
  - Fewer attention heads
  - Fewer transformer blocks
- Training Optimizations (a sketch of this setup follows the list):
  - AdamW optimizer
  - Learning rate scheduling
  - Data augmentation for CIFAR-10
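A minimal sketch of such a training setup is shown below. The optimizer hyperparameters, schedule, and augmentation values are illustrative assumptions, not necessarily what `src/train.py` actually uses.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# Illustrative CIFAR-10 augmentation pipeline (standard crop/flip/normalize).
cifar10_train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

def build_optimizer(model: torch.nn.Module, epochs: int = 50):
    """AdamW with a cosine learning-rate schedule (placeholder hyperparameters)."""
    optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```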
```
mini-vit/
├── src/
│   ├── model.py          # ViT model implementation
│   └── train.py          # Training script
├── requirements.txt
└── README.md
```
The implementation includes the following key components (a minimal sketch is given after the list):
- Patch Embedding:
  - Divides input images into 4x4 patches
  - Projects patches to the embedding dimension
  - Adds learnable position embeddings
- Transformer Encoder:
  - Multi-head self-attention mechanism
  - Layer normalization
  - MLP block with GELU activation
  - Residual connections
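Below is a minimal, self-contained sketch of these two components. It is written from the description above rather than copied from `src/model.py`, so class names and details are assumptions; in particular, patching is done here with a strided convolution (equivalent to slicing non-overlapping patches and applying a shared linear projection), and the class token and dropout are omitted.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into patches, project them, and add position embeddings."""

    def __init__(self, image_size=32, patch_size=4, in_channels=3, hidden_dim=192):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Strided conv = non-overlapping patches + shared linear projection.
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden_dim))

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, N, D)
        return x + self.pos_embed


class EncoderBlock(nn.Module):
    """Pre-norm transformer block: MHSA and MLP, each wrapped in a residual."""

    def __init__(self, hidden_dim=192, num_heads=3, fc_dim=768):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, fc_dim),
            nn.GELU(),
            nn.Linear(fc_dim, hidden_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual connection 1
        x = x + self.mlp(self.norm2(x))                     # residual connection 2
        return x


# Example: 4x4 patches on 32x32 CIFAR-10 images give 8x8 = 64 tokens per image.
tokens = PatchEmbedding()(torch.randn(8, 3, 32, 32))   # (8, 64, 192)
out = EncoderBlock()(tokens)                            # (8, 64, 192)
```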
- Framework: PyTorch
- Dataset: CIFAR-10
- Hardware: NVIDIA T4
- Training Time: 50 epochs
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (https://arxiv.org/abs/2010.11929)
- Original ViT implementation: google-research/vision_transformer (https://github.com/google-research/vision_transformer)
Give a ⭐️ if this project helped you understand Vision Transformers!