This repository contains a PyTorch implementation of smaller versions of the Vision Transformer (ViT) model introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". While the original paper focuses on the Base, Large, and Huge architectures, this implementation explores lightweight variants (tiny and small) suited to smaller datasets and more limited computational resources.
The main objectives of this project are to:
- Implement a clean, understandable Vision Transformer from scratch
- Create lightweight variants of the original architecture
- Train and evaluate these models on CIFAR-10 and MNIST
- Provide a learning resource for understanding transformer architectures in computer vision
This implementation includes two compact variants of the original ViT:

Tiny variant:

```python
{
    'patch_size': 4,
    'hidden_dim': 192,
    'fc_dim': 768,
    'num_heads': 3,
    'num_blocks': 12,
    'num_classes': 10
}
```
Small variant:

```python
{
    'patch_size': 4,
    'hidden_dim': 384,
    'fc_dim': 1536,
    'num_heads': 6,
    'num_blocks': 7,
    'num_classes': 10
}
```
For comparison, the original ViT-Base configuration:

```python
{
    'patch_size': 16,
    'hidden_dim': 768,
    'fc_dim': 3072,
    'num_heads': 12,
    'num_blocks': 12,
    'num_classes': 1000
}
```
```bash
# Clone the repository
git clone https://github.com/ilyasoulk/mini-vit.git
cd mini-vit

# Install requirements
pip install -r requirements.txt
```
Train the model:

```bash
python src/train.py
```
Performance on MNIST:
- Test Accuracy: 98%
- Number of Parameters: 11 million
- Training Time: 2 epochs on an M2 Mac
Performance on CIFAR-10:
- Test Accuracy: 85%
- Number of Parameters: 11 million
- Training Time: 50 epochs on an NVIDIA T4
- Clean Implementation: Each component of the Vision Transformer is implemented with clear, documented code:
  - Patch Embedding
  - Multi-Head Self-Attention
  - MLP Block
  - Position Embeddings
- Modifications for Smaller Scale:
  - Smaller patch size (4x4 instead of 16x16)
  - Reduced model dimensions
  - Fewer attention heads
  - Fewer transformer blocks
- Training Optimizations (a sketch of this setup follows the list):
  - AdamW optimizer
  - Learning rate scheduling
  - Data augmentation for CIFAR-10
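A minimal sketch of such a training setup is shown below. The optimizer hyperparameters, schedule, and augmentation values are illustrative assumptions, not necessarily what `src/train.py` actually uses.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# Illustrative CIFAR-10 augmentation pipeline (standard crop/flip/normalize).
cifar10_train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

def build_optimizer(model: torch.nn.Module, epochs: int = 50):
    """AdamW with a cosine learning-rate schedule (placeholder hyperparameters)."""
    optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```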
```
mini-vit/
├── src/
│   ├── model.py          # ViT model implementation
│   └── train.py          # Training script
├── requirements.txt
└── README.md
```
The implementation includes the following key components (a minimal sketch is given after the list):
- Patch Embedding:
  - Divides input images into 4x4 patches
  - Projects patches to the embedding dimension
  - Adds learnable position embeddings
- Transformer Encoder:
  - Multi-head self-attention mechanism
  - Layer normalization
  - MLP block with GELU activation
  - Residual connections
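Below is a minimal, self-contained sketch of these two components. It is written from the description above rather than copied from `src/model.py`, so class names and details are assumptions; in particular, patching is done here with a strided convolution (equivalent to slicing non-overlapping patches and applying a shared linear projection), and the class token and dropout are omitted.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into patches, project them, and add position embeddings."""

    def __init__(self, image_size=32, patch_size=4, in_channels=3, hidden_dim=192):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Strided conv = non-overlapping patches + shared linear projection.
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden_dim))

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, N, D)
        return x + self.pos_embed


class EncoderBlock(nn.Module):
    """Pre-norm transformer block: MHSA and MLP, each wrapped in a residual."""

    def __init__(self, hidden_dim=192, num_heads=3, fc_dim=768):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, fc_dim),
            nn.GELU(),
            nn.Linear(fc_dim, hidden_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual connection 1
        x = x + self.mlp(self.norm2(x))                     # residual connection 2
        return x


# Example: 4x4 patches on 32x32 CIFAR-10 images give 8x8 = 64 tokens per image.
tokens = PatchEmbedding()(torch.randn(8, 3, 32, 32))   # (8, 64, 192)
out = EncoderBlock()(tokens)                            # (8, 64, 192)
```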
- Framework: PyTorch
- Dataset: CIFAR-10
- Hardware: NVIDIA T4
- Training Time: 50 epochs
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (https://arxiv.org/abs/2010.11929)
- Original ViT implementation: google-research/vision_transformer (https://github.com/google-research/vision_transformer)
Give a ⭐️ if this project helped you understand Vision Transformers!