
DumbDuck 🦆

A GPT language model implementation built from scratch in PyTorch for educational purposes. This project demonstrates the complete pipeline of training a transformer-based language model, from data preprocessing to text generation.

🌟 Features

  • Custom GPT Architecture: Transformer-based model with configurable layers, heads, and embedding dimensions
  • BPE Tokenization: Byte-Pair Encoding tokenizer training and validation
  • Efficient Data Pipeline: Memory-mapped binary files for fast data loading
  • Mixed Precision Training: Automatic mixed precision (AMP) support for faster training
  • Flexible Text Generation: Support for top-k and top-p (nucleus) sampling
  • Checkpoint Management: Automatic saving of best models and periodic checkpoints
  • Learning Rate Scheduling: Warmup and cosine decay scheduling

📁 Project Structure

DumbDuck/
├── data/                      # Data processing scripts
│   ├── data.py               # Download WikiText-103 dataset
│   ├── clean.py              # Text cleaning and preprocessing
│   ├── shard.py              # Split corpus into train/val shards
│   └── train_val_binarize.py # Convert text to binary format
├── model/                     # Model implementation
│   ├── config.py             # Model configuration
│   ├── model.py              # GPT architecture
│   ├── dataset.py            # Dataset loader
│   ├── train.py              # Training script
│   ├── infer.py              # Inference script
│   └── utils.py              # Utility functions
├── tokenizer/                 # Tokenizer files
│   ├── tokenizer.py          # BPE tokenizer training
│   ├── validate_tokenizer.py # Tokenizer validation
│   └── bpe_tokenizer.json    # Trained tokenizer
├── checkpoints/               # Model checkpoints
└── data_shards/              # Binary training data

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • PyTorch 2.0+
  • CUDA (optional, for GPU training)

Installation

  1. Clone the repository:
git clone https://github.com/divshekhar/DumbDuck.git
cd DumbDuck
  2. Install dependencies:
pip install -r requirements.txt

Data Preparation

  1. Download the dataset (WikiText-103):
cd data
python data.py
  2. Clean the corpus:
python clean.py
  3. Create train/validation shards:
python shard.py
  4. Binarize the data:
python train_val_binarize.py
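
The binarized shards are flat arrays of token IDs, so they can be memory-mapped and sliced without loading the whole corpus into RAM. A minimal sketch of reading one back (the file name and uint16 dtype are assumptions; check train_val_binarize.py for the actual format):

import numpy as np

# Memory-map a shard of token IDs (path and dtype are assumptions).
tokens = np.memmap("../data_shards/train.bin", dtype=np.uint16, mode="r")
print(f"train split: {len(tokens):,} tokens")

# Slice one context window without reading the whole file.
block_size = 1024
start = np.random.randint(0, len(tokens) - block_size - 1)
x = tokens[start : start + block_size].astype(np.int64)          # inputs
y = tokens[start + 1 : start + 1 + block_size].astype(np.int64)  # next-token targets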

Tokenizer Training

Train the BPE tokenizer on your corpus:

cd tokenizer
python tokenizer.py

Validate the tokenizer:

python validate_tokenizer.py
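
For reference, a minimal sketch of byte-level BPE training with the Hugging Face tokenizers library (the corpus path, vocabulary size, special tokens, and pre-tokenizer choice are assumptions; tokenizer.py is the source of truth):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Train a byte-level BPE tokenizer on the cleaned corpus (paths are assumptions).
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()
trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<unk>", "<eos>"],
    initial_alphabet=ByteLevel.alphabet(),   # keep every byte reachable
)
tokenizer.train(files=["../data/clean_corpus.txt"], trainer=trainer)
tokenizer.save("bpe_tokenizer.json")

# Quick round-trip check, similar in spirit to validate_tokenizer.py.
ids = tokenizer.encode("The quick brown fox").ids
print(tokenizer.decode(ids))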

Training

Train the model with default configuration:

cd model
python train.py

Training arguments:

  • --shards_dir: Path to binary data shards (default: ../data_shards)
  • --tokenizer: Path to tokenizer file (default: ../tokenizer/bpe_tokenizer.json)
  • --out_dir: Output directory for checkpoints (default: ../checkpoints)
  • --batch_size: Batch size (default: 2)
  • --max_steps: Maximum training steps (default: 1000)
  • --eval_interval: Evaluation interval (default: 100)
  • --lr: Learning rate (default: 3e-4)
  • --warmup_steps: Warmup steps (default: 500)
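
For example, a longer run that overrides several of these defaults (values are illustrative only):

python train.py --batch_size 8 --max_steps 20000 --eval_interval 500 --warmup_steps 1000 --lr 3e-4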

Inference

Generate text using a trained model:

cd model
python infer.py --prompt "Once upon a time" --max_new_tokens 128

Inference arguments:

  • --ckpt: Path to checkpoint (default: ./checkpoints/final.pt)
  • --tokenizer: Path to tokenizer (default: ./tokenizer/bpe_tokenizer.json)
  • --prompt: Input prompt for generation
  • --max_new_tokens: Number of tokens to generate (default: 128)
  • --temperature: Sampling temperature (default: 0.8)
  • --top_k: Top-k sampling parameter (default: 50)
  • --top_p: Top-p (nucleus) sampling parameter (default: 0.95)
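
Temperature, top-k, and top-p all act on the next-token logits before sampling. A minimal sketch of that filtering step in PyTorch (the actual logic in infer.py may differ in its details):

import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    # logits: (batch, vocab_size) scores for the next token.
    logits = logits / temperature
    if top_k is not None:
        # Keep only the top_k highest-scoring tokens.
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        # Nucleus sampling: keep the smallest set of tokens whose cumulative
        # probability exceeds top_p (the most likely token always survives).
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumulative = torch.cumsum(probs, dim=-1)
        remove = cumulative - probs > top_p
        sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)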

⚙️ Model Configuration

Default configuration (in model/train.py):

GPTConfig(
    vocab_size=50257,      # GPT-2 vocabulary size
    block_size=1024,       # Context window size
    n_layer=4,             # Number of transformer layers
    n_head=4,              # Number of attention heads
    n_embd=128,            # Embedding dimension
    bias=True,             # Use bias in linear layers
    use_swiGLU=False,      # Use SwiGLU activation
    attn_pdrop=0.1,        # Attention dropout
    resid_pdrop=0.1,       # Residual dropout
    embd_pdrop=0.1,        # Embedding dropout
)
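
With these defaults the model is tiny: roughly 0.8M transformer parameters plus about 6.6M embedding parameters. A back-of-the-envelope estimate (it assumes a standard GPT block with a 4x MLP and a weight-tied LM head, and ignores biases and LayerNorms):

# Rough parameter count for the default config (an estimate, not a measurement).
d, L, V, T = 128, 4, 50257, 1024             # n_embd, n_layer, vocab_size, block_size
per_block = 4 * d * d + 8 * d * d            # attention (QKV + proj) + 4x MLP
blocks = L * per_block                       # ≈ 0.79M
embeddings = V * d + T * d                   # token + position tables ≈ 6.56M
print(f"~{(blocks + embeddings) / 1e6:.1f}M parameters")   # ≈ 7.4M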

📊 Training Details

  • Dataset: WikiText-103
  • Optimizer: AdamW with weight decay
  • Learning Rate Schedule: Linear warmup + cosine decay
  • Mixed Precision: Enabled by default (AMP)
  • Gradient Clipping: Max norm of 1.0
  • Checkpointing: Saves best model and periodic checkpoints
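
A sketch of how the schedule, AMP, and gradient clipping fit together in one training step (illustrative only; it assumes the model returns (logits, loss) given inputs and targets and that scaler is a torch.cuda.amp.GradScaler; train.py is the authoritative implementation):

import math
import torch

def get_lr(step, max_lr=3e-4, warmup_steps=500, max_steps=1000, min_lr=0.0):
    # Linear warmup followed by cosine decay to min_lr (min_lr is an assumption).
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def train_step(model, optimizer, scaler, x, y, step):
    for group in optimizer.param_groups:
        group["lr"] = get_lr(step)
    # Forward pass in mixed precision.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        _, loss = model(x, y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()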

🎯 Example Output

$ python infer.py --prompt "The future of artificial intelligence"
The future of artificial intelligence is a topic of great debate...

🛠️ Customization

To modify the model architecture, edit the GPTConfig in model/train.py:

  • Increase n_layer and n_embd for larger models
  • Adjust block_size for longer context windows
  • Enable use_swiGLU for SwiGLU activation function
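
For example, a noticeably larger variant could look like the following (illustrative values, not tested settings; the import assumes GPTConfig lives in model/config.py as listed in the project structure):

from config import GPTConfig   # run from inside model/

config = GPTConfig(
    vocab_size=50257,
    block_size=2048,      # longer context window
    n_layer=12,           # deeper model
    n_head=12,
    n_embd=768,           # wider embeddings
    bias=True,
    use_swiGLU=True,      # switch on the SwiGLU activation
    attn_pdrop=0.1,
    resid_pdrop=0.1,
    embd_pdrop=0.1,
)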

📝 License

This project is open source and available under the MIT License.

🙏 Acknowledgments

  • Inspired by Andrej Karpathy's nanoGPT
  • Built on PyTorch and Hugging Face tokenizers
  • Trained on WikiText-103 dataset

🤝 Contributing

Contributions are welcome! Feel free to open issues or submit pull requests.
