A GPT language model implementation built from scratch in PyTorch for educational purposes. This project demonstrates the complete pipeline of training a transformer-based language model, from data preprocessing to text generation.
- Custom GPT Architecture: Transformer-based model with configurable layers, heads, and embedding dimensions
- BPE Tokenization: Byte-Pair Encoding tokenizer training and validation
- Efficient Data Pipeline: Memory-mapped binary files for fast data loading
- Mixed Precision Training: Automatic mixed precision (AMP) support for faster training
- Flexible Text Generation: Support for top-k and top-p (nucleus) sampling
- Checkpoint Management: Automatic saving of best models and periodic checkpoints
- Learning Rate Scheduling: Warmup and cosine decay scheduling
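For reference, the warmup-plus-cosine schedule can be pictured as a simple function of the step count. This is a minimal sketch using the training defaults documented below (`lr=3e-4`, `warmup_steps=500`, `max_steps=1000`); the floor `min_lr` is an assumption, and the actual schedule lives in `model/train.py`:

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=500, max_steps=1000, min_lr=0.0):
    # Linear warmup to max_lr, then cosine decay toward min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```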
```
DumbDuck/
├── data/                        # Data processing scripts
│   ├── data.py                  # Download WikiText-103 dataset
│   ├── clean.py                 # Text cleaning and preprocessing
│   ├── shard.py                 # Split corpus into train/val shards
│   └── train_val_binarize.py    # Convert text to binary format
├── model/                       # Model implementation
│   ├── config.py                # Model configuration
│   ├── model.py                 # GPT architecture
│   ├── dataset.py               # Dataset loader
│   ├── train.py                 # Training script
│   ├── infer.py                 # Inference script
│   └── utils.py                 # Utility functions
├── tokenizer/                   # Tokenizer files
│   ├── tokenizer.py             # BPE tokenizer training
│   ├── validate_tokenizer.py    # Tokenizer validation
│   └── bpe_tokenizer.json       # Trained tokenizer
├── checkpoints/                 # Model checkpoints
└── data_shards/                 # Binary training data
```
- Python 3.8+
- PyTorch 2.0+
- CUDA (optional, for GPU training)
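To confirm the installed PyTorch build and whether CUDA is visible (an optional sanity check, not part of the project scripts):

```python
import torch

print(torch.__version__)          # should be 2.0 or newer
print(torch.cuda.is_available())  # True if a CUDA-capable GPU is usable
```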
- Clone the repository:

```bash
git clone https://github.com/divshekhar/DumbDuck.git
cd DumbDuck
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download the dataset (WikiText-103):

```bash
cd data
python data.py
```

- Clean the corpus:

```bash
python clean.py
```

- Create train/validation shards:

```bash
python shard.py
```

- Binarize the data:

```bash
python train_val_binarize.py
```
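During training the binarized shards are read back with memory mapping (the "Efficient Data Pipeline" feature above). As a rough, hypothetical illustration only — the actual shard format and dtype are defined in `data/train_val_binarize.py` and `model/dataset.py` — loading a batch of context windows from a shard might look like:

```python
import numpy as np
import torch

# Hypothetical sketch: assumes a shard is a flat array of uint16 token IDs,
# a common choice for vocabularies smaller than 65,536 tokens.
def load_batch(shard_path, block_size=1024, batch_size=2):
    data = np.memmap(shard_path, dtype=np.uint16, mode="r")  # no full read into RAM
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y  # inputs and next-token targets
```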
Train the BPE tokenizer on your corpus:

```bash
cd tokenizer
python tokenizer.py
```
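Conceptually, `tokenizer.py` trains a Byte-Pair Encoding tokenizer and writes it to `bpe_tokenizer.json`. A minimal sketch using the Hugging Face `tokenizers` library (the corpus path, special tokens, and pre-tokenizer here are assumptions, not the project's exact settings):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Minimal BPE training sketch; all settings are illustrative.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50257,                   # matches GPTConfig.vocab_size below
    special_tokens=["<unk>", "<eos>"],  # hypothetical special tokens
)
tokenizer.train(files=["../data/corpus_clean.txt"], trainer=trainer)  # path is illustrative
tokenizer.save("bpe_tokenizer.json")
```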
Validate the tokenizer:

```bash
python validate_tokenizer.py
```
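Validation typically amounts to checking that text survives an encode/decode round trip. A quick manual check with the saved tokenizer (not the script's exact logic):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("bpe_tokenizer.json")
text = "The quick brown fox jumps over the lazy dog."
ids = tok.encode(text).ids
print(len(ids), "tokens")
print(tok.decode(ids))  # for a lossless tokenizer this should reproduce the input text
```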
Train the model with default configuration:

```bash
cd model
python train.py
```

Training arguments:

- `--shards_dir`: Path to binary data shards (default: `../data_shards`)
- `--tokenizer`: Path to tokenizer file (default: `../tokenizer/bpe_tokenizer.json`)
- `--out_dir`: Output directory for checkpoints (default: `../checkpoints`)
- `--batch_size`: Batch size (default: 2)
- `--max_steps`: Maximum training steps (default: 1000)
- `--eval_interval`: Evaluation interval (default: 100)
- `--lr`: Learning rate (default: 3e-4)
- `--warmup_steps`: Warmup steps (default: 500)
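For example, a longer run with a larger batch (the flag values here are illustrative, not recommendations):

```bash
python train.py --batch_size 8 --max_steps 5000 --eval_interval 250 --lr 3e-4 --warmup_steps 500
```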
Generate text using a trained model:
```bash
cd model
python infer.py --prompt "Once upon a time" --max_new_tokens 128
```

Inference arguments:

- `--ckpt`: Path to checkpoint (default: `./checkpoints/final.pt`)
- `--tokenizer`: Path to tokenizer (default: `./tokenizer/bpe_tokenizer.json`)
- `--prompt`: Input prompt for generation
- `--max_new_tokens`: Number of tokens to generate (default: 128)
- `--temperature`: Sampling temperature (default: 0.8)
- `--top_k`: Top-k sampling parameter (default: 50)
- `--top_p`: Top-p (nucleus) sampling parameter (default: 0.95)
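For reference, this is roughly how temperature, top-k, and top-p interact when picking the next token. It is a generic sketch, not the project's exact implementation in `model/infer.py`:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    # logits: (vocab_size,) raw scores for the next token
    logits = logits / temperature

    # Top-k: keep only the k highest-scoring tokens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")

    # Top-p (nucleus): keep the smallest set of tokens whose
    # cumulative probability exceeds top_p
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cumprobs = torch.cumsum(probs, dim=-1)
        # mask tokens once the cumulative probability already passed top_p,
        # but always keep the most likely token
        mask = cumprobs - probs > top_p
        sorted_logits[mask] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```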
Default configuration (in `model/train.py`):

```python
GPTConfig(
    vocab_size=50257,   # GPT-2 vocabulary size
    block_size=1024,    # Context window size
    n_layer=4,          # Number of transformer layers
    n_head=4,           # Number of attention heads
    n_embd=128,         # Embedding dimension
    bias=True,          # Use bias in linear layers
    use_swiGLU=False,   # Use SwiGLU activation
    attn_pdrop=0.1,     # Attention dropout
    resid_pdrop=0.1,    # Residual dropout
    embd_pdrop=0.1,     # Embedding dropout
)
```

- Dataset: WikiText-103
- Optimizer: AdamW with weight decay
- Learning Rate Schedule: Linear warmup + cosine decay
- Mixed Precision: Enabled by default (AMP)
- Gradient Clipping: Max norm of 1.0
- Checkpointing: Saves best model and periodic checkpoints
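Put together, one optimizer step with AMP and gradient clipping looks roughly like the sketch below. It is illustrative only; the actual loop lives in `model/train.py`, and the assumption that the model returns `(logits, loss)` is mine:

```python
import torch

def train_step(model, optimizer, scaler, x, y, device="cuda"):
    # One illustrative step: AMP forward/backward, clip gradients at max-norm 1.0, AdamW update.
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
        logits, loss = model(x.to(device), y.to(device))  # assumed (logits, loss) return
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                            # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

# Usage sketch:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # weight decay value illustrative
# scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
```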
Sample output:

```bash
$ python infer.py --prompt "The future of artificial intelligence"
The future of artificial intelligence is a topic of great debate...
```

To modify the model architecture, edit the `GPTConfig` in `model/train.py`:
- Increase `n_layer` and `n_embd` for larger models
- Adjust `block_size` for longer context windows
- Enable `use_swiGLU` for the SwiGLU activation function
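For example, a somewhat larger configuration might look like this (the sizes are illustrative; memory use grows quickly with `n_layer`, `n_embd`, and `block_size`):

```python
GPTConfig(
    vocab_size=50257,
    block_size=1024,
    n_layer=8,          # more transformer layers
    n_head=8,           # more attention heads
    n_embd=512,         # wider embeddings (keep divisible by n_head)
    bias=True,
    use_swiGLU=True,    # switch to SwiGLU activation
    attn_pdrop=0.1,
    resid_pdrop=0.1,
    embd_pdrop=0.1,
)
```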
This project is open source and available under the MIT License.
- Inspired by Andrej Karpathy's nanoGPT
- Built on PyTorch and Hugging Face tokenizers
- Trained on WikiText-103 dataset
Contributions are welcome! Feel free to open issues or submit pull requests.