A GPT language model implementation built from scratch in PyTorch for educational purposes. This project demonstrates the complete pipeline of training a transformer-based language model, from data preprocessing to text generation.
- Custom GPT Architecture: Transformer-based model with configurable layers, heads, and embedding dimensions
- BPE Tokenization: Byte-Pair Encoding tokenizer training and validation
- Efficient Data Pipeline: Memory-mapped binary files for fast data loading
- Mixed Precision Training: Automatic mixed precision (AMP) support for faster training
- Flexible Text Generation: Support for top-k and top-p (nucleus) sampling
- Checkpoint Management: Automatic saving of best models and periodic checkpoints
- Learning Rate Scheduling: Warmup and cosine decay scheduling
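For reference, the warmup-plus-cosine schedule can be pictured as a simple function of the step count. This is a minimal sketch using the training defaults documented below (`lr=3e-4`, `warmup_steps=500`, `max_steps=1000`); the floor `min_lr` is an assumption, and the actual schedule lives in `model/train.py`:

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=500, max_steps=1000, min_lr=0.0):
    # Linear warmup to max_lr, then cosine decay toward min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```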
```
DumbDuck/
├── data/                        # Data processing scripts
│   ├── data.py                  # Download WikiText-103 dataset
│   ├── clean.py                 # Text cleaning and preprocessing
│   ├── shard.py                 # Split corpus into train/val shards
│   └── train_val_binarize.py    # Convert text to binary format
├── model/                       # Model implementation
│   ├── config.py                # Model configuration
│   ├── model.py                 # GPT architecture
│   ├── dataset.py               # Dataset loader
│   ├── train.py                 # Training script
│   ├── infer.py                 # Inference script
│   └── utils.py                 # Utility functions
├── tokenizer/                   # Tokenizer files
│   ├── tokenizer.py             # BPE tokenizer training
│   ├── validate_tokenizer.py    # Tokenizer validation
│   └── bpe_tokenizer.json       # Trained tokenizer
├── checkpoints/                 # Model checkpoints
└── data_shards/                 # Binary training data
```
- Python 3.8+
- PyTorch 2.0+
- CUDA (optional, for GPU training)
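To confirm the installed PyTorch build and whether CUDA is visible (an optional sanity check, not part of the project scripts):

```python
import torch

print(torch.__version__)          # should be 2.0 or newer
print(torch.cuda.is_available())  # True if a CUDA-capable GPU is usable
```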
- Clone the repository:

```bash
git clone https://github.com/divshekhar/DumbDuck.git
cd DumbDuck
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download the dataset (WikiText-103):

```bash
cd data
python data.py
```

- Clean the corpus:

```bash
python clean.py
```

- Create train/validation shards:

```bash
python shard.py
```

- Binarize the data:

```bash
python train_val_binarize.py
```
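During training the binarized shards are read back with memory mapping (the "Efficient Data Pipeline" feature above). As a rough, hypothetical illustration only — the actual shard format and dtype are defined in `data/train_val_binarize.py` and `model/dataset.py` — loading a batch of context windows from a shard might look like:

```python
import numpy as np
import torch

# Hypothetical sketch: assumes a shard is a flat array of uint16 token IDs,
# a common choice for vocabularies smaller than 65,536 tokens.
def load_batch(shard_path, block_size=1024, batch_size=2):
    data = np.memmap(shard_path, dtype=np.uint16, mode="r")  # no full read into RAM
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y  # inputs and next-token targets
```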
Train the BPE tokenizer on your corpus:

```bash
cd tokenizer
python tokenizer.py
```
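Conceptually, `tokenizer.py` trains a Byte-Pair Encoding tokenizer and writes it to `bpe_tokenizer.json`. A minimal sketch using the Hugging Face `tokenizers` library (the corpus path, special tokens, and pre-tokenizer here are assumptions, not the project's exact settings):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Minimal BPE training sketch; all settings are illustrative.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50257,                   # matches GPTConfig.vocab_size below
    special_tokens=["<unk>", "<eos>"],  # hypothetical special tokens
)
tokenizer.train(files=["../data/corpus_clean.txt"], trainer=trainer)  # path is illustrative
tokenizer.save("bpe_tokenizer.json")
```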
Validate the tokenizer:

```bash
python validate_tokenizer.py
```
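Validation typically amounts to checking that text survives an encode/decode round trip. A quick manual check with the saved tokenizer (not the script's exact logic):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("bpe_tokenizer.json")
text = "The quick brown fox jumps over the lazy dog."
ids = tok.encode(text).ids
print(len(ids), "tokens")
print(tok.decode(ids))  # for a lossless tokenizer this should reproduce the input text
```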
Train the model with default configuration:

```bash
cd model
python train.py
```

Training arguments:

- `--shards_dir`: Path to binary data shards (default: `../data_shards`)
- `--tokenizer`: Path to tokenizer file (default: `../tokenizer/bpe_tokenizer.json`)
- `--out_dir`: Output directory for checkpoints (default: `../checkpoints`)
- `--batch_size`: Batch size (default: 2)
- `--max_steps`: Maximum training steps (default: 1000)
- `--eval_interval`: Evaluation interval (default: 100)
- `--lr`: Learning rate (default: 3e-4)
- `--warmup_steps`: Warmup steps (default: 500)
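For example, a longer run with a larger batch (the flag values here are illustrative, not recommendations):

```bash
python train.py --batch_size 8 --max_steps 5000 --eval_interval 250 --lr 3e-4 --warmup_steps 500
```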
Generate text using a trained model:
```bash
cd model
python infer.py --prompt "Once upon a time" --max_new_tokens 128
```

Inference arguments:

- `--ckpt`: Path to checkpoint (default: `./checkpoints/final.pt`)
- `--tokenizer`: Path to tokenizer (default: `./tokenizer/bpe_tokenizer.json`)
- `--prompt`: Input prompt for generation
- `--max_new_tokens`: Number of tokens to generate (default: 128)
- `--temperature`: Sampling temperature (default: 0.8)
- `--top_k`: Top-k sampling parameter (default: 50)
- `--top_p`: Top-p (nucleus) sampling parameter (default: 0.95)
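For reference, this is roughly how temperature, top-k, and top-p interact when picking the next token. It is a generic sketch, not the project's exact implementation in `model/infer.py`:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    # logits: (vocab_size,) raw scores for the next token
    logits = logits / temperature

    # Top-k: keep only the k highest-scoring tokens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")

    # Top-p (nucleus): keep the smallest set of tokens whose
    # cumulative probability exceeds top_p
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cumprobs = torch.cumsum(probs, dim=-1)
        # mask tokens once the cumulative probability already passed top_p,
        # but always keep the most likely token
        mask = cumprobs - probs > top_p
        sorted_logits[mask] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```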
Default configuration (in `model/train.py`):

```python
GPTConfig(
    vocab_size=50257,   # GPT-2 vocabulary size
    block_size=1024,    # Context window size
    n_layer=4,          # Number of transformer layers
    n_head=4,           # Number of attention heads
    n_embd=128,         # Embedding dimension
    bias=True,          # Use bias in linear layers
    use_swiGLU=False,   # Use SwiGLU activation
    attn_pdrop=0.1,     # Attention dropout
    resid_pdrop=0.1,    # Residual dropout
    embd_pdrop=0.1,     # Embedding dropout
)
```

- Dataset: WikiText-103
- Optimizer: AdamW with weight decay
- Learning Rate Schedule: Linear warmup + cosine decay
- Mixed Precision: Enabled by default (AMP)
- Gradient Clipping: Max norm of 1.0
- Checkpointing: Saves best model and periodic checkpoints
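Put together, one optimizer step with AMP and gradient clipping looks roughly like the sketch below. It is illustrative only; the actual loop lives in `model/train.py`, and the assumption that the model returns `(logits, loss)` is mine:

```python
import torch

def train_step(model, optimizer, scaler, x, y, device="cuda"):
    # One illustrative step: AMP forward/backward, clip gradients at max-norm 1.0, AdamW update.
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
        logits, loss = model(x.to(device), y.to(device))  # assumed (logits, loss) return
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                            # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

# Usage sketch:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # weight decay value illustrative
# scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
```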
Sample output:

```bash
$ python infer.py --prompt "The future of artificial intelligence"
The future of artificial intelligence is a topic of great debate...
```

To modify the model architecture, edit the `GPTConfig` in `model/train.py`:
- Increase `n_layer` and `n_embd` for larger models
- Adjust `block_size` for longer context windows
- Enable `use_swiGLU` for the SwiGLU activation function
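For example, a somewhat larger configuration might look like this (the sizes are illustrative; memory use grows quickly with `n_layer`, `n_embd`, and `block_size`):

```python
GPTConfig(
    vocab_size=50257,
    block_size=1024,
    n_layer=8,          # more transformer layers
    n_head=8,           # more attention heads
    n_embd=512,         # wider embeddings (keep divisible by n_head)
    bias=True,
    use_swiGLU=True,    # switch to SwiGLU activation
    attn_pdrop=0.1,
    resid_pdrop=0.1,
    embd_pdrop=0.1,
)
```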
This project is open source and available under the MIT License.
- Inspired by Andrej Karpathy's nanoGPT
- Built on PyTorch and Hugging Face tokenizers
- Trained on WikiText-103 dataset
Contributions are welcome! Feel free to open issues or submit pull requests.