Jojo: Building an LLM from Scratch

This project aims to build a large language model (LLM) from scratch, inspired by the GPT-2 architecture. We will begin training on the TinyStories dataset and gradually expand to include LLM-generated educational content. Over time, the model will be enhanced to better understand language and interaction.

Goals

  • Implement a GPT-2 style model in PyTorch.
  • Start with the TinyStories dataset for initial training.
  • Incrementally add more diverse and educational data.
  • Grow the model’s capabilities for language understanding and interaction.

HuggingFace Test

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("jasonacox/jojo-124M")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate text
input_text = "\n"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=100, num_return_sequences=1, do_sample=True, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Setup

  1. Create a Python virtual environment:

    python3 -m venv venv
    source venv/bin/activate
  2. Install dependencies:

    pip install -r requirements.txt
  3. Prepare your dataset: Run the following command to tokenize and prepare all datasets for training using the extended tokenizer:

    python prepare.py

    You can also process a specific dataset (e.g., chat, chitchat, or story):

    python prepare.py --dataset chat

    Available datasets:

    • chat: Human-Assistant formatted conversations for assistant-like interactions
    • chitchat: Simple greetings and short exchanges for basic interactions
    • story: The TinyStories dataset for simple stories
    • knowledge: General knowledge Q&A pairs for factual responses
    • dictionary: Word definitions for vocabulary and language understanding

    This will process all relevant .txt files in the data/ directory and create corresponding .bin files for efficient training.
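
    As a rough sketch of what this step produces (file names and the uint16 dtype are assumptions here, following common GPT-2 pipelines; prepare.py defines the actual layout):

    # Sketch: tokenize a text file into a flat binary of token IDs
    # that training can read efficiently. Paths and dtype are assumptions.
    import numpy as np
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    with open("data/chitchat.txt", "r", encoding="utf-8") as f:
        text = f.read()

    ids = enc.encode_ordinary(text)       # plain GPT-2 BPE for illustration
    arr = np.array(ids, dtype=np.uint16)  # GPT-2 vocab IDs fit in 16 bits
    arr.tofile("data/chitchat.bin")       # the .bin file used for training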

  4. Train the model: The training script now features a modern, modular architecture with enhanced performance and monitoring capabilities.

    Basic Usage:

    python train.py --dataset chitchat --epochs 1

    Advanced Configuration:

    # Use a configuration file
    python train.py --config configs/my_config.json
    
    # Override specific parameters
    python train.py --dataset chitchat --batch_size 16 --learning_rate 1e-4
    
    # Resume training from checkpoint
    python train.py --checkpoint models/model.pt --resume
    
    # Save configuration for reuse
    python train.py --save_config configs/my_setup.json

    Available Command Line Options:

    • --dataset: Dataset to train on (story, dailydialog, chat, chitchat, knowledge, dictionary)
    • --epochs: Number of training epochs (default: 1)
    • --batch_size: Batch size for training (default: 12)
    • --learning_rate: Learning rate (default: 6e-4)
    • --config: Load settings from JSON configuration file
    • --checkpoint: Path to checkpoint file to resume from
    • --load_model_only: Load only model weights (no optimizer/scheduler state)
    • --output_checkpoint: Custom output checkpoint path
    • --resume: Resume training from checkpoint (preserves epoch counter)
    • --device: Specify device (cuda:0, cuda:1, cpu, etc.)
    • --save_config: Save current configuration to JSON file
    • --show_config: Display all configuration settings before training
    • --eval_interval: Batch interval for validation (default: 50)
    • --log_interval: Batch interval for progress logging (default: 50)
    • --checkpoint_interval: Batch interval for checkpoint saving (0 = epoch end only)
    • --no_cache: Disable tokenization caching
    • --debug: Enable debug logging
    • --seed: Random seed for reproducibility (default: 1337)
    • --version: Show version information and exit

    Key Features:

    • 🚀 Fast Startup: Pre-tokenized data caching reduces startup time by 20-40%
    • 📊 Rich Progress Tracking: Real-time progress bars with ETA, loss, learning rate, and MFU
    • 🎯 Smart Device Selection: Automatically selects the GPU with the most free memory (see the sketch after this list)
    • 💾 Robust Checkpointing: Automatic checkpoint saving with graceful shutdown on Ctrl+C
    • ⚙️ Configuration Management: JSON-based config files with validation and inheritance
    • 🔄 Resume Training: Seamlessly continue training from any checkpoint
    • 📈 Loss Plotting: Automatic generation of loss curves as PNG files
    • 🔧 PyTorch Optimizations: TF32, pin_memory, memory_fraction, and autocast support
    • 📝 Training Summaries: Comprehensive training setup and progress reports
    • ⏱️ Batch-interval Checkpointing: Configurable checkpoint saving frequency
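
    The device-selection logic lives in utils.py; as a minimal sketch of the idea using only standard PyTorch calls (the function name is illustrative):

    # Sketch: choose the CUDA device with the most free memory.
    # pick_device() is an illustrative name, not the project's actual API.
    import torch

    def pick_device() -> str:
        if not torch.cuda.is_available():
            return "cpu"
        free_per_gpu = []
        for i in range(torch.cuda.device_count()):
            free, _total = torch.cuda.mem_get_info(i)  # bytes (free, total)
            free_per_gpu.append(free)
        best = max(range(len(free_per_gpu)), key=free_per_gpu.__getitem__)
        return f"cuda:{best}"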

    Performance Improvements:

    • 20-40% faster training through intelligent caching
    • 10-15% memory reduction via optimized tensor operations
    • 15-25% faster data loading with pre-tokenization
    • Enhanced batch processing for better GPU utilization

    Example Training Session:

    python train.py --dataset chitchat --epochs 1
    ╔═══════════════════════════════════════════════════════╗
    ║                Jojo LLM Training Program              ║
    ║                  Refactored Version                   ║
    ╚═══════════════════════════════════════════════════════╝
    
    CUDA devices available:
      [0] NVIDIA GeForce RTX 3090 - Free: 19.97 GB / Total: 23.57 GB
    
    ╔═══════════════════════════════════════════════════════╗
    ║                TRAINING CONFIGURATION                 ║
    ╚═══════════════════════════════════════════════════════╝
    Dataset:          chitchat
    Epochs:           1
    Batch size:       12
    Learning rate:    0.0006
    Device:           cuda:0
    Precision:        bfloat16
    
    [████████████████████████████████████████] Epoch 1/1 | 
    Batch 222/222 | 100.0% | Loss: 0.2567 | LR: 5.99e-04 | 
    ETA: Complete! | Samples/s: 1.7 | MFU: 18.8%
    
    Training completed successfully!
    
  5. Generate text: After training, you can generate text using your trained model:

    python gen.py [model_checkpoint] [options]

    Generation options:

    • Default model path is models/story5000.pt if not specified
    • --nonstop: Generate text continuously without a token limit
    • --prompt "Your text here": Specify a custom starting prompt
    • --prompt_file file.txt: Load a prompt from a file
    • --interactive: Enter interactive mode for multiple prompts
    • --seed 1234: Set random seed for reproducible generation
    • --temp 0.8: Set temperature (lower = more focused, higher = more random; see the sampling sketch after this list)
    • --max_tokens 500: Set maximum number of tokens to generate
    • --device cuda:0: Specify device (cuda:N, cpu, mps, or auto-detect if not specified)
    • --dtype float16: Choose precision (float32, bfloat16, float16)
    • --top_k 50: Limit vocabulary to top K options per token
    • --no_delay: Disable token generation delay for faster output
    • --output filename.txt: Save generated text to a file
    • --verbose: Show detailed model information
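
    How --temp and --top_k shape sampling, as a generic sketch (not gen.py's exact code):

    # Sketch: temperature and top-k filtering over next-token logits.
    import torch

    def sample_next(logits: torch.Tensor, temp: float = 0.8, top_k: int = 50) -> int:
        logits = logits / max(temp, 1e-8)           # <1 sharpens, >1 flattens
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[-1]] = float("-inf")      # drop all but the top k
        probs = torch.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()

    next_id = sample_next(torch.randn(50304))       # 50304 = padded GPT-2 vocab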

Special Tokens and Chat Format

This project uses the ChatML format with the following special tokens (a formatting sketch follows the list):

  • <|im_start|>user ...content... <|im_end|> - User messages
  • <|im_start|>assistant ...content... <|im_end|> - Assistant messages
  • <|im_start|>system ...content... <|im_end|> - System instructions
  • <|endoftext|> - Conversation separator
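
As a concrete illustration, a single exchange might be assembled like this (a sketch; the prepare scripts define the exact layout):

# Sketch: one ChatML-formatted exchange using the special tokens above.
conversation = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\nThe capital of France is Paris.<|im_end|>\n"
    "<|endoftext|>"
)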

Extended Tokenizer

The project includes an extended tokenizer that properly handles these special tokens as single tokens rather than multiple tokens. This approach:

  • Improves token efficiency by ~40% for conversational data
  • Helps the model better understand conversation structure
  • Enables more efficient training and generation

To use the extended tokenizer:

from setup_tokenizer import get_extended_tokenizer

# Get the extended tokenizer
enc = get_extended_tokenizer()

# Encode text with special tokens (always use allowed_special="all")
tokens = enc.encode("Hello <|im_start|>user text <|im_end|>", allowed_special="all")
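
To see the efficiency gain, compare token counts against the stock GPT-2 encoding (a sketch; exact counts depend on the text):

# Sketch: one ChatML marker costs several BPE pieces in stock GPT-2,
# but far fewer tokens in the extended tokenizer.
import tiktoken
from setup_tokenizer import get_extended_tokenizer

base = tiktoken.get_encoding("gpt2")
ext = get_extended_tokenizer()

text = "<|im_start|>user"
print(len(base.encode(text)))                        # several tokens
print(len(ext.encode(text, allowed_special="all")))  # far fewer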

The extended tokenizer is automatically used in:

  • The data preparation scripts
  • The generation script (gen.py) when using chat mode

Using Chat Mode

To use the interactive chat mode with proper formatting:

python gen.py models/your_model.pt --chat

This will start an interactive chat session that:

  • Properly formats messages using the special tokens
  • Handles conversation history (sketched below)
  • Uses the extended tokenizer for efficient token processing
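
A minimal sketch of how such a loop can maintain history (illustrative only; generate() below stands in for a call into the trained model and is not gen.py's actual API):

# Sketch: accumulate ChatML-formatted history across chat turns.
history = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"

while True:
    user_text = input("You: ")
    history += f"<|im_start|>user\n{user_text}<|im_end|>\n<|im_start|>assistant\n"
    reply = generate(history)      # hypothetical model call
    history += f"{reply}<|im_end|>\n"
    print("Jojo:", reply)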

For more information on implementation details, see the setup_tokenizer.py and tokenizer_demo.py files.

Architecture Overview

The training system is built with a modular architecture for better maintainability and performance:

Core Modules

  • train.py - Main training script with argument parsing and orchestration
  • config.py - Configuration management with dataclasses and JSON support
  • trainer.py - Core training loop with checkpointing and evaluation
  • data_loader.py - Optimized data loading with pre-tokenization caching
  • utils.py - Utility functions for progress tracking, metrics, and device management

Configuration System

Training parameters are managed through a hierarchical configuration system:

{
  "model": {
    "n_layer": 12,
    "n_head": 12, 
    "n_embd": 768,
    "block_size": 1024,
    "dropout": 0.2,
    "bias": false,
    "vocab_size": 50304
  },
  "training": {
    "max_epochs": 1,
    "batch_size": 12,
    "eval_interval": 50,
    "log_interval": 50,
    "checkpoint_interval": 20,
    "gradient_accumulation_steps": 40,
    "eval_iters": 200,
    "save_checkpoints": true,
    "compile_model": true
  },
  "optimizer": {
    "learning_rate": 0.0006,
    "weight_decay": 0.1,
    "beta1": 0.9,
    "beta2": 0.95,
    "grad_clip": 1.0
  },
  "scheduler": {
    "decay_lr": true,
    "warmup_iters": 2000,
    "lr_decay_iters": 60000,
    "min_lr": 6e-05,
    "warmup_fraction": 0.1,
    "cooldown_fraction": 0.9
  },
  "system": {
    "device": "cuda",
    "dtype": "bfloat16",
    "seed": 1337,
    "num_workers": 8,
    "pin_memory": true,
    "memory_fraction": 0.9,
    "optimize_memory": true,
    "allow_tf32_matmul": true,
    "allow_tf32_cudnn": true
  },
  "data": {
    "dataset_name": "chitchat",
    "data_dir": "data/",
    "cache_tokenized": true,
    "cache_dir": "cache/"
  }
}

Create custom configurations in the configs/ directory and load them with:

python train.py --config configs/my_config.json
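
config.py maps this JSON onto dataclasses; the idea, in a reduced sketch (field set trimmed for illustration):

# Sketch: load one section of a JSON config into a typed dataclass.
import json
from dataclasses import dataclass, fields

@dataclass
class TrainingConfig:
    max_epochs: int = 1
    batch_size: int = 12
    eval_interval: int = 50

def load_training_config(path: str) -> TrainingConfig:
    with open(path) as f:
        raw = json.load(f).get("training", {})
    known = {f.name for f in fields(TrainingConfig)}
    return TrainingConfig(**{k: v for k, v in raw.items() if k in known})

cfg = load_training_config("configs/my_config.json")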

Performance Features

  • High-Efficiency Packed Data Loader: Revolutionary packed data loading achieving >98% token utilization vs. 5-6% in traditional approaches (17x improvement)
  • Intelligent Batch Packing: Conversations are concatenated and split optimally to fill batches completely with real training data (sketched after this list)
  • Pre-tokenization Caching: Datasets are tokenized once and cached for subsequent runs
  • Efficient Memory Management: Optimized tensor operations and memory reuse with configurable memory fraction
  • Smart Device Selection: Automatically selects the GPU with most available memory
  • Gradient Accumulation: Support for effective larger batch sizes on limited hardware
  • Mixed Precision Training: Automatic FP16/BF16 support for faster training
  • TF32 Optimizations: Modern GPU acceleration with configurable TF32 support for matmul and cuDNN
  • Pin Memory & Non-blocking Transfer: Optimized CPU-GPU data transfer for CUDA devices
  • Loss Plotting: Automatic generation of training and validation loss curves as PNG files
  • MFU Tracking: Real-time Model FLOPs Utilization monitoring for performance optimization
  • Batch-interval Checkpointing: Configurable checkpoint saving frequency for long training runs
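
The packing idea in miniature (a sketch, not simple_packed_loader.py itself): concatenate all tokenized conversations into one stream, then carve the stream into full windows so every position carries a real training token.

# Sketch: pack tokenized conversations into completely-full batches.
import torch

def packed_batches(conversations, block_size=1024, batch_size=12):
    # One long stream: conversations back to back, no padding anywhere.
    stream = torch.tensor(
        [t for conv in conversations for t in conv], dtype=torch.long
    )
    step = block_size * batch_size
    for start in range(0, stream.numel() - step - 1, step):
        chunk = stream[start : start + step + 1]
        x = chunk[:-1].view(batch_size, block_size)  # inputs
        y = chunk[1:].view(batch_size, block_size)   # next-token targets
        yield x, y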

File Overview

Core Training System

  • train.py: Main training script with modular architecture and rich progress tracking
  • config.py: Configuration management system with JSON support and validation
  • trainer.py: Core training loop with robust checkpointing, evaluation, and loss plotting
  • simple_packed_loader.py: High-efficiency packed data loader achieving >98% token utilization
  • data_loader.py: Legacy data loading with pre-tokenization caching (superseded by packed loader)
  • utils.py: Utility classes for progress tracking, metrics, device management, and plotting
  • model.py: GPT model architecture and layers

Data and Generation

  • gen.py: Text generation with interactive chat mode and flexible options
  • setup_tokenizer.py: Extended tokenizer with special token support for ChatML format
  • prepare.py: Dataset preprocessing utility using the extended tokenizer

Legacy and Reference

  • train_old.py: Original training script (preserved for reference)
  • improvements_guide.py: Migration guide and feature comparison

Performance and Analysis Tools

  • test_mfu_optimization.py: MFU (Model FLOPs Utilization) testing and optimization
  • test_memory_limits.py: Memory usage analysis and GPU capacity testing
  • analyze_mfu.py: Comprehensive MFU analysis and recommendations
  • regenerate_plot.py: Standalone plot generation for existing checkpoints

Model Conversion and Distribution

  • upload_to_huggingface.py: Convert and upload Jojo models to Hugging Face Hub

Data Preparation Scripts

  • data/prepare-story.py: Prepare the TinyStories dataset for training (download, tokenize and convert to binary format).
  • data/prepare-chat.py, data/prepare-chitchat.py: Prepare conversational datasets with ChatML formatting.
  • data/prepare-knowledge.py: Prepare a general knowledge Q&A dataset from SQuAD, optionally using a local LLM to generate or reformat answers, with robust retry handling.

Development and Testing

  • testing_tools/: Directory containing additional testing and development utilities
  • test_extended_tokenizer.py: Test suite for the extended tokenizer
  • examples/: Example files and documentation for various features

Quick Start

  1. Setup environment:

    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
  2. Prepare data:

    python prepare.py --dataset chitchat
  3. Start training:

    # Basic training
    python train.py --dataset chitchat --epochs 1
    
    # Show configuration before training
    python train.py --dataset chitchat --epochs 1 --show_config
    
    # Use a configuration file
    python train.py --config configs/story-small.json
    
    # Training with custom checkpoint intervals
    python train.py --dataset chitchat --epochs 1 --checkpoint_interval 50
  4. Monitor training:

    • Real-time progress bars show loss, learning rate, ETA, and MFU
    • Loss plots automatically generated as PNG files in the models/ directory
    • Training summaries provide comprehensive setup and performance information
  5. Generate text:

    python gen.py models/chitchat_epoch1.pt --chat
  6. Upload to Hugging Face (Optional):

    # Convert and upload trained model to Hugging Face Hub
    python upload_to_huggingface.py models/chitchat_epoch1.pt \
      --repo-name my-jojo-model \
      --dataset chitchat

Hugging Face Integration

The project includes a comprehensive script for converting and uploading trained Jojo models to Hugging Face Hub.

Converting Models to Hugging Face Format

The upload_to_huggingface.py script automatically converts Jojo checkpoints to Hugging Face Transformers format:

# Basic upload (public repository)
python upload_to_huggingface.py models/my_model.pt \
  --repo-name my-jojo-model \
  --dataset story

# Upload to organization (private repository)
python upload_to_huggingface.py models/my_model.pt \
  --repo-name my-jojo-model \
  --dataset chitchat \
  --organization my-org \
  --private

# Dry run (convert but don't upload)
python upload_to_huggingface.py models/my_model.pt \
  --repo-name my-jojo-model \
  --dataset story \
  --dry-run

Features

  • Automatic Conversion: Converts Jojo checkpoints to standard Hugging Face format
  • Model Card Generation: Creates comprehensive model cards with training details, metrics, and usage examples
  • Tokenizer Integration: Includes proper tokenizer configuration (standard GPT-2 or extended)
  • Metadata Preservation: Preserves training configuration, metrics, and model architecture details
  • Repository Management: Creates repositories and handles uploads automatically
  • Safety Options: Dry-run mode for testing conversion without uploading

Command-Line Options

  • checkpoint: Path to Jojo checkpoint file (required)
  • --repo-name: Name for the Hugging Face repository (required)
  • --dataset: Name of training dataset for documentation (required)
  • --organization: Hugging Face organization (optional)
  • --private: Create private repository (default: public)
  • --tokenizer: Tokenizer type (gpt2 or extended, default: gpt2)
  • --dry-run: Convert model but don't upload
  • --output-dir: Local directory for converted model (default: ./hf_model)
  • --commit-message: Custom commit message for upload

Prerequisites

Install Hugging Face libraries:

pip install transformers huggingface_hub

Authenticate with Hugging Face:

huggingface-cli login

Using Uploaded Models

Once uploaded, models can be used with standard Hugging Face tools:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load your uploaded model
model = GPT2LMHeadModel.from_pretrained("username/my-jojo-model")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate text
input_text = "Once upon a time"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=100, do_sample=True, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Migration from Original Script

If you're upgrading from the original training script, see improvements_guide.py for a comprehensive migration guide:

python improvements_guide.py

The new system provides:

  • 20-40% faster training through pre-tokenization
  • Enhanced progress tracking with MFU and ETA
  • Robust configuration management
  • Improved error handling and graceful shutdown
  • Better checkpoint management with metadata

Notes

  • Modern Architecture: The training system has been completely refactored with a modular design for better maintainability and performance.
  • Smart Caching: Pre-tokenized datasets are cached automatically for 20-40% faster subsequent training runs.
  • Robust Checkpointing: Automatic checkpoint saving with graceful shutdown handling (Ctrl+C saves progress; sketched after these notes). Configurable batch-interval checkpointing for long training runs.
  • Configuration Management: Use JSON configuration files for reproducible training setups. Display all settings with --show_config.
  • Enhanced Monitoring: Real-time progress bars show ETA, samples/sec, MFU, loss, and learning rate. Automatic loss curve plotting.
  • PyTorch Optimizations: Full support for TF32, pin_memory, memory_fraction, and other modern PyTorch optimizations.
  • Device Management: Automatic GPU selection based on available memory, with manual override options.
  • Resume Training: Seamlessly continue training from any checkpoint with preserved or reset epoch counters.
  • MFU Analysis: Built-in Model FLOPs Utilization tracking and optimization tools for performance tuning.
  • PyTorch 2.6+ Compatibility: Full compatibility with latest PyTorch versions.
  • Hugging Face Integration: Professional model conversion and upload to Hugging Face Hub with automatic model cards.
  • Backward Compatibility: Original training script preserved as train_old.py for reference.
  • All dependencies are listed in requirements.txt.
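
The graceful-shutdown behavior can be pictured like this (a generic sketch; save_checkpoint() is a hypothetical helper, and trainer.py's handling differs in detail):

# Sketch: catch Ctrl+C, save a checkpoint, then exit cleanly.
import signal
import sys

def handle_interrupt(signum, frame):
    print("\nInterrupted - saving checkpoint before exit...")
    save_checkpoint("models/interrupted.pt")  # hypothetical helper
    sys.exit(0)

signal.signal(signal.SIGINT, handle_interrupt)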

For detailed information about improvements and migration, run:

python improvements_guide.py
