This project aims to build a large language model (LLM) from scratch, inspired by the GPT-2 architecture. We will begin training on the TinyStories dataset and gradually expand to include LLM-generated educational content. Over time, the model will be enhanced to better understand language and interaction.

- Implement a GPT-2 style model in PyTorch.
- Start with the TinyStories dataset for initial training.
- Incrementally add more diverse and educational data.
- Grow the model’s capabilities for language understanding and interaction.
A trained Jojo checkpoint published to Hugging Face (for example, `jasonacox/jojo-124M`) can be loaded and sampled with the `transformers` library:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("jasonacox/jojo-124M")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate text
input_text = "\n"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=100, num_return_sequences=1, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
- Create a Python virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
```
- Install dependencies:

```bash
pip install -r requirements.txt
```
- Prepare your dataset: Run the following command to tokenize and prepare all datasets for training using the extended tokenizer:

```bash
python prepare.py
```

You can also process a specific dataset (e.g., chat, chitchat, or story):

```bash
python prepare.py --dataset chat
```

Available datasets:

- `chat`: Human-Assistant formatted conversations for assistant-like interactions
- `chitchat`: Simple greetings and short exchanges for basic interactions
- `story`: The TinyStories dataset for simple stories
- `knowledge`: General knowledge Q&A pairs for factual responses
- `dictionary`: Word definitions for vocabulary and language understanding

This will process all relevant `.txt` files in the `data/` directory and create corresponding `.bin` files for efficient training.
- Train the model: The training script now features a modern, modular architecture with enhanced performance and monitoring capabilities.

Basic Usage:

```bash
python train.py --dataset chitchat --epochs 1
```

Advanced Configuration:

```bash
# Use a configuration file
python train.py --config configs/my_config.json

# Override specific parameters
python train.py --dataset chitchat --batch_size 16 --learning_rate 1e-4

# Resume training from checkpoint
python train.py --checkpoint models/model.pt --resume

# Save configuration for reuse
python train.py --save_config configs/my_setup.json
```
Available Command Line Options:
- `--dataset`: Dataset to train on (story, dailydialog, chat, chitchat, knowledge, dictionary)
- `--epochs`: Number of training epochs (default: 1)
- `--batch_size`: Batch size for training (default: 12)
- `--learning_rate`: Learning rate (default: 6e-4)
- `--config`: Load settings from JSON configuration file
- `--checkpoint`: Path to checkpoint file to resume from
- `--load_model_only`: Load only model weights (no optimizer/scheduler state)
- `--output_checkpoint`: Custom output checkpoint path
- `--resume`: Resume training from checkpoint (preserves epoch counter)
- `--device`: Specify device (cuda:0, cuda:1, cpu, etc.)
- `--save_config`: Save current configuration to JSON file
- `--show_config`: Display all configuration settings before training
- `--eval_interval`: Batch interval for validation (default: 50)
- `--log_interval`: Batch interval for progress logging (default: 50)
- `--checkpoint_interval`: Batch interval for checkpoint saving (0 = epoch end only)
- `--no_cache`: Disable tokenization caching
- `--debug`: Enable debug logging
- `--seed`: Random seed for reproducibility (default: 1337)
- `--version`: Show version information and exit
Key Features:
- 🚀 Fast Startup: Pre-tokenized data caching reduces startup time by 20-40%
- 📊 Rich Progress Tracking: Real-time progress bars with ETA, loss, learning rate, and MFU
- 🎯 Smart Device Selection: Automatically selects GPU with most free memory
- 💾 Robust Checkpointing: Automatic checkpoint saving with graceful shutdown on Ctrl+C (see the sketch after this list)
- ⚙️ Configuration Management: JSON-based config files with validation and inheritance
- 🔄 Resume Training: Seamlessly continue training from any checkpoint
- 📈 Loss Plotting: Automatic generation of loss curves as PNG files
- 🔧 PyTorch Optimizations: TF32, pin_memory, memory_fraction, and autocast support
- 📝 Training Summaries: Comprehensive training setup and progress reports
- ⏱️ Batch-interval Checkpointing: Configurable checkpoint saving frequency
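The graceful-shutdown behavior can be pictured with a short sketch: a SIGINT handler sets a flag that the training loop checks at each batch boundary, saving a checkpoint before exiting. This is an illustration of the pattern only, not the code in `trainer.py`; the model interface and checkpoint fields shown are assumptions.

```python
import signal
import torch

stop_requested = False

def _handle_sigint(signum, frame):
    # Ask the training loop to stop cleanly at the next batch boundary
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, _handle_sigint)

def train_loop(model, optimizer, batches, checkpoint_path="models/interrupted.pt"):
    for step, (x, y) in enumerate(batches):
        logits, loss = model(x, y)  # assumed interface: returns (logits, loss) given targets
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        if stop_requested:
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, checkpoint_path)
            print(f"Interrupted: checkpoint saved to {checkpoint_path}")
            break
```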
Performance Improvements:
- 20-40% faster training through intelligent caching
- 10-15% memory reduction via optimized tensor operations
- 15-25% faster data loading with pre-tokenization
- Enhanced batch processing for better GPU utilization
Example Training Session:

```bash
python train.py --dataset chitchat --epochs 1
```

```text
╔═══════════════════════════════════════════════════════╗
║               Jojo LLM Training Program                ║
║                  Refactored Version                    ║
╚═══════════════════════════════════════════════════════╝

CUDA devices available:
  [0] NVIDIA GeForce RTX 3090 - Free: 19.97 GB / Total: 23.57 GB

╔═══════════════════════════════════════════════════════╗
║                TRAINING CONFIGURATION                  ║
╚═══════════════════════════════════════════════════════╝
Dataset: chitchat
Epochs: 1
Batch size: 12
Learning rate: 0.0006
Device: cuda:0
Precision: bfloat16

[████████████████████████████████████████] Epoch 1/1 | Batch 222/222 | 100.0% | Loss: 0.2567 | LR: 5.99e-04 | ETA: Complete! | Samples/s: 1.7 | MFU: 18.8%

Training completed successfully!
```
- Generate text: After training, you can generate text using your trained model:

```bash
python gen.py [model_checkpoint] [options]
```

Generation options:

- Default model path is `models/story5000.pt` if not specified
- `--nonstop`: Generate text continuously without a token limit
- `--prompt "Your text here"`: Specify a custom starting prompt
- `--prompt_file file.txt`: Load a prompt from a file
- `--interactive`: Enter interactive mode for multiple prompts
- `--seed 1234`: Set random seed for reproducible generation
- `--temp 0.8`: Set temperature (lower = more focused, higher = more random)
- `--max_tokens 500`: Set maximum number of tokens to generate
- `--device cuda:0`: Specify device (cuda:N, cpu, mps, or auto-detect if not specified)
- `--dtype float16`: Choose precision (float32, bfloat16, float16)
- `--top_k 50`: Limit vocabulary to top K options per token
- `--no_delay`: Disable token generation delay for faster output
- `--output filename.txt`: Save generated text to a file
- `--verbose`: Show detailed model information
This project uses the ChatML format with the following special tokens:
- `<|im_start|>user ...content... <|im_end|>` - User messages
- `<|im_start|>assistant ...content... <|im_end|>` - Assistant messages
- `<|im_start|>system ...content... <|im_end|>` - System instructions
- `<|endoftext|>` - Conversation separator
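For example, a single training conversation in this format might look like the following (illustrative only; the exact system prompt and whitespace conventions used by the data preparation scripts may differ):

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello! How are you today?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you! How can I help?<|im_end|>
<|endoftext|>
```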
The project includes an extended tokenizer that properly handles these special tokens as single tokens rather than multiple tokens. This approach:
- Improves token efficiency by ~40% for conversational data
- Helps the model better understand conversation structure
- Enables more efficient training and generation
To use the extended tokenizer:
```python
from setup_tokenizer import get_extended_tokenizer

# Get the extended tokenizer
enc = get_extended_tokenizer()

# Encode text with special tokens (always use allowed_special="all")
tokens = enc.encode("Hello <|im_start|>user text <|im_end|>", allowed_special="all")
```
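To see the efficiency gain, you can compare how the standard GPT-2 encoding and the extended tokenizer handle the same ChatML markup. This is a rough sketch; it assumes the extended tokenizer is tiktoken-based (as the `allowed_special` argument above suggests) and simply registers the ChatML markers as additional special tokens:

```python
import tiktoken
from setup_tokenizer import get_extended_tokenizer

text = "<|im_start|>user\nHello!<|im_end|>"

# Standard GPT-2 BPE: the ChatML markers are plain text and get split
# into several sub-word tokens each.
base = tiktoken.get_encoding("gpt2")
print(len(base.encode(text)))

# Extended tokenizer: each marker is encoded as a single special token,
# so the same text uses noticeably fewer tokens.
enc = get_extended_tokenizer()
print(len(enc.encode(text, allowed_special="all")))
```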
The extended tokenizer is automatically used in:
- The data preparation scripts
- The generation script (gen.py) when using chat mode
To use the interactive chat mode with proper formatting:
```bash
python gen.py models/your_model.pt --chat
```
This will start an interactive chat session that:
- Properly formats messages using the special tokens
- Handles conversation history
- Uses the extended tokenizer for efficient token processing
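Conceptually, such a chat loop rebuilds the prompt from the running conversation history before each generation step. A minimal sketch of that formatting step (not the actual `gen.py` implementation; the helper name and default system prompt are illustrative):

```python
def build_chat_prompt(history, system_prompt="You are a helpful assistant."):
    """Format a list of (role, content) turns as a ChatML prompt.

    `history` is assumed to look like [("user", "Hi"), ("assistant", "Hello!")].
    """
    parts = [f"<|im_start|>system\n{system_prompt}<|im_end|>\n"]
    for role, content in history:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    # Leave the prompt open for the assistant's next reply
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

# The resulting string would then be tokenized with the extended tokenizer:
# enc.encode(build_chat_prompt(history), allowed_special="all")
```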
For more information on implementation details, see the `setup_tokenizer.py` and `tokenizer_demo.py` files.

The training system is built with a modular architecture for better maintainability and performance:
- `train.py` - Main training script with argument parsing and orchestration
- `config.py` - Configuration management with dataclasses and JSON support
- `trainer.py` - Core training loop with checkpointing and evaluation
- `data_loader.py` - Optimized data loading with pre-tokenization caching
- `utils.py` - Utility functions for progress tracking, metrics, and device management
Training parameters are managed through a hierarchical configuration system. An example configuration structure:

```json
{
  "model": {
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768,
    "block_size": 1024,
    "dropout": 0.2,
    "bias": false,
    "vocab_size": 50304
  },
  "training": {
    "max_epochs": 1,
    "batch_size": 12,
    "eval_interval": 50,
    "log_interval": 50,
    "checkpoint_interval": 20,
    "gradient_accumulation_steps": 40,
    "eval_iters": 200,
    "save_checkpoints": true,
    "compile_model": true
  },
  "optimizer": {
    "learning_rate": 0.0006,
    "weight_decay": 0.1,
    "beta1": 0.9,
    "beta2": 0.95,
    "grad_clip": 1.0
  },
  "scheduler": {
    "decay_lr": true,
    "warmup_iters": 2000,
    "lr_decay_iters": 60000,
    "min_lr": 6e-05,
    "warmup_fraction": 0.1,
    "cooldown_fraction": 0.9
  },
  "system": {
    "device": "cuda",
    "dtype": "bfloat16",
    "seed": 1337,
    "num_workers": 8,
    "pin_memory": true,
    "memory_fraction": 0.9,
    "optimize_memory": true,
    "allow_tf32_matmul": true,
    "allow_tf32_cudnn": true
  },
  "data": {
    "dataset_name": "chitchat",
    "data_dir": "data/",
    "cache_tokenized": true,
    "cache_dir": "cache/"
  }
}
```
Create custom configurations in the `configs/` directory and load them with:
```bash
python train.py --config configs/my_config.json
```
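As a reference for how such a hierarchical JSON file can be mapped onto typed configuration objects, here is a minimal dataclass-based loader. It is a sketch only; the actual classes, field names, and validation logic in `config.py` may differ.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class ModelConfig:
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    block_size: int = 1024
    dropout: float = 0.2
    bias: bool = False
    vocab_size: int = 50304

@dataclass
class TrainingConfig:
    max_epochs: int = 1
    batch_size: int = 12
    eval_interval: int = 50
    log_interval: int = 50
    gradient_accumulation_steps: int = 40

def load_section(cls, section):
    """Build a config dataclass from one JSON section, keeping defaults for missing keys."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in section.items() if k in known})

with open("configs/my_config.json") as f:
    raw = json.load(f)

model_cfg = load_section(ModelConfig, raw.get("model", {}))
training_cfg = load_section(TrainingConfig, raw.get("training", {}))
```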
- High-Efficiency Packed Data Loader: Revolutionary packed data loading achieving >98% token utilization vs. 5-6% in traditional approaches (17x improvement)
- Intelligent Batch Packing: Conversations are concatenated and split optimally to fill batches completely with real training data (see the sketch after this list)
- Pre-tokenization Caching: Datasets are tokenized once and cached for subsequent runs
- Efficient Memory Management: Optimized tensor operations and memory reuse with configurable memory fraction
- Smart Device Selection: Automatically selects the GPU with most available memory
- Gradient Accumulation: Support for effective larger batch sizes on limited hardware
- Mixed Precision Training: Automatic FP16/BF16 support for faster training
- TF32 Optimizations: Modern GPU acceleration with configurable TF32 support for matmul and cuDNN
- Pin Memory & Non-blocking Transfer: Optimized CPU-GPU data transfer for CUDA devices
- Loss Plotting: Automatic generation of training and validation loss curves as PNG files
- MFU Tracking: Real-time Model FLOPs Utilization monitoring for performance optimization
- Batch-interval Checkpointing: Configurable checkpoint saving frequency for long training runs
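The idea behind the packed loader can be captured in a few lines. The sketch below assumes conversations are already tokenized into lists of token IDs and uses a fixed `block_size`; the real `simple_packed_loader.py` is more elaborate (shuffling, caching, device placement):

```python
import torch

def pack_batches(tokenized_convos, block_size, batch_size):
    """Concatenate tokenized conversations into one stream and cut it into
    full windows, so nearly every position in a batch is a real training
    token rather than padding."""
    stream = [tok for convo in tokenized_convos for tok in convo]
    window = block_size + 1  # +1 so inputs and shifted targets both fit
    n_windows = len(stream) // window
    data = torch.tensor(stream[: n_windows * window], dtype=torch.long)
    data = data.view(n_windows, window)
    for i in range(0, n_windows - batch_size + 1, batch_size):
        chunk = data[i : i + batch_size]
        x, y = chunk[:, :-1], chunk[:, 1:]  # next-token prediction pairs
        yield x, y
```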
- `train.py`: Main training script with modular architecture and rich progress tracking
- `config.py`: Configuration management system with JSON support and validation
- `trainer.py`: Core training loop with robust checkpointing, evaluation, and loss plotting
- `simple_packed_loader.py`: High-efficiency packed data loader achieving >98% token utilization
- `data_loader.py`: Legacy data loading with pre-tokenization caching (superseded by packed loader)
- `utils.py`: Utility classes for progress tracking, metrics, device management, and plotting
- `model.py`: GPT model architecture and layers

- `gen.py`: Text generation with interactive chat mode and flexible options
- `setup_tokenizer.py`: Extended tokenizer with special token support for ChatML format
- `prepare.py`: Dataset preprocessing utility using the extended tokenizer

- `train_old.py`: Original training script (preserved for reference)
- `improvements_guide.py`: Migration guide and feature comparison

- `test_mfu_optimization.py`: MFU (Model FLOPs Utilization) testing and optimization
- `test_memory_limits.py`: Memory usage analysis and GPU capacity testing
- `analyze_mfu.py`: Comprehensive MFU analysis and recommendations
- `regenerate_plot.py`: Standalone plot generation for existing checkpoints

- `upload_to_huggingface.py`: Convert and upload Jojo models to Hugging Face Hub

- `data/prepare-story.py`: Prepare the TinyStories dataset for training (download, tokenize, and convert to binary format)
- `data/prepare-chat.py`, `data/prepare-chitchat.py`: Prepare conversational datasets with ChatML formatting
- `data/prepare-knowledge.py`: Prepare a general knowledge Q&A dataset using SQuAD and optionally a local LLM for answer generation or reformatting, with robust retry capability

- `testing_tools/`: Directory containing additional testing and development utilities
- `test_extended_tokenizer.py`: Test suite for the extended tokenizer
- `examples/`: Example files and documentation for various features
- `upload_to_huggingface.py`: Convert and upload trained models to Hugging Face Hub
- Setup environment:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

- Prepare data:

```bash
python prepare.py --dataset chitchat
```

- Start training:

```bash
# Basic training
python train.py --dataset chitchat --epochs 1

# Show configuration before training
python train.py --dataset chitchat --epochs 1 --show_config

# Use a configuration file
python train.py --config configs/story-small.json

# Training with custom checkpoint intervals
python train.py --dataset chitchat --epochs 1 --checkpoint_interval 50
```

- Monitor training:
  - Real-time progress bars show loss, learning rate, ETA, and MFU
  - Loss plots automatically generated as PNG files in the `models/` directory
  - Training summaries provide comprehensive setup and performance information

- Generate text:

```bash
python gen.py models/chitchat_epoch1.pt --chat
```

- Upload to Hugging Face (Optional):

```bash
# Convert and upload trained model to Hugging Face Hub
python upload_to_huggingface.py models/chitchat_epoch1.pt \
    --repo-name my-jojo-model \
    --dataset chitchat
```
The project includes a comprehensive script for converting and uploading trained Jojo models to Hugging Face Hub.
The `upload_to_huggingface.py` script automatically converts Jojo checkpoints to Hugging Face Transformers format:
```bash
# Basic upload (public repository)
python upload_to_huggingface.py models/my_model.pt \
    --repo-name my-jojo-model \
    --dataset story

# Upload to organization (private repository)
python upload_to_huggingface.py models/my_model.pt \
    --repo-name my-jojo-model \
    --dataset chitchat \
    --organization my-org \
    --private

# Dry run (convert but don't upload)
python upload_to_huggingface.py models/my_model.pt \
    --repo-name my-jojo-model \
    --dataset story \
    --dry-run
```
- Automatic Conversion: Converts Jojo checkpoints to standard Hugging Face format
- Model Card Generation: Creates comprehensive model cards with training details, metrics, and usage examples
- Tokenizer Integration: Includes proper tokenizer configuration (standard GPT-2 or extended)
- Metadata Preservation: Preserves training configuration, metrics, and model architecture details
- Repository Management: Creates repositories and handles uploads automatically
- Safety Options: Dry-run mode for testing conversion without uploading
- `checkpoint`: Path to Jojo checkpoint file (required)
- `--repo-name`: Name for the Hugging Face repository (required)
- `--dataset`: Name of training dataset for documentation (required)
- `--organization`: Hugging Face organization (optional)
- `--private`: Create private repository (default: public)
- `--tokenizer`: Tokenizer type (`gpt2` or `extended`, default: `gpt2`)
- `--dry-run`: Convert model but don't upload
- `--output-dir`: Local directory for converted model (default: `./hf_model`)
- `--commit-message`: Custom commit message for upload
Install Hugging Face libraries:

```bash
pip install transformers huggingface_hub
```

Authenticate with Hugging Face:

```bash
huggingface-cli login
```
Once uploaded, models can be used with standard Hugging Face tools:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load your uploaded model
model = GPT2LMHeadModel.from_pretrained("username/my-jojo-model")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate text
input_text = "Once upon a time"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=100, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
If you're upgrading from the original training script, see `improvements_guide.py` for a comprehensive migration guide:

```bash
python improvements_guide.py
```
The new system provides:
- 20-40% faster training through pre-tokenization
- Enhanced progress tracking with MFU and ETA
- Robust configuration management
- Improved error handling and graceful shutdown
- Better checkpoint management with metadata
- Modern Architecture: The training system has been completely refactored with a modular design for better maintainability and performance.
- Smart Caching: Pre-tokenized datasets are cached automatically for 20-40% faster subsequent training runs.
- Robust Checkpointing: Automatic checkpoint saving with graceful shutdown handling (Ctrl+C saves progress). Configurable batch-interval checkpointing for long training runs.
- Configuration Management: Use JSON configuration files for reproducible training setups. Display all settings with `--show_config`.
- Enhanced Monitoring: Real-time progress bars show ETA, samples/sec, MFU, loss, and learning rate. Automatic loss curve plotting.
- PyTorch Optimizations: Full support for TF32, pin_memory, memory_fraction, and other modern PyTorch optimizations.
- Device Management: Automatic GPU selection based on available memory, with manual override options.
- Resume Training: Seamlessly continue training from any checkpoint with preserved or reset epoch counters.
- MFU Analysis: Built-in Model FLOPs Utilization tracking and optimization tools for performance tuning (see the sketch after this list).
- PyTorch 2.6+ Compatibility: Full compatibility with latest PyTorch versions.
- Hugging Face Integration: Professional model conversion and upload to Hugging Face Hub with automatic model cards.
- Backward Compatibility: Original training script preserved as `train_old.py` for reference.
- All dependencies are listed in `requirements.txt`.
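For reference, MFU is defined as the model FLOPs actually achieved per second divided by the hardware's peak FLOPs per second. Below is a sketch of the widely used estimate (as in the PaLM paper and nanoGPT-style trainers); the project's exact accounting may differ, and the peak-FLOPs figure must match your GPU and dtype.

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, block_size,
                 tokens_per_iter, dt_seconds, peak_flops):
    """Estimate Model FLOPs Utilization for one training iteration.

    peak_flops is hardware-specific, e.g. the GPU's advertised dense
    bf16/fp16 tensor throughput in FLOPs per second.
    """
    # ~6 FLOPs per parameter per token (forward + backward), plus the
    # attention term that scales with sequence length.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_achieved = flops_per_token * tokens_per_iter / dt_seconds
    return flops_achieved / peak_flops
```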
For detailed information about improvements and migration, run:
```bash
python improvements_guide.py
```