This project aims to build a large language model (LLM) from scratch, inspired by the GPT-2 architecture. We will begin training on the TinyStories dataset and gradually expand to include LLM-generated educational content. Over time, the model will be enhanced to better understand language and interaction.

- Implement a GPT-2 style model in PyTorch.
- Start with the TinyStories dataset for initial training.
- Incrementally add more diverse and educational data.
- Grow the model’s capabilities for language understanding and interaction.
A trained Jojo checkpoint published to Hugging Face (for example, `jasonacox/jojo-124M`) can be loaded and sampled with the `transformers` library:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("jasonacox/jojo-124M")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate text
input_text = "\n"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=100, num_return_sequences=1, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
- Create a Python virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
```
- Install dependencies:

```bash
pip install -r requirements.txt
```
- Prepare your dataset: Run the following command to tokenize and prepare all datasets for training using the extended tokenizer:

```bash
python prepare.py
```

You can also process a specific dataset (e.g., chat, chitchat, or story):

```bash
python prepare.py --dataset chat
```

Available datasets:

- `chat`: Human-Assistant formatted conversations for assistant-like interactions
- `chitchat`: Simple greetings and short exchanges for basic interactions
- `story`: The TinyStories dataset for simple stories
- `knowledge`: General knowledge Q&A pairs for factual responses
- `dictionary`: Word definitions for vocabulary and language understanding

This will process all relevant `.txt` files in the `data/` directory and create corresponding `.bin` files for efficient training.
- Train the model: The training script now features a modern, modular architecture with enhanced performance and monitoring capabilities.

Basic Usage:

```bash
python train.py --dataset chitchat --epochs 1
```

Advanced Configuration:

```bash
# Use a configuration file
python train.py --config configs/my_config.json

# Override specific parameters
python train.py --dataset chitchat --batch_size 16 --learning_rate 1e-4

# Resume training from checkpoint
python train.py --checkpoint models/model.pt --resume

# Save configuration for reuse
python train.py --save_config configs/my_setup.json
```
Available Command Line Options:
- `--dataset`: Dataset to train on (story, dailydialog, chat, chitchat, knowledge, dictionary)
- `--epochs`: Number of training epochs (default: 1)
- `--batch_size`: Batch size for training (default: 12)
- `--learning_rate`: Learning rate (default: 6e-4)
- `--config`: Load settings from JSON configuration file
- `--checkpoint`: Path to checkpoint file to resume from
- `--load_model_only`: Load only model weights (no optimizer/scheduler state)
- `--output_checkpoint`: Custom output checkpoint path
- `--resume`: Resume training from checkpoint (preserves epoch counter)
- `--device`: Specify device (cuda:0, cuda:1, cpu, etc.)
- `--save_config`: Save current configuration to JSON file
- `--show_config`: Display all configuration settings before training
- `--eval_interval`: Batch interval for validation (default: 50)
- `--log_interval`: Batch interval for progress logging (default: 50)
- `--checkpoint_interval`: Batch interval for checkpoint saving (0 = epoch end only)
- `--no_cache`: Disable tokenization caching
- `--debug`: Enable debug logging
- `--seed`: Random seed for reproducibility (default: 1337)
- `--version`: Show version information and exit
Key Features:
- 🚀 Fast Startup: Pre-tokenized data caching reduces startup time by 20-40%
- 📊 Rich Progress Tracking: Real-time progress bars with ETA, loss, learning rate, and MFU
- 🎯 Smart Device Selection: Automatically selects GPU with most free memory
- 💾 Robust Checkpointing: Automatic checkpoint saving with graceful shutdown on Ctrl+C (see the sketch after this list)
- ⚙️ Configuration Management: JSON-based config files with validation and inheritance
- 🔄 Resume Training: Seamlessly continue training from any checkpoint
- 📈 Loss Plotting: Automatic generation of loss curves as PNG files
- 🔧 PyTorch Optimizations: TF32, pin_memory, memory_fraction, and autocast support
- 📝 Training Summaries: Comprehensive training setup and progress reports
- ⏱️ Batch-interval Checkpointing: Configurable checkpoint saving frequency
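The graceful-shutdown behavior can be pictured with a short sketch: a SIGINT handler sets a flag that the training loop checks at each batch boundary, saving a checkpoint before exiting. This is an illustration of the pattern only, not the code in `trainer.py`; the model interface and checkpoint fields shown are assumptions.

```python
import signal
import torch

stop_requested = False

def _handle_sigint(signum, frame):
    # Ask the training loop to stop cleanly at the next batch boundary
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, _handle_sigint)

def train_loop(model, optimizer, batches, checkpoint_path="models/interrupted.pt"):
    for step, (x, y) in enumerate(batches):
        logits, loss = model(x, y)  # assumed interface: returns (logits, loss) given targets
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        if stop_requested:
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, checkpoint_path)
            print(f"Interrupted: checkpoint saved to {checkpoint_path}")
            break
```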
Performance Improvements:
- 20-40% faster training through intelligent caching
- 10-15% memory reduction via optimized tensor operations
- 15-25% faster data loading with pre-tokenization
- Enhanced batch processing for better GPU utilization
Example Training Session:

```bash
python train.py --dataset chitchat --epochs 1
```

```text
╔═══════════════════════════════════════════════════════╗
║               Jojo LLM Training Program                ║
║                  Refactored Version                    ║
╚═══════════════════════════════════════════════════════╝

CUDA devices available:
  [0] NVIDIA GeForce RTX 3090 - Free: 19.97 GB / Total: 23.57 GB

╔═══════════════════════════════════════════════════════╗
║                TRAINING CONFIGURATION                  ║
╚═══════════════════════════════════════════════════════╝
Dataset: chitchat
Epochs: 1
Batch size: 12
Learning rate: 0.0006
Device: cuda:0
Precision: bfloat16

[████████████████████████████████████████] Epoch 1/1 | Batch 222/222 | 100.0% | Loss: 0.2567 | LR: 5.99e-04 | ETA: Complete! | Samples/s: 1.7 | MFU: 18.8%

Training completed successfully!
```
- Generate text: After training, you can generate text using your trained model:

```bash
python gen.py [model_checkpoint] [options]
```

Generation options:

- Default model path is `models/story5000.pt` if not specified
- `--nonstop`: Generate text continuously without a token limit
- `--prompt "Your text here"`: Specify a custom starting prompt
- `--prompt_file file.txt`: Load a prompt from a file
- `--interactive`: Enter interactive mode for multiple prompts
- `--seed 1234`: Set random seed for reproducible generation
- `--temp 0.8`: Set temperature (lower = more focused, higher = more random)
- `--max_tokens 500`: Set maximum number of tokens to generate
- `--device cuda:0`: Specify device (cuda:N, cpu, mps, or auto-detect if not specified)
- `--dtype float16`: Choose precision (float32, bfloat16, float16)
- `--top_k 50`: Limit vocabulary to top K options per token
- `--no_delay`: Disable token generation delay for faster output
- `--output filename.txt`: Save generated text to a file
- `--verbose`: Show detailed model information
This project uses the ChatML format with the following special tokens:
- `<|im_start|>user ...content... <|im_end|>` - User messages
- `<|im_start|>assistant ...content... <|im_end|>` - Assistant messages
- `<|im_start|>system ...content... <|im_end|>` - System instructions
- `<|endoftext|>` - Conversation separator
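For example, a single training conversation in this format might look like the following (illustrative only; the exact system prompt and whitespace conventions used by the data preparation scripts may differ):

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello! How are you today?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you! How can I help?<|im_end|>
<|endoftext|>
```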
The project includes an extended tokenizer that properly handles these special tokens as single tokens rather than multiple tokens. This approach:
- Improves token efficiency by ~40% for conversational data
- Helps the model better understand conversation structure
- Enables more efficient training and generation
To use the extended tokenizer:
```python
from setup_tokenizer import get_extended_tokenizer

# Get the extended tokenizer
enc = get_extended_tokenizer()

# Encode text with special tokens (always use allowed_special="all")
tokens = enc.encode("Hello <|im_start|>user text <|im_end|>", allowed_special="all")
```
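To see the efficiency gain, you can compare how the standard GPT-2 encoding and the extended tokenizer handle the same ChatML markup. This is a rough sketch; it assumes the extended tokenizer is tiktoken-based (as the `allowed_special` argument above suggests) and simply registers the ChatML markers as additional special tokens:

```python
import tiktoken
from setup_tokenizer import get_extended_tokenizer

text = "<|im_start|>user\nHello!<|im_end|>"

# Standard GPT-2 BPE: the ChatML markers are plain text and get split
# into several sub-word tokens each.
base = tiktoken.get_encoding("gpt2")
print(len(base.encode(text)))

# Extended tokenizer: each marker is encoded as a single special token,
# so the same text uses noticeably fewer tokens.
enc = get_extended_tokenizer()
print(len(enc.encode(text, allowed_special="all")))
```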
The extended tokenizer is automatically used in:
- The data preparation scripts
- The generation script (gen.py) when using chat mode
To use the interactive chat mode with proper formatting:
```bash
python gen.py models/your_model.pt --chat
```
This will start an interactive chat session that:
- Properly formats messages using the special tokens
- Handles conversation history
- Uses the extended tokenizer for efficient token processing
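Conceptually, such a chat loop rebuilds the prompt from the running conversation history before each generation step. A minimal sketch of that formatting step (not the actual `gen.py` implementation; the helper name and default system prompt are illustrative):

```python
def build_chat_prompt(history, system_prompt="You are a helpful assistant."):
    """Format a list of (role, content) turns as a ChatML prompt.

    `history` is assumed to look like [("user", "Hi"), ("assistant", "Hello!")].
    """
    parts = [f"<|im_start|>system\n{system_prompt}<|im_end|>\n"]
    for role, content in history:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    # Leave the prompt open for the assistant's next reply
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

# The resulting string would then be tokenized with the extended tokenizer:
# enc.encode(build_chat_prompt(history), allowed_special="all")
```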
For more information on implementation details, see the `setup_tokenizer.py` and `tokenizer_demo.py` files.

The training system is built with a modular architecture for better maintainability and performance:
- `train.py` - Main training script with argument parsing and orchestration
- `config.py` - Configuration management with dataclasses and JSON support
- `trainer.py` - Core training loop with checkpointing and evaluation
- `data_loader.py` - Optimized data loading with pre-tokenization caching
- `utils.py` - Utility functions for progress tracking, metrics, and device management
Training parameters are managed through a hierarchical configuration system. An example configuration structure:

```json
{
  "model": {
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768,
    "block_size": 1024,
    "dropout": 0.2,
    "bias": false,
    "vocab_size": 50304
  },
  "training": {
    "max_epochs": 1,
    "batch_size": 12,
    "eval_interval": 50,
    "log_interval": 50,
    "checkpoint_interval": 20,
    "gradient_accumulation_steps": 40,
    "eval_iters": 200,
    "save_checkpoints": true,
    "compile_model": true
  },
  "optimizer": {
    "learning_rate": 0.0006,
    "weight_decay": 0.1,
    "beta1": 0.9,
    "beta2": 0.95,
    "grad_clip": 1.0
  },
  "scheduler": {
    "decay_lr": true,
    "warmup_iters": 2000,
    "lr_decay_iters": 60000,
    "min_lr": 6e-05,
    "warmup_fraction": 0.1,
    "cooldown_fraction": 0.9
  },
  "system": {
    "device": "cuda",
    "dtype": "bfloat16",
    "seed": 1337,
    "num_workers": 8,
    "pin_memory": true,
    "memory_fraction": 0.9,
    "optimize_memory": true,
    "allow_tf32_matmul": true,
    "allow_tf32_cudnn": true
  },
  "data": {
    "dataset_name": "chitchat",
    "data_dir": "data/",
    "cache_tokenized": true,
    "cache_dir": "cache/"
  }
}
```
Create custom configurations in the `configs/` directory and load them with:
```bash
python train.py --config configs/my_config.json
```
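As a reference for how such a hierarchical JSON file can be mapped onto typed configuration objects, here is a minimal dataclass-based loader. It is a sketch only; the actual classes, field names, and validation logic in `config.py` may differ.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class ModelConfig:
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    block_size: int = 1024
    dropout: float = 0.2
    bias: bool = False
    vocab_size: int = 50304

@dataclass
class TrainingConfig:
    max_epochs: int = 1
    batch_size: int = 12
    eval_interval: int = 50
    log_interval: int = 50
    gradient_accumulation_steps: int = 40

def load_section(cls, section):
    """Build a config dataclass from one JSON section, keeping defaults for missing keys."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in section.items() if k in known})

with open("configs/my_config.json") as f:
    raw = json.load(f)

model_cfg = load_section(ModelConfig, raw.get("model", {}))
training_cfg = load_section(TrainingConfig, raw.get("training", {}))
```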
- High-Efficiency Packed Data Loader: Revolutionary packed data loading achieving >98% token utilization vs. 5-6% in traditional approaches (17x improvement)
- Intelligent Batch Packing: Conversations are concatenated and split optimally to fill batches completely with real training data (see the sketch after this list)
- Pre-tokenization Caching: Datasets are tokenized once and cached for subsequent runs
- Efficient Memory Management: Optimized tensor operations and memory reuse with configurable memory fraction
- Smart Device Selection: Automatically selects the GPU with most available memory
- Gradient Accumulation: Support for effective larger batch sizes on limited hardware
- Mixed Precision Training: Automatic FP16/BF16 support for faster training
- TF32 Optimizations: Modern GPU acceleration with configurable TF32 support for matmul and cuDNN
- Pin Memory & Non-blocking Transfer: Optimized CPU-GPU data transfer for CUDA devices
- Loss Plotting: Automatic generation of training and validation loss curves as PNG files
- MFU Tracking: Real-time Model FLOPs Utilization monitoring for performance optimization
- Batch-interval Checkpointing: Configurable checkpoint saving frequency for long training runs
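The idea behind the packed loader can be captured in a few lines. The sketch below assumes conversations are already tokenized into lists of token IDs and uses a fixed `block_size`; the real `simple_packed_loader.py` is more elaborate (shuffling, caching, device placement):

```python
import torch

def pack_batches(tokenized_convos, block_size, batch_size):
    """Concatenate tokenized conversations into one stream and cut it into
    full windows, so nearly every position in a batch is a real training
    token rather than padding."""
    stream = [tok for convo in tokenized_convos for tok in convo]
    window = block_size + 1  # +1 so inputs and shifted targets both fit
    n_windows = len(stream) // window
    data = torch.tensor(stream[: n_windows * window], dtype=torch.long)
    data = data.view(n_windows, window)
    for i in range(0, n_windows - batch_size + 1, batch_size):
        chunk = data[i : i + batch_size]
        x, y = chunk[:, :-1], chunk[:, 1:]  # next-token prediction pairs
        yield x, y
```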
- `train.py`: Main training script with modular architecture and rich progress tracking
- `config.py`: Configuration management system with JSON support and validation
- `trainer.py`: Core training loop with robust checkpointing, evaluation, and loss plotting
- `simple_packed_loader.py`: High-efficiency packed data loader achieving >98% token utilization
- `data_loader.py`: Legacy data loading with pre-tokenization caching (superseded by packed loader)
- `utils.py`: Utility classes for progress tracking, metrics, device management, and plotting
- `model.py`: GPT model architecture and layers

- `gen.py`: Text generation with interactive chat mode and flexible options
- `setup_tokenizer.py`: Extended tokenizer with special token support for ChatML format
- `prepare.py`: Dataset preprocessing utility using the extended tokenizer

- `train_old.py`: Original training script (preserved for reference)
- `improvements_guide.py`: Migration guide and feature comparison

- `test_mfu_optimization.py`: MFU (Model FLOPs Utilization) testing and optimization
- `test_memory_limits.py`: Memory usage analysis and GPU capacity testing
- `analyze_mfu.py`: Comprehensive MFU analysis and recommendations
- `regenerate_plot.py`: Standalone plot generation for existing checkpoints

- `upload_to_huggingface.py`: Convert and upload Jojo models to Hugging Face Hub

- `data/prepare-story.py`: Prepare the TinyStories dataset for training (download, tokenize, and convert to binary format)
- `data/prepare-chat.py`, `data/prepare-chitchat.py`: Prepare conversational datasets with ChatML formatting
- `data/prepare-knowledge.py`: Prepare a general knowledge Q&A dataset using SQuAD and optionally a local LLM for answer generation or reformatting, with robust retry capability

- `testing_tools/`: Directory containing additional testing and development utilities
- `test_extended_tokenizer.py`: Test suite for the extended tokenizer
- `examples/`: Example files and documentation for various features
- `upload_to_huggingface.py`: Convert and upload trained models to Hugging Face Hub
- Setup environment:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

- Prepare data:

```bash
python prepare.py --dataset chitchat
```

- Start training:

```bash
# Basic training
python train.py --dataset chitchat --epochs 1

# Show configuration before training
python train.py --dataset chitchat --epochs 1 --show_config

# Use a configuration file
python train.py --config configs/story-small.json

# Training with custom checkpoint intervals
python train.py --dataset chitchat --epochs 1 --checkpoint_interval 50
```

- Monitor training:
  - Real-time progress bars show loss, learning rate, ETA, and MFU
  - Loss plots automatically generated as PNG files in the `models/` directory
  - Training summaries provide comprehensive setup and performance information

- Generate text:

```bash
python gen.py models/chitchat_epoch1.pt --chat
```

- Upload to Hugging Face (Optional):

```bash
# Convert and upload trained model to Hugging Face Hub
python upload_to_huggingface.py models/chitchat_epoch1.pt \
    --repo-name my-jojo-model \
    --dataset chitchat
```
The project includes a comprehensive script for converting and uploading trained Jojo models to Hugging Face Hub.
The `upload_to_huggingface.py` script automatically converts Jojo checkpoints to Hugging Face Transformers format:
```bash
# Basic upload (public repository)
python upload_to_huggingface.py models/my_model.pt \
    --repo-name my-jojo-model \
    --dataset story

# Upload to organization (private repository)
python upload_to_huggingface.py models/my_model.pt \
    --repo-name my-jojo-model \
    --dataset chitchat \
    --organization my-org \
    --private

# Dry run (convert but don't upload)
python upload_to_huggingface.py models/my_model.pt \
    --repo-name my-jojo-model \
    --dataset story \
    --dry-run
```
- Automatic Conversion: Converts Jojo checkpoints to standard Hugging Face format
- Model Card Generation: Creates comprehensive model cards with training details, metrics, and usage examples
- Tokenizer Integration: Includes proper tokenizer configuration (standard GPT-2 or extended)
- Metadata Preservation: Preserves training configuration, metrics, and model architecture details
- Repository Management: Creates repositories and handles uploads automatically
- Safety Options: Dry-run mode for testing conversion without uploading
- `checkpoint`: Path to Jojo checkpoint file (required)
- `--repo-name`: Name for the Hugging Face repository (required)
- `--dataset`: Name of training dataset for documentation (required)
- `--organization`: Hugging Face organization (optional)
- `--private`: Create private repository (default: public)
- `--tokenizer`: Tokenizer type (`gpt2` or `extended`, default: `gpt2`)
- `--dry-run`: Convert model but don't upload
- `--output-dir`: Local directory for converted model (default: `./hf_model`)
- `--commit-message`: Custom commit message for upload
Install Hugging Face libraries:

```bash
pip install transformers huggingface_hub
```

Authenticate with Hugging Face:

```bash
huggingface-cli login
```
Once uploaded, models can be used with standard Hugging Face tools:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load your uploaded model
model = GPT2LMHeadModel.from_pretrained("username/my-jojo-model")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate text
input_text = "Once upon a time"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=100, temperature=0.7)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
If you're upgrading from the original training script, see `improvements_guide.py` for a comprehensive migration guide:

```bash
python improvements_guide.py
```
The new system provides:
- 20-40% faster training through pre-tokenization
- Enhanced progress tracking with MFU and ETA
- Robust configuration management
- Improved error handling and graceful shutdown
- Better checkpoint management with metadata
- Modern Architecture: The training system has been completely refactored with a modular design for better maintainability and performance.
- Smart Caching: Pre-tokenized datasets are cached automatically for 20-40% faster subsequent training runs.
- Robust Checkpointing: Automatic checkpoint saving with graceful shutdown handling (Ctrl+C saves progress). Configurable batch-interval checkpointing for long training runs.
- Configuration Management: Use JSON configuration files for reproducible training setups. Display all settings with `--show_config`.
- Enhanced Monitoring: Real-time progress bars show ETA, samples/sec, MFU, loss, and learning rate. Automatic loss curve plotting.
- PyTorch Optimizations: Full support for TF32, pin_memory, memory_fraction, and other modern PyTorch optimizations.
- Device Management: Automatic GPU selection based on available memory, with manual override options.
- Resume Training: Seamlessly continue training from any checkpoint with preserved or reset epoch counters.
- MFU Analysis: Built-in Model FLOPs Utilization tracking and optimization tools for performance tuning (see the sketch after this list).
- PyTorch 2.6+ Compatibility: Full compatibility with latest PyTorch versions.
- Hugging Face Integration: Professional model conversion and upload to Hugging Face Hub with automatic model cards.
- Backward Compatibility: Original training script preserved as `train_old.py` for reference.
- All dependencies are listed in `requirements.txt`.
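For reference, MFU is defined as the model FLOPs actually achieved per second divided by the hardware's peak FLOPs per second. Below is a sketch of the widely used estimate (as in the PaLM paper and nanoGPT-style trainers); the project's exact accounting may differ, and the peak-FLOPs figure must match your GPU and dtype.

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, block_size,
                 tokens_per_iter, dt_seconds, peak_flops):
    """Estimate Model FLOPs Utilization for one training iteration.

    peak_flops is hardware-specific, e.g. the GPU's advertised dense
    bf16/fp16 tensor throughput in FLOPs per second.
    """
    # ~6 FLOPs per parameter per token (forward + backward), plus the
    # attention term that scales with sequence length.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_achieved = flops_per_token * tokens_per_iter / dt_seconds
    return flops_achieved / peak_flops
```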
For detailed information about improvements and migration, run:
```bash
python improvements_guide.py
```