This project provides a modular implementation of the GPT-2 language model, allowing for easy training and inference. It includes a configurable model architecture, data loading utilities, and scripts for both training and text generation. A few modern optimizations that were not in the original GPT-2 paper but are commonly used in practice are also included:
- The use of `F.scaled_dot_product_attention` with `is_causal=True` for efficient attention computation.
- Some initialization tweaks, like the `NANOGPT_SCALE_INIT` attribute.
These optimizations don't change the fundamental architecture but can improve training efficiency.
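The snippet below is a minimal sketch of both ideas; the exact handling (including how `NANOGPT_SCALE_INIT` is attached to modules) lives in `src/model.py`, and the specific numbers here are assumptions for illustration:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fused, causal attention: is_causal=True builds the lower-triangular mask
# internally, so no explicit mask buffer is needed.
q = k = v = torch.randn(1, 12, 8, 64)  # (batch, heads, seq_len, head_dim), toy values
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# NANOGPT_SCALE_INIT-style idea: scale the init std of residual-path projections
# by 1/sqrt(2 * n_layer) so activations do not grow with depth.
# (Illustrative only; n_layer=12 and std=0.02 are assumed GPT-2 small defaults.)
n_layer = 12
c_proj = nn.Linear(768, 768)
nn.init.normal_(c_proj.weight, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))
```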
The GPT-2 model implemented in this project follows the architecture described in the original paper "Language Models are Unsupervised Multitask Learners" by Radford et al. The key components of the model are:
- Token and Positional Embeddings: Convert input tokens into embeddings and add positional information (see the sketch after this list).
- Transformer Blocks: A series of blocks, each containing:
  - Multi-Head Attention: Allows the model to attend to different parts of the input sequence.
  - Layer Normalization: Applied before the attention and feed-forward sub-layers (GPT-2 uses pre-norm).
  - Feed-Forward Neural Network: Processes the attention output.
- Final Layer Normalization: Applied after the last transformer block.
- Language Model Head: A linear layer that projects the final hidden states to vocabulary-sized logits.
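The embedding step, for example, amounts to two lookup tables whose outputs are summed. A minimal sketch, with GPT-2 small dimensions taken from the default config below (the names `wte`/`wpe` follow GPT-2 convention and may differ from `src/model.py`):

```python
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 50257, 1024, 768  # GPT-2 small sizes (from the config below)

wte = nn.Embedding(vocab_size, n_embd)  # token embedding table
wpe = nn.Embedding(block_size, n_embd)  # positional embedding table

idx = torch.randint(0, vocab_size, (1, 8))  # a batch with 8 token ids
pos = torch.arange(idx.size(1))             # positions 0..7
x = wte(idx) + wpe(pos)                     # (1, 8, n_embd): input to the transformer blocks
```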
The model uses the following key classes (a condensed sketch of how they fit together follows the list):

- `GPT2`: The main model class that combines all components.
- `Block`: Represents a single transformer block.
- `CausalSelfAttention`: Implements the multi-head self-attention mechanism with causal masking.
- `MLP`: The feed-forward neural network used in each block.
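The sketch below shows how `Block`, `CausalSelfAttention`, and `MLP` typically compose in a pre-norm GPT-2 block; it mirrors the class names above, but the real definitions live in `src/model.py` and may differ in detail:

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused query/key/value projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) for multi-head attention
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x):
        # pre-norm residual connections around attention and the MLP
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```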
project/
├── config/
│ └── default_config.yaml
├── src/
│ ├── __init__.py
│ ├── model.py
│ ├── data_loader.py
│ ├── train.py
│ ├── inference.py
│ └── utils.py
├── main.py
├── requirements.txt
└── README.md
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/gpt2-implementation.git
  cd gpt2-implementation
  ```

- Create a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use venv\Scripts\activate
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
To train the model:
```bash
python main.py --config config/default_config.yaml --mode train
```
This will start the training process using the settings specified in the config file. The script will log training progress and save model checkpoints periodically.
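Conceptually, the loop behind `--mode train` looks something like the sketch below. This is not the code in `src/train.py`; it assumes the model returns `(logits, loss)` when targets are passed and that the data loader yields `(x, y)` batches, and the config keys mirror the example configuration further down:

```python
import os
import torch

def train_loop(model, data_loader, config):
    # Sketch of a minimal training loop with periodic logging and checkpointing.
    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=float(config['training']['learning_rate']))
    model.train()
    for step, (x, y) in enumerate(data_loader):
        optimizer.zero_grad()
        logits, loss = model(x, y)  # assumed model API: returns loss when targets are given
        loss.backward()
        optimizer.step()
        if step % config['logging']['log_interval'] == 0:
            print(f"step {step}: loss {loss.item():.4f}")
        if step > 0 and step % config['logging']['save_interval'] == 0:
            ckpt = os.path.join(config['logging']['model_save_path'], f"step_{step}.pt")
            torch.save(model.state_dict(), ckpt)
```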
To generate text using a trained model:
```bash
python main.py --config config/default_config.yaml --mode inference --prompt "Your prompt here"
```
Replace "Your prompt here" with the text you want to use as a starting point for generation.
The `config/default_config.yaml` file contains all the configurable parameters for the model and training process. You can modify this file to change:
- Model architecture (e.g., number of layers, embedding size)
- Training settings (e.g., batch size, learning rate)
- Data source
- Logging and checkpoint saving frequency
Here's an example of the configuration structure:
```yaml
model:
  block_size: 1024
  vocab_size: 50257
  n_layer: 12
  n_head: 12
  n_embd: 768

training:
  num_epochs: 50
  batch_size: 4
  sequence_length: 32
  learning_rate: 3e-4
  device: 'cuda'

data:
  input_file: 'input.txt'

logging:
  log_interval: 10
  save_interval: 1000
  model_save_path: 'checkpoints/'
```
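For reference, a sketch of how such a file can be read into a plain dictionary (assuming PyYAML, which is presumably listed in `requirements.txt`):

```python
import yaml

with open('config/default_config.yaml') as f:
    config = yaml.safe_load(f)

print(config['model']['n_layer'])        # -> 12
print(config['training']['batch_size'])  # -> 4
```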
Yalala Mohit