A simple example of implementing a Generative Pre-trained Transformer (GPT) with PyTorch.
The entire model code is taken from Andrej Karpathy's video course Neural Networks: Zero to Hero, forked from https://github.com/karpathy/ng-video-lecture
The original contents of the README.md file are preserved at the bottom of this document.
This repository differs from the original in the following ways:
- Support loading multiple files
- Store model to disk during and after training
- Put code used for training and inference into separate scripts
- Partially added support for training at fp16 precision for decreased memory usage (source)
- Added code to filter input data by removing characters that don't occur very often (e.g. Chinese characters from English Wikipedia dumps)
- Store model parameters and the preprocessed dataset in a pickle file and load them, if available (makes loading the data around 100x faster)
- Support temperature control when generating text
- Support for continuing interrupted training (call train.py with the parameter "continue"):
python train.py continue
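The pickle cache mentioned above can be sketched roughly as follows. The function and file names here are illustrative assumptions, not the repo's actual ones; the point is that re-reading and re-encoding the raw text is skipped on every run after the first.

```python
import os
import pickle

def load_dataset(text_files, cache_path="input/cache.pkl"):
    """Load and encode the corpus, using a pickle cache when available.
    (Hypothetical sketch; names do not match the repo's code.)"""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # skip re-reading and re-encoding everything

    text = ""
    for path in text_files:
        with open(path, "r", encoding="utf-8") as f:
            text += f.read()

    chars = sorted(set(text))                # character-level vocabulary
    stoi = {ch: i for i, ch in enumerate(chars)}
    data = [stoi[ch] for ch in text]         # encode the full corpus as indices

    with open(cache_path, "wb") as f:
        pickle.dump((chars, data), f)        # cache for the next run
    return chars, data
```

Loading the small pickle is much faster than scanning and encoding hundreds of megabytes of text, which is where the roughly 100x speedup comes from.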
Many improvements could be made to the saving/loading of models, data, filters and parameters, but as this is just a research project, the code is meant to stay simple. Besides, dozens of highly optimized implementations of the demonstrated techniques already exist.
conda create -n simple-gpt python=3.9
conda activate simple-gpt
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
Also, CUDA 11.7 or higher must be installed and the CUDA_PATH environment variable must point to the corresponding directory, e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8
Files are loaded from /input. Each file is split into 90% training data and 10% test data.
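A minimal sketch of that per-file split (names and the directory layout are assumptions; splitting each file separately keeps a slice of every source in both sets):

```python
import os

def load_and_split(input_dir="input", train_frac=0.9):
    """Read every file in input_dir and split each into train/val parts.
    (Illustrative sketch, not the repo's exact code.)"""
    train_parts, val_parts = [], []
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name), "r", encoding="utf-8") as f:
            text = f.read()
        n = int(train_frac * len(text))
        train_parts.append(text[:n])   # first 90% goes to training
        val_parts.append(text[n:])     # last 10% goes to evaluation
    return "".join(train_parts), "".join(val_parts)
```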
To start training, execute train.py
python train.py
The resulting model will be saved to /models/model/model-last.pt after every evaluation.
Additionally, the evaluation results will be used to save the model to model-best-train.pt and model-best-val.pt.
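The checkpointing logic can be sketched like this, with a stand-in `save_fn` in place of the actual torch.save call (function and dictionary names are assumptions for illustration):

```python
import math

def update_checkpoints(train_loss, val_loss, best, save_fn):
    """Decide which checkpoint files to (re)write after an evaluation.
    `best` holds the lowest losses seen so far. (Illustrative sketch.)"""
    save_fn("model-last.pt")                      # always overwrite the latest
    if train_loss < best.get("train", math.inf):
        best["train"] = train_loss
        save_fn("model-best-train.pt")            # new best training loss
    if val_loss < best.get("val", math.inf):
        best["val"] = val_loss
        save_fn("model-best-val.pt")              # new best validation loss
    return best
```

Keeping separate best-train and best-val checkpoints makes it easy to spot overfitting: once the two diverge, model-best-val.pt is the one worth generating from.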
To perform sentence completion using the trained network, run
python generate.py
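The temperature control mentioned in the feature list works by scaling the model's logits before the softmax. A dependency-free sketch of the idea (the repo applies the same scaling to the network's output logits; the function name here is illustrative):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Sample a token index from raw logits scaled by temperature.
    Temperature < 1 sharpens the distribution (more conservative text);
    temperature > 1 flattens it (more diverse, riskier text)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]             # softmax over scaled logits
    return random.choices(range(len(probs)), weights=probs, k=1)[0]
```

As the temperature approaches zero, sampling degenerates into always picking the most likely token (greedy decoding).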
Finding the right hyperparameters for the available training setup and dataset is crucial for efficiency and effectiveness. I personally found the following setup to work decently on an RTX 3080 Ti, training on about 50 MB of mostly topic-specific Wikipedia dumps (containing 149 distinct characters):
batch_size = 16
block_size = 256
n_embd = 720
n_head = 18
n_layer = 18
dropout = 0.2
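A back-of-the-envelope parameter count for this configuration, assuming the standard GPT block layout (Q/K/V plus output projection contribute about 4·n_embd² per layer, the 4x-expanded MLP about 8·n_embd², ignoring biases and LayerNorms). This is a rough estimate, not the repo's exact count:

```python
def count_params(n_layer, n_embd, vocab_size, block_size):
    """Rough GPT parameter estimate (illustrative, ignores biases/norms)."""
    per_layer = 12 * n_embd ** 2                      # attention + MLP weights
    embeddings = (vocab_size + block_size) * n_embd   # token + position tables
    lm_head = vocab_size * n_embd                     # final projection
    return n_layer * per_layer + embeddings + lm_head

# With the settings above and a vocabulary of 149 characters:
print(count_params(18, 720, 149, 256))  # roughly 112 million parameters
```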
The parameters at the head of gpt.py are used in the lecture and are optimized (?) for the Shakespeare texts.
To overcome VRAM limits, try lowering the batch size. This means the network will converge more slowly, but that can be overcome with patience.
The included input text files contain about 1 MB of Shakespeare's works.
This repository contains a few preprocessed text files from relatively recent Wikipedia dumps (01/23), all in UTF-8. To keep the network at a reasonable size, filtering out obscure characters is mandatory.
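Such a frequency-based filter can be sketched as follows. The threshold and the function name are illustrative assumptions; the idea is that every surviving character becomes an entry in the embedding and output layers, so dropping stray glyphs directly shrinks the model:

```python
from collections import Counter

def filter_rare_chars(text, min_count=100):
    """Drop characters that appear fewer than min_count times, e.g. stray
    CJK glyphs in an otherwise English dump. (Illustrative sketch.)"""
    counts = Counter(text)
    keep = {ch for ch, c in counts.items() if c >= min_count}
    return "".join(ch for ch in text if ch in keep)
```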
Code created in the Neural Networks: Zero To Hero video lecture series, specifically on the first lecture on nanoGPT. Publishing here as a Github repo so people can easily hack it, walk through the git log history of it, etc.
NOTE: sadly I did not go too much into model initialization in the video lecture, but it is quite important for good performance. The current code will train and work fine, but its convergence is slower because it starts off in a not great spot in the weight space. Please see nanoGPT model.py for # init all weights comment, and especially how it calls the _init_weights function. Even more sadly, the code in this repo is a bit different in how it names and stores the various modules, so it's not possible to directly copy paste this code here. My current plan is to publish a supplementary video lecture and cover these parts, then I will also push the exact code changes to this repo. For now I'm keeping it as is so it is almost exactly what we actually covered in the video.
MIT