A simple example of implementing a Generative Pre-trained Transformer (GPT) with PyTorch.
The entire model code is taken from Andrej Karpathy's video course Neural Networks: Zero to Hero, forked from https://github.com/karpathy/ng-video-lecture
The original contents of the README.md file are preserved at the bottom of this document.
This repository differs from the original in the following ways:
- Support loading multiple files
- Store model to disk during and after training
- Put code used for training and inference into separate scripts
- Partially added support for training at fp16 precision for decreased memory usage (source)
- Added code to filter input data by removing characters that don't occur very often (e.g. Chinese characters from English Wikipedia dumps)
- Store model parameters and the preprocessed dataset in a pickle file and load them, if available (makes loading the data around 100x faster)
- Support temperature control when generating text
- Support for continuing interrupted training (call train.py with the parameter "continue"):
python train.py continue
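The pickle cache mentioned above can be sketched roughly as follows. The function and file names here are illustrative assumptions, not the repo's actual ones; the point is that re-reading and re-encoding the raw text is skipped on every run after the first.

```python
import os
import pickle

def load_dataset(text_files, cache_path="input/cache.pkl"):
    """Load and encode the corpus, using a pickle cache when available.
    (Hypothetical sketch; names do not match the repo's code.)"""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # skip re-reading and re-encoding everything

    text = ""
    for path in text_files:
        with open(path, "r", encoding="utf-8") as f:
            text += f.read()

    chars = sorted(set(text))                # character-level vocabulary
    stoi = {ch: i for i, ch in enumerate(chars)}
    data = [stoi[ch] for ch in text]         # encode the full corpus as indices

    with open(cache_path, "wb") as f:
        pickle.dump((chars, data), f)        # cache for the next run
    return chars, data
```

Loading the small pickle is much faster than scanning and encoding hundreds of megabytes of text, which is where the roughly 100x speedup comes from.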
Many improvements could be made to the saving/loading of models, data, filters and parameters, but as this is just a research project, the code is meant to stay simple. Besides, dozens of highly optimized implementations of the demonstrated techniques already exist.
conda create -n simple-gpt python=3.9
conda activate simple-gpt
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
Also, CUDA 11.7 or higher must be installed and the CUDA_PATH environment variable must point to the corresponding directory, e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8
Files are loaded from /input. Each file is split into 90% training data and 10% test data.
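A minimal sketch of that per-file split (names and the directory layout are assumptions; splitting each file separately keeps a slice of every source in both sets):

```python
import os

def load_and_split(input_dir="input", train_frac=0.9):
    """Read every file in input_dir and split each into train/val parts.
    (Illustrative sketch, not the repo's exact code.)"""
    train_parts, val_parts = [], []
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name), "r", encoding="utf-8") as f:
            text = f.read()
        n = int(train_frac * len(text))
        train_parts.append(text[:n])   # first 90% goes to training
        val_parts.append(text[n:])     # last 10% goes to evaluation
    return "".join(train_parts), "".join(val_parts)
```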
To start training, execute train.py
python train.py
The resulting model will be saved to /models/model/model-last.pt after every evaluation.
Additionally, the evaluation results will be used to save the model to model-best-train.pt and model-best-val.pt.
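The checkpointing logic can be sketched like this, with a stand-in `save_fn` in place of the actual torch.save call (function and dictionary names are assumptions for illustration):

```python
import math

def update_checkpoints(train_loss, val_loss, best, save_fn):
    """Decide which checkpoint files to (re)write after an evaluation.
    `best` holds the lowest losses seen so far. (Illustrative sketch.)"""
    save_fn("model-last.pt")                      # always overwrite the latest
    if train_loss < best.get("train", math.inf):
        best["train"] = train_loss
        save_fn("model-best-train.pt")            # new best training loss
    if val_loss < best.get("val", math.inf):
        best["val"] = val_loss
        save_fn("model-best-val.pt")              # new best validation loss
    return best
```

Keeping separate best-train and best-val checkpoints makes it easy to spot overfitting: once the two diverge, model-best-val.pt is the one worth generating from.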
To perform sentence completion using the trained network, run
python generate.py
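The temperature control mentioned in the feature list works by scaling the model's logits before the softmax. A dependency-free sketch of the idea (the repo applies the same scaling to the network's output logits; the function name here is illustrative):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Sample a token index from raw logits scaled by temperature.
    Temperature < 1 sharpens the distribution (more conservative text);
    temperature > 1 flattens it (more diverse, riskier text)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]             # softmax over scaled logits
    return random.choices(range(len(probs)), weights=probs, k=1)[0]
```

As the temperature approaches zero, sampling degenerates into always picking the most likely token (greedy decoding).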
Finding the right hyperparameters for the available training setup and dataset is crucial for efficiency and effectiveness. I personally found the following setup to work decently on an RTX 3080 Ti, training on about 50 MB of mostly topic-specific Wikipedia dumps (containing 149 distinct characters):
batch_size = 16
block_size = 256
n_embd = 720
n_head = 18
n_layer = 18
dropout = 0.2
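A back-of-the-envelope parameter count for this configuration, assuming the standard GPT block layout (Q/K/V plus output projection contribute about 4·n_embd² per layer, the 4x-expanded MLP about 8·n_embd², ignoring biases and LayerNorms). This is a rough estimate, not the repo's exact count:

```python
def count_params(n_layer, n_embd, vocab_size, block_size):
    """Rough GPT parameter estimate (illustrative, ignores biases/norms)."""
    per_layer = 12 * n_embd ** 2                      # attention + MLP weights
    embeddings = (vocab_size + block_size) * n_embd   # token + position tables
    lm_head = vocab_size * n_embd                     # final projection
    return n_layer * per_layer + embeddings + lm_head

# With the settings above and a vocabulary of 149 characters:
print(count_params(18, 720, 149, 256))  # roughly 112 million parameters
```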
The parameters at the head of gpt.py are used in the lecture and are optimized (?) for the Shakespeare texts.
To overcome VRAM limits, try lowering the batch size. This means the network will converge more slowly, but that can be overcome with patience.
The included input text files contain about 1 MB of Shakespeare's works.
This repository contains a few preprocessed text files from relatively recent Wikipedia dumps (01/23), all in UTF-8. To keep the network at a reasonable size, filtering out obscure characters is mandatory.
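Such a frequency-based filter can be sketched as follows. The threshold and the function name are illustrative assumptions; the idea is that every surviving character becomes an entry in the embedding and output layers, so dropping stray glyphs directly shrinks the model:

```python
from collections import Counter

def filter_rare_chars(text, min_count=100):
    """Drop characters that appear fewer than min_count times, e.g. stray
    CJK glyphs in an otherwise English dump. (Illustrative sketch.)"""
    counts = Counter(text)
    keep = {ch for ch, c in counts.items() if c >= min_count}
    return "".join(ch for ch in text if ch in keep)
```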
Code created in the Neural Networks: Zero To Hero video lecture series, specifically on the first lecture on nanoGPT. Publishing here as a Github repo so people can easily hack it, walk through the git log history of it, etc.
NOTE: sadly I did not go too much into model initialization in the video lecture, but it is quite important for good performance. The current code will train and work fine, but its convergence is slower because it starts off in a not great spot in the weight space. Please see nanoGPT model.py for # init all weights comment, and especially how it calls the _init_weights function. Even more sadly, the code in this repo is a bit different in how it names and stores the various modules, so it's not possible to directly copy paste this code here. My current plan is to publish a supplementary video lecture and cover these parts, then I will also push the exact code changes to this repo. For now I'm keeping it as is so it is almost exactly what we actually covered in the video.
MIT