The secret of life is not just about finding a place but about finding joy in the journey.
— minigpt (10M) completing the phrase "The secret of life is"
This repo implements all the basic components needed to train a toy Transformer language model on a single GPU:
- Byte-pair encoding (BPE) tokenization
- Decoder-only Transformer with RoPE [1] positional embeddings
- A basic training loop with mixed-precision training and checkpointing
- Sampling/decoding functions to generate text from a trained LM.
All configurations are stored in `conf/` and read using Hydra.
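For reference, a Hydra-based entry point reads the composed configuration roughly like this. This is a minimal sketch: the config file name and keys below are assumptions, not the repo's exact layout.

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config", version_base=None)  # config_name is an assumption
def main(cfg: DictConfig) -> None:
    # Print the fully composed configuration (top-level groups such as tokenization/model/training).
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```

Individual values can also be overridden on the command line (e.g. something like `uv run train.py model=10M`, assuming the model config files are named after the model sizes).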
Checkpoints trained on SimpleStories [2] are available on the Releases page.
The provided default configuration sets up hyperparameters to train a tiny LM (~10M parameters) on SimpleStories, a text dataset that is similar in spirit to TinyStories [3] but provides more varied data.
The script `get_simplestories.py` pulls the dataset's training split from Hugging Face and separates it into training and validation sets.
It is strongly recommended to run it with `uv`, which installs all dependencies automatically:
```bash
uv run get_simplestories.py
```

This will write the train/validation corpora to the paths specified in the top-level config (`train_corpus_path` / `val_corpus_path`).
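A rough sketch of what such a script might do, using the `datasets` library. The Hugging Face dataset identifier, text column, and output paths below are assumptions; check `get_simplestories.py` and the config for the actual values.

```python
from datasets import load_dataset

# Assumed dataset id and text column; adjust to match the real script.
ds = load_dataset("SimpleStories/SimpleStories", split="train")
split = ds.train_test_split(test_size=0.01, seed=0)  # hold out a small validation slice

with open("data/train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(split["train"]["story"]))
with open("data/valid.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(split["test"]["story"]))
```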
We can now run the main training script, which will:
- Train a BPE tokenizer (a toy illustration of BPE merging is sketched below) and save the tokenized corpora to disk (see `tokenization` in the configuration);
- Instantiate a Transformer LM (see `model`);
- Train the LM for a specified number of steps (see `training`). Periodically, the model's loss on the validation set is evaluated and a checkpoint is saved.
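As a toy illustration of what the BPE step does (not the repo's actual implementation, which works on the raw corpus and also handles serialization), here is a minimal word-level merge loop:

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges by repeatedly merging the most frequent adjacent symbol pair."""
    # Represent each word as a sequence of symbols, starting from single characters.
    words = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the (frequency-weighted) vocabulary.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```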
To start training, simply run

```bash
uv run train.py
```

The training process can be tracked using TensorBoard.
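The training loop itself follows the standard mixed-precision pattern. A minimal sketch, assuming a PyTorch model and a data loader yielding `(inputs, targets)` token batches; names, hyperparameters, and the checkpoint path are illustrative, not the repo's exact code:

```python
import torch
import torch.nn.functional as F

def train(model, loader, num_steps, device="cuda", ckpt_path="checkpoints/model.pt"):
    """Minimal mixed-precision training loop with periodic checkpointing."""
    model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scaler = torch.cuda.amp.GradScaler()  # scales the loss so fp16 gradients do not underflow
    for step, (inputs, targets) in zip(range(num_steps), loader):
        opt.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            logits = model(inputs.to(device))                       # (batch, seq, vocab)
            loss = F.cross_entropy(logits.flatten(0, 1), targets.to(device).flatten())
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        if step % 1000 == 0:
            torch.save({"step": step, "model": model.state_dict(), "opt": opt.state_dict()}, ckpt_path)
```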
Once the model has been trained to convergence, we can use it to generate text by sampling tokens autoregressively, optionally from a starting prompt.
See the generate_text.ipynb notebook
for some examples.
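Sampling itself is a short loop: run the current token sequence through the model, turn the last position's logits into a distribution, draw the next token, and append it. A minimal sketch with temperature and top-k truncation; the `encode`/`decode` tokenizer methods are assumed names:

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=200, temperature=0.8, top_k=50, n_ctx=256):
    """Autoregressively sample tokens from a trained LM, seeded with a prompt."""
    tokens = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)  # assumed tokenizer API
    for _ in range(max_new_tokens):
        logits = model(tokens[:, -n_ctx:])[:, -1, :] / temperature  # last position, within the context window
        if top_k is not None:
            kth_value = torch.topk(logits, top_k).values[:, -1, None]
            logits = logits.masked_fill(logits < kth_value, float("-inf"))  # keep only the top-k logits
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokenizer.decode(tokens[0].tolist())
```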
Model checkpoints can be exported to the TensorFlow SavedModel format:
```bash
uv run export_to_tf.py
```

See the `exporting` configuration to change the checkpoint / output path.
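Under the hood, a SavedModel export boils down to wrapping a forward function in a `tf.Module` (or Keras model) and calling `tf.saved_model.save`. A self-contained sketch of just the export/reload mechanics with a stand-in module (the real script additionally has to port the trained weights):

```python
import tensorflow as tf

class DummyLM(tf.Module):
    """Stand-in for the exported LM: maps token ids to logits."""
    def __init__(self, d_vocab=10_000, d_model=256):
        super().__init__()
        self.embed = tf.Variable(tf.random.normal([d_vocab, d_model]), name="embed")
        self.unembed = tf.Variable(tf.random.normal([d_model, d_vocab]), name="unembed")

    @tf.function(input_signature=[tf.TensorSpec([None, None], tf.int32)])
    def __call__(self, tokens):
        return tf.gather(self.embed, tokens) @ self.unembed  # (batch, seq, d_vocab)

model = DummyLM()
tf.saved_model.save(model, "exported_model")
reloaded = tf.saved_model.load("exported_model")
print(reloaded(tf.constant([[1, 2, 3]])).shape)
```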
To train on TinyStories instead, simply overwrite the training and validation corpora:
```bash
wget -O data/train.txt https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget -O data/valid.txt https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
```

If a training run was previously started with another dataset, it is necessary to remove the tokenizer and encoded data so that the BPE tokenizer is re-trained:
```bash
rm tokenizer.pkl
rm -r data/tokenized/
uv run train.py  # Start a new training run from scratch
```

The implemented model applies the following modifications to the original Transformer [4] (a sketch of a single block is shown after the list):
- Rotary Position Embeddings (RoPE) [1];
- RMSNorm [5] instead of LayerNorm;
- Applying normalization before the attention and feed-forward blocks (pre-norm) rather than after;
- SwiGLU [6] feed-forward networks instead of ReLU MLPs;
- Untied input and output embeddings.
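A condensed sketch of a single block with these choices (pre-norm, RMSNorm, SwiGLU, bias-free projections). RoPE is omitted for brevity and this is not the repo's exact implementation:

```python
import torch
from torch import nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root mean square of the activations, with no mean subtraction."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x)), all projections bias-free."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Block(nn.Module):
    """Pre-norm block: x + Attn(Norm(x)), then x + FFN(Norm(x))."""
    def __init__(self, d_model, d_ff, n_heads):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(d_model), RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x, causal_mask=None):
        # causal_mask should be a causal attention mask for language modelling.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))
```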
In `conf/model`, three model architectures are specified.
A pre-trained checkpoint for each model size is provided on the Releases page.
| Model name | Total parameter count | n_layers | d_model | d_ff | n_heads | n_ctx | d_vocab |
|---|---|---|---|---|---|---|---|
| 3M | 3,413,120 | 4 | 128 | 384 | 4 | 256 | 10,000 |
| 10M | 9,940,224 | 6 | 256 | 704 | 4 | 256 | 10,000 |
| 29M | 28,924,416 | 6 | 512 | 1344 | 16 | 256 | 10,000 |
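The totals in the table are consistent with a simple accounting: untied input/output embeddings, four bias-free `d_model × d_model` attention projections and a three-matrix SwiGLU per layer, two RMSNorm gains per layer plus a final one, and no learned positional parameters (RoPE). A quick check, under those assumptions:

```python
def n_params(n_layers, d_model, d_ff, d_vocab):
    """Parameter count assuming bias-free linears, untied embeddings, and RMSNorm gains only."""
    embeddings = 2 * d_vocab * d_model    # untied input + output embeddings
    attention = 4 * d_model * d_model     # Q, K, V and output projections
    feed_forward = 3 * d_model * d_ff     # SwiGLU: gate, up and down matrices
    norms = 2 * d_model                   # two RMSNorm gains per block
    return embeddings + n_layers * (attention + feed_forward + norms) + d_model  # + final norm

print(n_params(4, 128, 384, 10_000))    # 3,413,120
print(n_params(6, 256, 704, 10_000))    # 9,940,224
print(n_params(6, 512, 1344, 10_000))   # 28,924,416
```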
Training curves for each of the three model sizes are shown below.
[1] Su, Jianlin, et al. "RoFormer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
[2] Finke, Lennart, et al. "Parameterized Synthetic Text Generation with SimpleStories." arXiv preprint arXiv:2504.09184 (2025).
[3] Eldan, Ronen, and Yuanzhi Li. "TinyStories: How small can language models be and still speak coherent English?" arXiv preprint arXiv:2305.07759 (2023).
[4] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[5] Zhang, Biao, and Rico Sennrich. "Root mean square layer normalization." Advances in neural information processing systems 32 (2019).
[6] Shazeer, Noam. "GLU variants improve transformer." arXiv preprint arXiv:2002.05202 (2020).
