The secret of life is not just about finding a place but about finding joy in the journey.
— minigpt (10M) completing the phrase "The secret of life is"
This repo implements all the basic components needed to train a toy Transformer language model on a single GPU:
- Byte-pair encoding (BPE) tokenization
- Decoder-only Transformer with RoPE [1] positional embeddings
- A basic training loop with mixed-precision training and checkpointing
- Sampling/decoding functions to generate text from a trained LM.
All configurations are stored in `conf/` and read using Hydra.
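For reference, a Hydra-based entry point reads the composed configuration roughly like this. This is a minimal sketch: the config file name and keys below are assumptions, not the repo's exact layout.

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config", version_base=None)  # config_name is an assumption
def main(cfg: DictConfig) -> None:
    # Print the fully composed configuration (top-level groups such as tokenization/model/training).
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```

Individual values can also be overridden on the command line (e.g. something like `uv run train.py model=10M`, assuming the model config files are named after the model sizes).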
Checkpoints trained on SimpleStories [2] are available on the Releases page.
The provided default configuration sets up hyperparameters to train a tiny LM (~10M parameters) on SimpleStories, a text dataset that is similar in spirit to TinyStories [3] but provides more varied data.
The script `get_simplestories.py` pulls the dataset's training split from Hugging Face and separates it into training and validation sets.
It is strongly recommended to run it with `uv`, which installs all dependencies automatically:
```bash
uv run get_simplestories.py
```

This will write the train/validation corpora to the paths specified in the top-level config (`train_corpus_path` / `val_corpus_path`).
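A rough sketch of what such a script might do, using the `datasets` library. The Hugging Face dataset identifier, text column, and output paths below are assumptions; check `get_simplestories.py` and the config for the actual values.

```python
from datasets import load_dataset

# Assumed dataset id and text column; adjust to match the real script.
ds = load_dataset("SimpleStories/SimpleStories", split="train")
split = ds.train_test_split(test_size=0.01, seed=0)  # hold out a small validation slice

with open("data/train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(split["train"]["story"]))
with open("data/valid.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(split["test"]["story"]))
```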
We can now run the main training script, which will:
- Train a BPE tokenizer (a toy illustration of BPE merging is sketched below) and save the tokenized corpora to disk (see `tokenization` in the configuration);
- Instantiate a Transformer LM (see `model`);
- Train the LM for a specified number of steps (see `training`). Periodically, the model's loss on the validation set is evaluated and a checkpoint is saved.
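As a toy illustration of what the BPE step does (not the repo's actual implementation, which works on the raw corpus and also handles serialization), here is a minimal word-level merge loop:

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges by repeatedly merging the most frequent adjacent symbol pair."""
    # Represent each word as a sequence of symbols, starting from single characters.
    words = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the (frequency-weighted) vocabulary.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```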
To start training, simply run

```bash
uv run train.py
```

The training process can be tracked using TensorBoard.
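The training loop itself follows the standard mixed-precision pattern. A minimal sketch, assuming a PyTorch model and a data loader yielding `(inputs, targets)` token batches; names, hyperparameters, and the checkpoint path are illustrative, not the repo's exact code:

```python
import torch
import torch.nn.functional as F

def train(model, loader, num_steps, device="cuda", ckpt_path="checkpoints/model.pt"):
    """Minimal mixed-precision training loop with periodic checkpointing."""
    model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scaler = torch.cuda.amp.GradScaler()  # scales the loss so fp16 gradients do not underflow
    for step, (inputs, targets) in zip(range(num_steps), loader):
        opt.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            logits = model(inputs.to(device))                       # (batch, seq, vocab)
            loss = F.cross_entropy(logits.flatten(0, 1), targets.to(device).flatten())
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        if step % 1000 == 0:
            torch.save({"step": step, "model": model.state_dict(), "opt": opt.state_dict()}, ckpt_path)
```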
Once the model has been trained to convergence, we can use it to generate text by sampling tokens autoregressively, optionally from a starting prompt.
See the generate_text.ipynb notebook
for some examples.
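Sampling itself is a short loop: run the current token sequence through the model, turn the last position's logits into a distribution, draw the next token, and append it. A minimal sketch with temperature and top-k truncation; the `encode`/`decode` tokenizer methods are assumed names:

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=200, temperature=0.8, top_k=50, n_ctx=256):
    """Autoregressively sample tokens from a trained LM, seeded with a prompt."""
    tokens = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)  # assumed tokenizer API
    for _ in range(max_new_tokens):
        logits = model(tokens[:, -n_ctx:])[:, -1, :] / temperature  # last position, within the context window
        if top_k is not None:
            kth_value = torch.topk(logits, top_k).values[:, -1, None]
            logits = logits.masked_fill(logits < kth_value, float("-inf"))  # keep only the top-k logits
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokenizer.decode(tokens[0].tolist())
```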
Model checkpoints can be exported to the TensorFlow SavedModel format:
```bash
uv run export_to_tf.py
```

See the `exporting` configuration to change the checkpoint / output path.
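Under the hood, a SavedModel export boils down to wrapping a forward function in a `tf.Module` (or Keras model) and calling `tf.saved_model.save`. A self-contained sketch of just the export/reload mechanics with a stand-in module (the real script additionally has to port the trained weights):

```python
import tensorflow as tf

class DummyLM(tf.Module):
    """Stand-in for the exported LM: maps token ids to logits."""
    def __init__(self, d_vocab=10_000, d_model=256):
        super().__init__()
        self.embed = tf.Variable(tf.random.normal([d_vocab, d_model]), name="embed")
        self.unembed = tf.Variable(tf.random.normal([d_model, d_vocab]), name="unembed")

    @tf.function(input_signature=[tf.TensorSpec([None, None], tf.int32)])
    def __call__(self, tokens):
        return tf.gather(self.embed, tokens) @ self.unembed  # (batch, seq, d_vocab)

model = DummyLM()
tf.saved_model.save(model, "exported_model")
reloaded = tf.saved_model.load("exported_model")
print(reloaded(tf.constant([[1, 2, 3]])).shape)
```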
To train on TinyStories instead, simply overwrite the training and validation corpora:
```bash
wget -O data/train.txt https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget -O data/valid.txt https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
```

If a training run was previously started with another dataset, it is necessary to remove the tokenizer and encoded data so that the BPE tokenizer is re-trained:
```bash
rm tokenizer.pkl
rm -r data/tokenized/
uv run train.py  # Start a new training run from scratch
```

The implemented model applies the following modifications to the original Transformer [4] (a sketch of a single block is shown after the list):
- Rotary Position Embeddings (RoPE) [1];
- RMSNorm [5] instead of LayerNorm;
- Applying normalization before the attention and feed-forward blocks (pre-norm) rather than after;
- SwiGLU [6] feed-forward networks instead of ReLU MLPs;
- Untied input and output embeddings.
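A condensed sketch of a single block with these choices (pre-norm, RMSNorm, SwiGLU, bias-free projections). RoPE is omitted for brevity and this is not the repo's exact implementation:

```python
import torch
from torch import nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root mean square of the activations, with no mean subtraction."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x)), all projections bias-free."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Block(nn.Module):
    """Pre-norm block: x + Attn(Norm(x)), then x + FFN(Norm(x))."""
    def __init__(self, d_model, d_ff, n_heads):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(d_model), RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x, causal_mask=None):
        # causal_mask should be a causal attention mask for language modelling.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))
```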
In `conf/model`, three model architectures are specified.
A pre-trained checkpoint for each model size is provided on the Releases page.
| Model name | Total parameter count | n_layers | d_model | d_ff | n_heads | n_ctx | d_vocab |
|---|---|---|---|---|---|---|---|
| 3M | 3,413,120 | 4 | 128 | 384 | 4 | 256 | 10,000 |
| 10M | 9,940,224 | 6 | 256 | 704 | 4 | 256 | 10,000 |
| 29M | 28,924,416 | 6 | 512 | 1344 | 16 | 256 | 10,000 |
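The totals in the table are consistent with a simple accounting: untied input/output embeddings, four bias-free `d_model × d_model` attention projections and a three-matrix SwiGLU per layer, two RMSNorm gains per layer plus a final one, and no learned positional parameters (RoPE). A quick check, under those assumptions:

```python
def n_params(n_layers, d_model, d_ff, d_vocab):
    """Parameter count assuming bias-free linears, untied embeddings, and RMSNorm gains only."""
    embeddings = 2 * d_vocab * d_model    # untied input + output embeddings
    attention = 4 * d_model * d_model     # Q, K, V and output projections
    feed_forward = 3 * d_model * d_ff     # SwiGLU: gate, up and down matrices
    norms = 2 * d_model                   # two RMSNorm gains per block
    return embeddings + n_layers * (attention + feed_forward + norms) + d_model  # + final norm

print(n_params(4, 128, 384, 10_000))    # 3,413,120
print(n_params(6, 256, 704, 10_000))    # 9,940,224
print(n_params(6, 512, 1344, 10_000))   # 28,924,416
```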
Training curves for each of the three model sizes are shown below.
[1] Su, Jianlin, et al. "RoFormer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
[2] Finke, Lennart, et al. "Parameterized Synthetic Text Generation with SimpleStories." arXiv preprint arXiv:2504.09184 (2025).
[3] Eldan, Ronen, and Yuanzhi Li. "TinyStories: How small can language models be and still speak coherent English?" arXiv preprint arXiv:2305.07759 (2023).
[4] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[5] Zhang, Biao, and Rico Sennrich. "Root mean square layer normalization." Advances in neural information processing systems 32 (2019).
[6] Shazeer, Noam. "GLU variants improve transformer." arXiv preprint arXiv:2002.05202 (2020).
