Can you make your own language model---from scratch---in 20 minutes with just pen and paper? This repo contains a couple of software tools to help answer that question. (Spoiler: yep.)
The repo contains:
- a Typst file (`lm-grid.typ`) and instructions (`instructions.typ`) for creating a word co-occurrence matrix that can be used as a bigram language model
- a command-line tool (written in Rust) for processing a text corpus and typesetting it into an "N-gram model book", which can then be used for all sorts of fun things (e.g. a dice-roll-powered proto-ChatGPT)
Why? Because making a simple version of something by hand helps you better understand what's going on.
This is a Cybernetic Studio artefact by Ben Swift as part of the Human-Scale AI project.
You will need:

- Rust: install the Rust toolchain (including `cargo`) from https://rustup.rs/
- Typst: install Typst from https://github.com/typst/typst/
To generate the grid & instructions for the DIY version of this procedure, simply typeset the `lm-grid.typ` and `instructions.typ` files and you're done.
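For example, using the Typst CLI (the output filenames here are just a suggestion):

```sh
typst compile lm-grid.typ lm-grid.pdf
typst compile instructions.typ instructions.pdf
```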
To produce an N-gram book, read on...
The process involves two main steps: generating the N-gram statistics from your text file using the Rust program (writing them into a file called `model.json`), and then typesetting those statistics into a booklet using Typst.
- Build the Statistics Generator: Navigate to the project directory in your terminal and build the Rust executable:

  ```sh
  cargo build --release
  ```

  This will create an executable file at `target/release/my_first_lm`.
- Generate N-gram Statistics: Run the compiled executable, providing your input text file and optionally specifying an output JSON file:

  ```sh
  ./target/release/my_first_lm <input_text_file> [OPTIONS]
  ```

  Arguments:

  - `<input_text_file>`: Path to the text file you want to analyze.

  Options:

  - `-o, --output <output_json_file>`: Path where the generated N-gram statistics (in JSON format) will be saved. Defaults to `model.json`.
  - `-n, --n <N>`: The size of the N-gram (e.g. `2` for bigrams, `3` for trigrams). Defaults to `2`.
  - `--scale-d <D>`: Optional integer value `D` to control count scaling (see the sketch after this list).
    - If provided, and a prefix has a number of unique followers less than or equal to `D`, its follower counts are scaled to the range `[1, D]`. The total count for the prefix in the JSON will be `D`. This is useful for tailoring the output to the specific die you intend to use for the random sampling.
    - If a prefix has more unique followers than `D`, or if `--scale-d` is not provided, its follower counts are scaled to the range `[0, 10^k - 1]`, where `k` is the number of digits in the original total follower count for that prefix. The total count in the JSON becomes `10^k - 1`. To sample a next word from these prefixes, roll the correct amount of d10s (indicated by a ♢ next to the word) and read the rolls as the digits of a `k`-digit number.
  Example (generating d20-style scaled bigram statistics): let's say you have your text in `true-blue.txt` and your die of choice is a d20:

  ```sh
  ./target/release/my_first_lm true-blue.txt -o true-blue.json -n 2 --scale-d 20
  ```

  This command reads `true-blue.txt` and calculates bigram statistics. For prefixes with 20 or fewer unique followers, it scales counts to the range `[1, 20]`; for other prefixes, it uses the `[0, 10^k - 1]` scaling. The result is saved to `true-blue.json`.

  Or, to use `--scale-d 20` with the default output `model.json`:

  ```sh
  ./target/release/my_first_lm true-blue.txt --scale-d 20
  ```

  If you omit `--scale-d` entirely, all prefixes will use the `[0, 10^k - 1]` scaling.
- Generate the Booklet: The `book.typ` file is designed to read the statistics from a file named `out.json` in the same directory. Rename your generated JSON file to `out.json` (or create a symbolic link):

  ```sh
  # If you kept the default model.json name
  mv model.json out.json

  # Or if you generated a custom-named file in the previous step
  # mv true-blue.json out.json
  ```

  Now, compile the Typst file to create the PDF booklet:

  ```sh
  typst compile book.typ book.pdf
  ```

  The final, printable booklet will be in `book.pdf`. Print it out, cut it up, and follow the assembly instructions (you might need to devise these!) to create your physical language model.
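As an aside, here's a minimal Rust sketch of the `--scale-d` branching rule described above. This is an illustration of the rule, not the repo's actual implementation; the function name and signature are invented for this example:

```rust
/// Choose the scaled total for one prefix, following the --scale-d rule:
/// if the prefix has at most D unique followers, its counts are scaled to
/// sum to exactly D (a single dD roll picks the next word); otherwise they
/// are scaled to sum to 10^k - 1, where k is the number of digits in the
/// original total (k d10 rolls, read as digits, pick the next word).
fn scaled_total(follower_counts: &[u64], scale_d: Option<u64>) -> u64 {
    let original_total: u64 = follower_counts.iter().sum();
    match scale_d {
        Some(d) if (follower_counts.len() as u64) <= d => d,
        _ => {
            // k = number of decimal digits in the original total
            let k = original_total.to_string().len() as u32;
            10u64.pow(k) - 1
        }
    }
}
```

For example, a prefix whose followers were seen 4821 times in total has `k = 4`, so its counts would be scaled to sum to 9999 and you'd roll four d10 to pick the next word.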
The input data file must have a YAML frontmatter block with string values for the following three keys:

- the `title` of the text
- the `author` of the text
- the `url` of the text
These values are passed through to the book frontmatter in the final typeset pdf.
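For example (the values here are placeholders):

```yaml
---
title: "True Blue"
author: "A. Songwriter"
url: "https://example.com/true-blue.txt"
---
```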
The tokenizer is deliberately very basic---all punctuation (except for contraction apostrophes) is removed, and everything is lowercased except for a few exceptions to do with personal pronouns. This is to ensure the final model is as small as possible (for a given input text).
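A rough sketch of tokenization along these lines might look like the following. This is not the repo's actual code; in particular, keeping every apostrophe as a stand-in for contraction apostrophes and exempting only the pronoun "I" from lowercasing are simplifying assumptions:

```rust
// Simplified tokenizer sketch: strip punctuation (keeping apostrophes as an
// approximation of contraction apostrophes) and lowercase everything except
// the pronoun "I" (assumed here to be one of the exceptions).
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|word| {
            let cleaned: String = word
                .chars()
                .filter(|c| c.is_alphanumeric() || *c == '\'')
                .collect();
            if cleaned == "I" {
                cleaned
            } else {
                cleaned.to_lowercase()
            }
        })
        .filter(|token| !token.is_empty())
        .collect()
}
```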
Ben Swift
This work is a project of the Cybernetic Studio at the ANU School of Cybernetics.
Source code for this project is licensed under the MIT License. See the LICENSE file for details.
The typeset "N-gram model booklets" are licensed under a CC BY-NC 4.0 license.
Source text licenses used as input for the language model remain as described in their original sources.