Can you make your own language model---from scratch---in 20 minutes with just pen and paper? This repo contains a couple of software tools to help answer that question. (Spoiler: yep.)
The repo contains:
- a Typst file (`lm-grid.typ`) and instructions (`instructions.typ`) for creating a word co-occurrence matrix that can be used as a bigram language model
- a command-line tool (written in Rust) for processing a text corpus and typesetting it into an "N-gram model book", which can then be used for all sorts of fun things (e.g. a dice-roll-powered proto-ChatGPT)
Why? Because making a simple version of something by hand helps you better understand what's going on.
This is a Cybernetic Studio artefact by Ben Swift as part of the Human-Scale AI project.
You will need:

- Rust: install the Rust toolchain (including `cargo`) from https://rustup.rs/
- Typst: install Typst from https://github.com/typst/typst/
To generate the grid & instructions for the DIY version of this procedure, simply typeset the `lm-grid.typ` and `instructions.typ` files and you're done.
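For example, using the Typst CLI (the output filenames here are just a suggestion):

```sh
typst compile lm-grid.typ lm-grid.pdf
typst compile instructions.typ instructions.pdf
```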
To produce an N-gram book, read on...
The process involves two main steps: generating the N-gram statistics from your text file using the Rust program (writing them into a file called `model.json`), and then typesetting those statistics into a booklet using Typst.
- Build the Statistics Generator: Navigate to the project directory in your terminal and build the Rust executable:

  ```sh
  cargo build --release
  ```

  This will create an executable file at `target/release/my_first_lm`.
- Generate N-gram Statistics: Run the compiled executable, providing your input text file and optionally specifying an output JSON file:

  ```sh
  ./target/release/my_first_lm <input_text_file> [OPTIONS]
  ```

  Arguments:

  - `<input_text_file>`: Path to the text file you want to analyze.

  Options:

  - `-o, --output <output_json_file>`: Path where the generated N-gram statistics (in JSON format) will be saved. Defaults to `model.json`.
  - `-n, --n <N>`: The size of the N-gram (e.g. `2` for bigrams, `3` for trigrams). Defaults to `2`.
  - `--scale-d <D>`: Optional integer value `D` to control count scaling (see the sketch after this list).
    - If provided, and a prefix has a number of unique followers less than or equal to `D`, its follower counts are scaled to the range `[1, D]`. The total count for the prefix in the JSON will be `D`. This is useful for tailoring the output to the specific die you intend to use for the random sampling.
    - If a prefix has more unique followers than `D`, or if `--scale-d` is not provided, its follower counts are scaled to the range `[0, 10^k - 1]`, where `k` is the number of digits in the original total follower count for that prefix. The total count in the JSON becomes `10^k - 1`. To sample a next word from these prefixes, roll the correct amount of d10s (indicated by a ♢ next to the word) and read the rolls as the digits of a `k`-digit number.
  Example (generating d20-style scaled bigram statistics): let's say you have your text in `true-blue.txt` and your die of choice is a d20:

  ```sh
  ./target/release/my_first_lm true-blue.txt -o true-blue.json -n 2 --scale-d 20
  ```

  This command reads `true-blue.txt` and calculates bigram statistics. For prefixes with 20 or fewer unique followers, it scales counts to the range `[1, 20]`; for other prefixes, it uses the `[0, 10^k - 1]` scaling. The result is saved to `true-blue.json`.

  Or, to use `--scale-d 20` with the default output `model.json`:

  ```sh
  ./target/release/my_first_lm true-blue.txt --scale-d 20
  ```

  If you omit `--scale-d` entirely, all prefixes will use the `[0, 10^k - 1]` scaling.
- Generate the Booklet: The `book.typ` file is designed to read the statistics from a file named `out.json` in the same directory. Rename your generated JSON file to `out.json` (or create a symbolic link):

  ```sh
  # If you kept the default model.json name
  mv model.json out.json

  # Or if you generated a custom-named file in the previous step
  # mv true-blue.json out.json
  ```

  Now, compile the Typst file to create the PDF booklet:

  ```sh
  typst compile book.typ book.pdf
  ```

  The final, printable booklet will be in `book.pdf`. Print it out, cut it up, and follow the assembly instructions (you might need to devise these!) to create your physical language model.
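As an aside, here's a minimal Rust sketch of the `--scale-d` branching rule described above. This is an illustration of the rule, not the repo's actual implementation; the function name and signature are invented for this example:

```rust
/// Choose the scaled total for one prefix, following the --scale-d rule:
/// if the prefix has at most D unique followers, its counts are scaled to
/// sum to exactly D (a single dD roll picks the next word); otherwise they
/// are scaled to sum to 10^k - 1, where k is the number of digits in the
/// original total (k d10 rolls, read as digits, pick the next word).
fn scaled_total(follower_counts: &[u64], scale_d: Option<u64>) -> u64 {
    let original_total: u64 = follower_counts.iter().sum();
    match scale_d {
        Some(d) if (follower_counts.len() as u64) <= d => d,
        _ => {
            // k = number of decimal digits in the original total
            let k = original_total.to_string().len() as u32;
            10u64.pow(k) - 1
        }
    }
}
```

For example, a prefix whose followers were seen 4821 times in total has `k = 4`, so its counts would be scaled to sum to 9999 and you'd roll four d10 to pick the next word.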
The input data file must have a YAML frontmatter block with string values for the following three keys:

- the `title` of the text
- the `author` of the text
- the `url` of the text
These values are passed through to the book frontmatter in the final typeset pdf.
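For example (the values here are placeholders):

```yaml
---
title: "True Blue"
author: "A. Songwriter"
url: "https://example.com/true-blue.txt"
---
```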
The tokenizer is deliberately very basic---all punctuation (except for contraction apostrophes) is removed, and everything is lowercased except for a few exceptions to do with personal pronouns. This is to ensure the final model is as small as possible (for a given input text).
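A rough sketch of tokenization along these lines might look like the following. This is not the repo's actual code; in particular, keeping every apostrophe as a stand-in for contraction apostrophes and exempting only the pronoun "I" from lowercasing are simplifying assumptions:

```rust
// Simplified tokenizer sketch: strip punctuation (keeping apostrophes as an
// approximation of contraction apostrophes) and lowercase everything except
// the pronoun "I" (assumed here to be one of the exceptions).
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|word| {
            let cleaned: String = word
                .chars()
                .filter(|c| c.is_alphanumeric() || *c == '\'')
                .collect();
            if cleaned == "I" {
                cleaned
            } else {
                cleaned.to_lowercase()
            }
        })
        .filter(|token| !token.is_empty())
        .collect()
}
```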
Ben Swift
This work is a project of the Cybernetic Studio at the ANU School of Cybernetics.
Source code for this project is licensed under the MIT License. See the LICENSE file for details.
The typeset "N-gram model booklets" are licensed under a CC BY-NC 4.0 license.
Source text licenses used as input for the language model remain as described in their original sources.