Add 2-byte pair lookup table to skip hashmap for initial BPE merge scan #551

Open
augustasio wants to merge 2 commits into openai:main from augustasio:pair-table-optimization

@augustasio augustasio commented May 22, 2026

Summary

Adds a precomputed table mapping every 2-byte pair to its rank (Box<[Rank; 65536]>, ~256 KB) to CoreBPE. The encoding methods use it to skip the hashmap lookup during the hot initial adjacent-pair scan in both _byte_pair_merge (linear, pieces < 100 bytes) and _byte_pair_merge_large (heap-based, pieces ≥ 100 bytes). Subsequent merges (3+ byte spans) still go through the encoder hashmap, since those keys don't fit in a u16 index anyway.
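For readers who want the shape of the idea, here is a minimal sketch of how such a table could be built from the encoder map. It is not the code in this PR: the names build_pair_table and pair_index are hypothetical, the index layout (first byte in the high 8 bits) is an assumption, and std's HashMap stands in for whatever map the crate uses. Rank is assumed to be u32, which is what makes 65536 entries come to ~256 KB.

```rust
use std::collections::HashMap;
use std::convert::TryInto;

type Rank = u32; // assumption: 32-bit ranks, so 65536 entries = 256 KiB

/// Hypothetical index layout: first byte in the high 8 bits, second in the low 8.
fn pair_index(a: u8, b: u8) -> usize {
    ((a as usize) << 8) | b as usize
}

/// Build a 65536-entry table mapping every 2-byte sequence to its rank,
/// with Rank::MAX marking pairs that are not in the vocabulary.
fn build_pair_table(encoder: &HashMap<Vec<u8>, Rank>) -> Box<[Rank; 65536]> {
    // Allocate on the heap first, then convert the boxed slice to a boxed array.
    let mut table: Box<[Rank; 65536]> = vec![Rank::MAX; 65536]
        .into_boxed_slice()
        .try_into()
        .expect("length is exactly 65536");
    // One pass over the encoder entries; only 2-byte keys land in the table.
    for (bytes, &rank) in encoder {
        if let [a, b] = bytes.as_slice() {
            table[pair_index(*a, *b)] = rank;
        }
    }
    table
}

fn main() {
    let mut encoder: HashMap<Vec<u8>, Rank> = HashMap::new();
    encoder.insert(b"t".to_vec(), 0);
    encoder.insert(b"th".to_vec(), 2);
    encoder.insert(b"the".to_vec(), 3); // 3-byte key: stays hashmap-only
    let table = build_pair_table(&encoder);
    println!("rank(th) = {}", table[pair_index(b't', b'h')]);
    println!("he is absent: {}", table[pair_index(b'h', b'e')] == Rank::MAX);
}
```

Using Rank::MAX as the "absent" sentinel matches the convention mentioned in the appendix; everything else in the sketch is illustrative.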

Token output is byte-identical to vanilla. The optimization changes only how the initial 2-byte pair scan is implemented.

Headline result

+6.2% size-weighted aggregate encode_ordinary throughput across 104 MiB of corpus. The biggest wins are on CJK Wikipedia (+8.6% to +13.8%) and FLORES multilingual (+7.8% to +7.9%), where the merger dominates encode time. English Wikipedia gains +3.4%. Dense numeric content (-1.3%) and Python source (-0.8% to -3.0%) regress slightly.

Apple M4 base (4P + 6E cores), single-thread encode_ordinary via the Python wrapper, 3 rounds × 5 iterations per round, alternating between vanilla (PyPI tiktoken 0.13.0) and this patched branch to control for thermal drift.

Full per-corpus table in the appendix.

Correctness

Two layers of byte-equality verification:

  1. Rust path-equivalence tests (added in this PR, in the pair_table_equivalence module): assert that byte_pair_encode and byte_pair_encode_with_table produce identical output for piece lengths spanning the 100-byte linear-vs-heap dispatch cutoff (1, 2, 50, 98, 99, 100, 101, 200, 500, 1000), plus repeated-pair stress tests. 6 new tests, all of which pass alongside the 2 pre-existing tests.
  2. Full-corpus byte-equality across all four built-in encodings (r50k_base, p50k_base, cl100k_base, o200k_base) on ~238 MiB of text (32 files: multilingual Wikipedia, code samples, FLORES-200, procedural numeric / repetition / long-piece stress). Token sequence sha256 matches between vanilla PyPI tiktoken and this patched fork on all 128 (encoding × file) pairs.
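The path-equivalence idea can be sketched in isolation. The snippet below is not the test module from this PR: it uses a deliberately naive quadratic merge (bpe), a toy vocabulary (toy_encoder), and a hypothetical build_pair_table helper, and it has a single merge path rather than a linear/heap split. It only illustrates the property being asserted: with and without the table, the token output is identical across the same piece lengths.

```rust
use std::collections::HashMap;

type Rank = u32; // assumption: 32-bit ranks

/// Toy vocabulary: every single byte, plus a few hypothetical merges.
fn toy_encoder() -> HashMap<Vec<u8>, Rank> {
    let mut enc: HashMap<Vec<u8>, Rank> =
        (0..=255u8).map(|b| (vec![b], b as Rank)).collect();
    let merges: [&[u8]; 4] = [b"ab", b"ca", b"abc", b"abca"];
    for (i, tok) in merges.iter().enumerate() {
        enc.insert(tok.to_vec(), 256 + i as Rank);
    }
    enc
}

/// Hypothetical table builder: 2-byte keys only, Rank::MAX means "absent".
fn build_pair_table(enc: &HashMap<Vec<u8>, Rank>) -> Box<[Rank; 65536]> {
    let mut table = Box::new([Rank::MAX; 65536]);
    for (bytes, &rank) in enc {
        if let [a, b] = bytes.as_slice() {
            table[((*a as usize) << 8) | *b as usize] = rank;
        }
    }
    table
}

/// Naive BPE: repeatedly merge the leftmost adjacent pair with the lowest rank.
fn bpe(piece: &[u8], enc: &HashMap<Vec<u8>, Rank>, table: Option<&[Rank; 65536]>) -> Vec<Rank> {
    // bounds[k]..bounds[k + 1] delimits part k; start with one part per byte.
    let mut bounds: Vec<usize> = (0..=piece.len()).collect();
    // 2-byte spans use the table when one is supplied; everything else uses the map.
    let rank_of = |span: &[u8]| -> Rank {
        match (table, span) {
            (Some(t), [a, b]) => t[((*a as usize) << 8) | *b as usize],
            _ => enc.get(span).copied().unwrap_or(Rank::MAX),
        }
    };
    loop {
        let mut best: Option<(Rank, usize)> = None;
        for k in 0..bounds.len().saturating_sub(2) {
            let r = rank_of(&piece[bounds[k]..bounds[k + 2]]);
            if r != Rank::MAX && best.map_or(true, |(br, _)| r < br) {
                best = Some((r, k));
            }
        }
        match best {
            Some((_, k)) => {
                bounds.remove(k + 1); // fuse parts k and k + 1
            }
            None => break,
        }
    }
    bounds.windows(2).map(|w| enc[&piece[w[0]..w[1]]]).collect()
}

fn main() {
    let enc = toy_encoder();
    let table = build_pair_table(&enc);
    for &n in &[1usize, 2, 50, 98, 99, 100, 101, 200, 500, 1000] {
        let piece: Vec<u8> = b"abc".iter().cycle().take(n).copied().collect();
        assert_eq!(bpe(&piece, &enc, None), bpe(&piece, &enc, Some(&*table)));
    }
    println!("table and hashmap paths agree on every length");
}
```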

Disclosures

  • +256 KB heap per CoreBPE instance, built once at construction. Read-only after init, so Linux fork-based serving keeps it copy-on-write shared. For typical use (one tokenizer per process) the cost is negligible; systems that hold many tokenizers in one process should budget for it.
  • Construction overhead is under 1 ms on the largest vocab (o200k_base, ~200k entries) and within measurement noise on smaller ones. Full table in the appendix.
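The 256 KB figure follows directly from the table's type, assuming Rank is a 4-byte integer. A quick check (pair_table_bytes is a hypothetical helper, not part of the PR):

```rust
use std::mem::size_of;

type Rank = u32; // assumption: 4-byte ranks

/// Size in bytes of one pair table: 65536 entries of size_of::<Rank>() each.
fn pair_table_bytes() -> usize {
    size_of::<[Rank; 65536]>()
}

fn main() {
    let bytes = pair_table_bytes();
    assert_eq!(bytes, 65536 * 4);
    println!("{} bytes = {} KiB per CoreBPE instance", bytes, bytes / 1024);
}
```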

Backward compatibility

  • New pair_table field on CoreBPE is private to the module. External code constructs via CoreBPE::new as before; no API change visible to users.
  • _byte_pair_merge and _byte_pair_merge_large gain an optional pair_table: Option<&[Rank; 65536]> parameter. Both are private functions; signature change is internal only.
  • byte_pair_encode and byte_pair_split (public) keep their existing signatures and pass None internally, preserving back-compat for external Rust users.
  • A new private helper byte_pair_encode_with_table is added (parallel to byte_pair_encode) for CoreBPE methods to use without changing the public function.
  • No new public APIs, no behavior changes, no dependency additions.
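As a rough illustration of the optional-parameter design described above, the sketch below shows one way an Option<&[Rank; 65536]> could select between the table and the hashmap during the initial scan. The functions pair_rank and initial_scan are hypothetical stand-ins, not the private functions touched by this PR, and the index layout is an assumption.

```rust
use std::collections::HashMap;

type Rank = u32; // assumption: 32-bit ranks

/// Rank of the 2-byte pair starting at `i`, or Rank::MAX if it is not a token.
/// With `Some(table)` the hashmap is never consulted.
fn pair_rank(
    piece: &[u8],
    i: usize,
    encoder: &HashMap<Vec<u8>, Rank>,
    pair_table: Option<&[Rank; 65536]>,
) -> Rank {
    match pair_table {
        Some(table) => table[((piece[i] as usize) << 8) | piece[i + 1] as usize],
        None => encoder.get(&piece[i..i + 2]).copied().unwrap_or(Rank::MAX),
    }
}

/// Initial adjacent-pair scan: one (position, rank) entry per 2-byte window.
fn initial_scan(
    piece: &[u8],
    encoder: &HashMap<Vec<u8>, Rank>,
    pair_table: Option<&[Rank; 65536]>,
) -> Vec<(usize, Rank)> {
    (0..piece.len().saturating_sub(1))
        .map(|i| (i, pair_rank(piece, i, encoder, pair_table)))
        .collect()
}

fn main() {
    let mut encoder: HashMap<Vec<u8>, Rank> = HashMap::new();
    encoder.insert(b"ab".to_vec(), 5);
    // Toy table with the single known pair filled in by hand.
    let mut table = Box::new([Rank::MAX; 65536]);
    table[((b'a' as usize) << 8) | b'b' as usize] = 5;

    let via_map = initial_scan(b"abc", &encoder, None);
    let via_table = initial_scan(b"abc", &encoder, Some(&*table));
    assert_eq!(via_map, via_table);
    println!("{:?}", via_table);
}
```

Passing None reproduces the pre-existing behavior, which is why callers of the public functions see no change.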

Testing

  • Rust: cargo test passes 8/8 tests (2 pre-existing + 6 new path-equivalence tests covering both the linear and heap merge paths).
  • Python: pytest tests/ passes 33/33 (hypothesis-based property tests on r50k_base + cl100k_base, roundtrip tests, batch encoding, catastrophic-repetition stress, special tokens, pickling).
  • Byte-equality: ~238 MiB curated corpus across 4 encodings yields identical token sequences vs vanilla PyPI tiktoken (sha256 hash match on all 128 pairs).

Appendix

A. Methodology

Apple M4 base (4P + 6E cores). Single-thread encode_ordinary via tiktoken's Python wrapper (which is itself a thin PyO3 binding around the Rust CoreBPE::encode_ordinary). 3 rounds × 5 iterations per round per file, alternating between vanilla and patched runs to control for thermal drift. The per-file time is the median across iterations within a round, then the median across rounds; the aggregate is sum(bytes) / sum(median_times).
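The aggregation rule can be written out as a short sketch. The benchmark harness itself is not part of this PR, so the function names below (median, file_time, aggregate) and the timings in main are hypothetical; only the formula is taken from the description above.

```rust
/// Median of a non-empty slice (mean of the two middle values for even lengths).
fn median(values: &[f64]) -> f64 {
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).expect("timings must not be NaN"));
    let mid = v.len() / 2;
    if v.len() % 2 == 1 {
        v[mid]
    } else {
        (v[mid - 1] + v[mid]) / 2.0
    }
}

/// Per-file time: median within each round, then median across rounds.
fn file_time(rounds: &[Vec<f64>]) -> f64 {
    let per_round: Vec<f64> = rounds.iter().map(|r| median(r)).collect();
    median(&per_round)
}

/// Size-weighted aggregate throughput: sum(bytes) / sum(median_times).
/// Each entry is (file size, per-round iteration timings).
fn aggregate(files: &[(f64, Vec<Vec<f64>>)]) -> f64 {
    let size: f64 = files.iter().map(|(s, _)| s).sum();
    let time: f64 = files.iter().map(|(_, r)| file_time(r)).sum();
    size / time
}

fn main() {
    // Made-up sizes (MiB) and timings (s), 3 rounds x 3 iterations per file.
    let files = vec![
        (10.0, vec![vec![1.0, 2.0, 3.0], vec![2.0, 2.0, 2.0], vec![5.0, 1.0, 2.0]]),
        (30.0, vec![vec![4.0, 6.0, 5.0], vec![6.0, 6.0, 6.0], vec![7.0, 8.0, 9.0]]),
    ];
    println!("aggregate = {} MiB/s", aggregate(&files));
}
```

Dividing total bytes by total time, rather than averaging per-file throughputs, is what makes the aggregate size-weighted: large files contribute in proportion to the time they take.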

Vanilla side: PyPI tiktoken==0.13.0 (installed from the package index in a clean venv).
Patched side: editable install of this branch (pip install -e .).

Byte-equality validated by encoding the full ~238 MiB curated corpus with both implementations and verifying token sequence sha256 hashes match across all (encoding × file) pairs.

B. Corpora used in this PR's measurements

8 files for the perf bench (selected to cover diverse content):

| Corpus | Tier | Source | What it tests |
| --- | --- | --- | --- |
| wiki_zh | 1 MB & 15 MB | wikimedia/wikipedia (zh) | Chinese Wikipedia, CJK-heavy |
| flores200 | 1 MB & 15 MB | Meta FLORES-200 | Concatenated devtest of 200 languages |
| code_python | 1 MB & 15 MB | bigcode/the-stack-smol | Python source code |
| wiki_en | 15 MB | wikimedia/wikipedia (en) | English Wikipedia |
| synthetic_numeric | 1 MB | procedural (CSV / JSON / hex / UUIDs / IPv4 / dates) | Numeric-heavy content where 2-byte digit pairs are rare in the encoder |

The byte-equality run used a larger 32-file corpus: 16 files at 1 MB each (the five corpora listed above, plus code_go/javascript/rust, wiki_ar/hi/ja/ru, wikitext_103, and three further procedural stress corpora: stress_doc, merger_long_pieces, merger_repeated_pieces) and their 15 MB equivalents.

C. Procedural corpora (detail + samples)

The four procedural corpora are generated locally. Each targets a specific algorithmic regime.

stress_doc: mixed-content document (random words, numbers, punctuation) with no paragraph breaks. Probes the long-input regime where fancy_regex can hit catastrophic-backtracking patterns.

sdmrulm ljwotlewvjc ) pnojbbyvh qzvrqbvlqsi ) wubpj bzrhxsjlmkuj . ; !
lpdcuvdauvu mysvwkscfyk ( 5492019 54359 evszryfefjz lba - - iflkgcfokjce
inodggcm 61 ? fsmqhrhm 98595 - 655044427 8691240 . yqvpa ...

synthetic_numeric: procedurally generated CSV rows, JSON logs, hex dumps, ISO timestamps, IPv4 addresses, and UUIDs.

57.12.140.125
{"k0":67.66994874229113,"k1":91161,"k2":23.266089339073957}
649.8844,236696312,805.8193,298362082,379.9273,"item711"
0x1e14ae531acfdd67d25aba1a8374eb35c0dc61e9b37ea22b8b695f27cd22a969...
58478be9-3761-a56d-0aa0-bc138227d1cd
Mon Dec 11 2020 01:37:30 UTC

merger_long_pieces: random alphanumeric runs of 100 to 500 bytes per piece. Hits the _byte_pair_merge_large heap path.

GRhxrnJgWQsNKVPwvrjFKqt6YrvfDliUWc7QSTR0YwDCXjD9T9M8Tra2JLIr6IzwgnS9
VPDi5TaO5deAlGQ18dD889UWuXeEys2Btiii5yvBqJiUFwcoEF1Q9Ci5wg74ja4z... (continues)

merger_repeated_pieces: ~10 unique pieces of 20 to 50 bytes, sampled with repetition.

uieahkwaxsqlfhgqyfazpgmefbcszuerrswx vjkjswbpvrgjxfyknsrpwqxqpbxffbqjqvyuvjbu
mxokvbjscokfktrpukpcrrquzszrhwrqdoioyjado nytzghujarzemafrxxwgcffslrhsfxqom
eyktdqsjntsjmhgcgtxwewy eyktdqsjntsjmhgcgtxwewy ...   (same piece, repeats)

D. Per-file throughput (single-thread encode_ordinary, o200k_base)

Median across 3 rounds × 5 iterations per round, alternating between vanilla and patched runs (to control for thermal drift on the CPU).

| File | Vanilla (MiB/s) | Patched (MiB/s) | Δ |
| --- | --- | --- | --- |
| wiki_zh (1 MB) | 18.5 | 21.1 | +13.8% |
| flores200 (1 MB) | 23.9 | 25.7 | +7.9% |
| wiki_zh (15 MB) | 18.3 | 19.9 | +8.6% |
| flores200 (15 MB) | 21.6 | 23.3 | +7.8% |
| wiki_en (15 MB) | 29.2 | 30.2 | +3.4% |
| code_python (15 MB) | 25.7 | 25.5 | -0.8% |
| synthetic_numeric (1 MB) | 16.3 | 16.1 | -1.3% |
| code_python (1 MB) | 25.9 | 25.1 | -3.0% |
| Aggregate (size-weighted, 104 MiB) | 22.8 | 24.2 | +6.2% |

A slight regression on dense numeric content is expected: most 2-byte digit pairs aren't in the encoder vocab (Rank::MAX in the table), so the table lookup replaces few useful hashmap calls. Code is dominated by byte_pair_encode_with_table calls whose pieces match a token directly and bypass the hot path, so there is little for the table to speed up and the result stays within a few percent of vanilla.

E. Construction-time overhead (per-encoding)

Measured separately with criterion (100 samples × 5 s) on the Apple M4:

| Encoding | Vocab | Vanilla | + pair table | Δ |
| --- | --- | --- | --- | --- |
| o200k_base | ~200k | 48.7 ms | 49.6 ms | +0.9 ms |
| cl100k_base | ~100k | 23.1 ms | 23.4 ms | +0.3 ms |
| p50k_base | ~50k | 11.4 ms | 11.3 ms | noise |
| r50k_base | ~50k | 11.2 ms | 11.5 ms | +0.3 ms |

Construction time scales linearly with vocab size; building the pair table adds one pass over the encoder entries, which stays under 1 ms even on the largest vocab.

@augustasio augustasio force-pushed the pair-table-optimization branch from 97876da to 529021b Compare May 22, 2026 20:02
Covers both code paths: linear `_byte_pair_merge` (pieces < 100 bytes) and
heap-based `_byte_pair_merge_large` (pieces >= 100 bytes), with edge cases
straddling the 100-byte cutoff.
@augustasio augustasio force-pushed the pair-table-optimization branch from 529021b to 4d95458 Compare May 22, 2026 20:05