Add 2-byte pair lookup table to skip hashmap for initial BPE merge scan by augustasio · Pull Request #551 · openai/tiktoken

augustasio · 2026-05-22T12:50:29Z

Summary

Adds a precomputed 2-byte-pair-to-rank table (Box<[Rank; 65536]>, ~256 KB) on CoreBPE. The encoding methods use it to skip the hashmap lookup during the hot initial adjacent-pair scan in both _byte_pair_merge (linear, pieces < 100 bytes) and _byte_pair_merge_large (heap-based, pieces ≥ 100 bytes). Subsequent merges (3+ byte spans) still go through the encoder hashmap, since those keys don't fit in a u16 index.
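
For readers who want the shape of the idea without opening the diff, here is a minimal sketch of building and querying such a table. It is not the code in this PR: the function names and the index layout (first byte in the high 8 bits) are assumptions made for illustration. Only the table type and the Rank::MAX sentinel for absent pairs come from the description.

```rust
use std::collections::HashMap;

type Rank = u32;

/// Number of distinct 2-byte pairs (256 * 256).
const PAIR_TABLE_LEN: usize = 1 << 16;

/// Assumed index layout: first byte in the high 8 bits, second in the low 8.
fn pair_index(a: u8, b: u8) -> usize {
    ((a as usize) << 8) | (b as usize)
}

/// Build a dense table from every 2-byte key in the encoder (hypothetical helper).
/// Pairs that are not in the vocabulary keep the `Rank::MAX` sentinel.
fn build_pair_table(encoder: &HashMap<Vec<u8>, Rank>) -> Box<[Rank; PAIR_TABLE_LEN]> {
    let mut table = Box::new([Rank::MAX; PAIR_TABLE_LEN]);
    // One pass over the encoder; keys of any other length are skipped.
    for (bytes, &rank) in encoder {
        if let [a, b] = bytes.as_slice() {
            table[pair_index(*a, *b)] = rank;
        }
    }
    table
}

/// Rank of the adjacent pair starting at `i`, read without a hashmap lookup.
/// The caller must ensure `i + 1 < piece.len()`.
fn pair_rank(table: &[Rank; PAIR_TABLE_LEN], piece: &[u8], i: usize) -> Rank {
    table[pair_index(piece[i], piece[i + 1])]
}
```

The lookup is a single indexed load, so the initial scan does no hashing. Anything longer than two bytes has no slot in the table and still needs the hashmap.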

Token output is byte-identical to vanilla. The optimization changes only how the initial 2-byte pair scan is implemented.

Headline result

+6.2% size-weighted aggregate encode_ordinary throughput across 104 MiB of corpus on this codebase. Biggest wins on CJK Wikipedia (+8.6% to +13.8%) and FLORES multilingual (+7.8% to +7.9%), where the merger dominates encode time. Slight regressions on dense numeric content (-1.3%) and Python source (-0.8% to -3.0%).

Apple M4 base (4P + 6E cores), single-thread encode_ordinary via Python wrapper, 3 rounds × 5 iterations per round, alternating between vanilla (PyPI tiktoken 0.13.0) and this patched branch to control for CPU thermal drift.

Full per-corpus table in the appendix.

Correctness

Two layers of byte-equality verification:

Rust path-equivalence tests (added in this PR, in pair_table_equivalence module): assert byte_pair_encode and byte_pair_encode_with_table produce identical output for piece lengths spanning the 100-byte linear-vs-heap dispatch cutoff (1, 2, 50, 98, 99, 100, 101, 200, 500, 1000), plus repeated-pair stress tests. 6 new tests, all pass alongside the 2 pre-existing tests.
Full-corpus byte-equality across all four built-in encodings (r50k_base, p50k_base, cl100k_base, o200k_base) on ~238 MiB of text (32 files: multilingual Wikipedia, code samples, FLORES-200, procedural numeric / repetition / long-piece stress). Token sequence sha256 matches between vanilla PyPI tiktoken and this patched fork on all 128 (encoding × file) pairs.
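
To illustrate the kind of check the path-equivalence tests perform, here is a toy sketch. It compares only the initial adjacent-pair scan (hashmap path vs table path) on a made-up vocabulary, not the full byte_pair_encode output the real tests assert on. Every name in it is hypothetical.

```rust
use std::collections::HashMap;

type Rank = u32;

/// Initial adjacent-pair ranks via hashmap lookups (the baseline path).
fn scan_with_hashmap(encoder: &HashMap<Vec<u8>, Rank>, piece: &[u8]) -> Vec<Rank> {
    piece
        .windows(2)
        .map(|w| encoder.get(w).copied().unwrap_or(Rank::MAX))
        .collect()
}

/// The same scan via a dense table indexed by (first << 8) | second.
fn scan_with_table(table: &[Rank], piece: &[u8]) -> Vec<Rank> {
    piece
        .windows(2)
        .map(|w| table[((w[0] as usize) << 8) | (w[1] as usize)])
        .collect()
}

/// Dense 65536-entry table holding the rank of every 2-byte key.
fn build_table(encoder: &HashMap<Vec<u8>, Rank>) -> Vec<Rank> {
    let mut table = vec![Rank::MAX; 1 << 16];
    for (key, &rank) in encoder {
        if let [a, b] = key.as_slice() {
            table[((*a as usize) << 8) | (*b as usize)] = rank;
        }
    }
    table
}

/// Deterministic pseudo-random bytes over {a, b, c, d} (simple LCG, illustration only).
fn sample_piece(len: usize, seed: u64) -> Vec<u8> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            b'a' + ((state >> 33) % 4) as u8
        })
        .collect()
}

/// True if both scans agree for one sampled piece of each requested length.
fn paths_agree(encoder: &HashMap<Vec<u8>, Rank>, lengths: &[usize]) -> bool {
    let table = build_table(encoder);
    lengths.iter().all(|&len| {
        let piece = sample_piece(len, len as u64);
        scan_with_hashmap(encoder, &piece) == scan_with_table(&table, &piece)
    })
}
```

Calling paths_agree with the lengths 1, 2, 50, 98, 99, 100, 101, 200, 500, 1000 mirrors the cutoff-straddling coverage described above, though in this toy the 100-byte boundary has no special meaning because no merge dispatch is modelled.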

Disclosures

+256 KB heap per CoreBPE instance, built once at construction. Read-only after init, so Linux fork-based serving keeps it copy-on-write shared. For typical use (one tokenizer per process) the cost is invisible; multi-tokenizer-per-process systems should know.
Construction overhead is sub-millisecond. Under 1 ms on the largest vocab (o200k_base, ~200k entries), within measurement noise on smaller ones. Full table in the appendix.
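
The 256 KB figure follows directly from the table's type. A small sketch of the arithmetic, assuming Rank is a 4-byte unsigned integer (the alias below is declared locally for illustration):

```rust
use std::mem::size_of;

// Assumption: Rank is a 32-bit unsigned integer.
type Rank = u32;

/// Bytes occupied by one pair table: 65536 entries * 4 bytes each.
const PAIR_TABLE_BYTES: usize = size_of::<[Rank; 65536]>();

/// The same size expressed in KiB.
fn pair_table_kib() -> usize {
    PAIR_TABLE_BYTES / 1024
}
```

Under that assumption the table is 262,144 bytes, i.e. 256 KiB, per CoreBPE instance.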

Backward compatibility

New pair_table field on CoreBPE is private to the module. External code constructs via CoreBPE::new as before; no API change visible to users.
_byte_pair_merge and _byte_pair_merge_large gain an optional pair_table: Option<&[Rank; 65536]> parameter. Both are private functions; signature change is internal only.
byte_pair_encode and byte_pair_split (public) keep their existing signatures and pass None internally, preserving back-compat for external Rust users.
A new private helper byte_pair_encode_with_table is added (parallel to byte_pair_encode) for CoreBPE methods to use without changing the public function.
No new public APIs, no behavior changes, no dependency additions.
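
The compatibility pattern described above can be sketched as follows. This is a simplified stand-in, not the PR's code: the function names are hypothetical and the body covers only the initial pair scan. What it shows is a public function keeping its signature and passing None, while a private sibling passes Some(table) into a shared implementation.

```rust
use std::collections::HashMap;

type Rank = u32;

/// Rank of one 2-byte pair: table lookup if a table is supplied, hashmap otherwise.
fn initial_pair_rank(
    ranks: &HashMap<Vec<u8>, Rank>,
    pair_table: Option<&[Rank; 65536]>,
    a: u8,
    b: u8,
) -> Rank {
    match pair_table {
        Some(table) => table[((a as usize) << 8) | (b as usize)],
        None => ranks.get(&[a, b][..]).copied().unwrap_or(Rank::MAX),
    }
}

/// Shared implementation with the optional table parameter (private).
fn initial_ranks_impl(
    ranks: &HashMap<Vec<u8>, Rank>,
    piece: &[u8],
    pair_table: Option<&[Rank; 65536]>,
) -> Vec<Rank> {
    piece
        .windows(2)
        .map(|w| initial_pair_rank(ranks, pair_table, w[0], w[1]))
        .collect()
}

/// Stand-in for a public entry point: unchanged signature, passes `None`.
pub fn initial_ranks(ranks: &HashMap<Vec<u8>, Rank>, piece: &[u8]) -> Vec<Rank> {
    initial_ranks_impl(ranks, piece, None)
}

/// Stand-in for the private table-aware helper used by the owning struct.
fn initial_ranks_with_table(
    ranks: &HashMap<Vec<u8>, Rank>,
    piece: &[u8],
    table: &[Rank; 65536],
) -> Vec<Rank> {
    initial_ranks_impl(ranks, piece, Some(table))
}
```

External callers of the public function never see the table, so their code compiles and behaves as before.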

Testing

Rust: cargo test passes 8/8 tests (2 pre-existing + 6 new path-equivalence tests covering both the linear and heap merge paths).
Python: pytest tests/ passes 33/33 (hypothesis-based property tests on r50k_base + cl100k_base, roundtrip tests, batch encoding, catastrophic-repetition stress, special tokens, pickling).
Byte-equality: ~238 MiB curated corpus across 4 encodings yields identical token sequences vs vanilla PyPI tiktoken (sha256 hash match on all 128 pairs).

Appendix

A. Methodology

Apple M4 base (4P + 6E cores). Single-thread encode_ordinary via tiktoken's Python wrapper (which is itself a thin PyO3 binding around the Rust CoreBPE::encode_ordinary). 3 rounds × 5 iterations per round per file, alternating between vanilla and patched runs to control for CPU thermal drift. Median across iterations within a round, then median across rounds; aggregate is sum(bytes) / sum(median_times).
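
The aggregation rule above can be written out as a short sketch. The benchmark harness itself is not part of this PR, so the function names and the data layout (one Vec of timings per round, in seconds) are assumptions for illustration.

```rust
/// Median of a non-empty slice of timings (seconds).
fn median(values: &[f64]) -> f64 {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        // Even count: average the two middle values.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    }
}

/// Per-file time: median within each round, then median across rounds.
fn file_time(rounds: &[Vec<f64>]) -> f64 {
    let per_round: Vec<f64> = rounds.iter().map(|r| median(r)).collect();
    median(&per_round)
}

/// Size-weighted aggregate throughput in bytes per second:
/// sum(bytes) / sum(median_times) over all files.
fn aggregate_throughput(files: &[(u64, Vec<Vec<f64>>)]) -> f64 {
    let total_bytes: u64 = files.iter().map(|(bytes, _)| *bytes).sum();
    let total_time: f64 = files.iter().map(|(_, rounds)| file_time(rounds)).sum();
    total_bytes as f64 / total_time
}
```

Because the aggregate divides total bytes by total time, large files weigh more than small ones, which is why it differs from a plain average of the per-file deltas.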

Vanilla side: PyPI tiktoken==0.13.0 (installed from the package index in a clean venv).
Patched side: editable install of this branch (pip install -e .).

Byte-equality validated by encoding the full ~238 MiB curated corpus with both implementations and verifying token sequence sha256 hashes match across all (encoding × file) pairs.

B. Corpora used in this PR's measurements

8 files for the perf bench (selected to cover diverse content):

Corpus	Tier	Source	What it tests
`wiki_zh`	1 MB & 15 MB	wikimedia/wikipedia (zh)	Chinese Wikipedia, CJK-heavy
`flores200`	1 MB & 15 MB	Meta FLORES-200	Concatenated devtest of 200 languages
`code_python`	1 MB & 15 MB	bigcode/the-stack-smol	Python source code
`wiki_en`	15 MB	wikimedia/wikipedia (en)	English Wikipedia
`synthetic_numeric`	1 MB	procedural (CSV / JSON / hex / UUIDs / IPv4 / dates)	Numeric-heavy content where 2-byte digit pairs are rare in the encoder

The byte-equality run used a larger 32-file corpus: 16 files at 1 MB each (the per-file content listed above, plus code_go/javascript/rust, FLORES, wiki_ar/hi/ja/ru, wikitext_103, plus 4 procedural stress corpora: stress_doc, synthetic_numeric, merger_long_pieces, merger_repeated_pieces) and the 15 MB equivalents.

C. Procedural corpora (detail + samples)

The four procedural corpora are generated locally. Each targets a specific algorithmic regime.

stress_doc: mixed-content document (random words, numbers, punctuation) with no paragraph breaks. Probes the long-input regime where fancy_regex can hit catastrophic-backtracking patterns.

sdmrulm ljwotlewvjc ) pnojbbyvh qzvrqbvlqsi ) wubpj bzrhxsjlmkuj . ; !
lpdcuvdauvu mysvwkscfyk ( 5492019 54359 evszryfefjz lba - - iflkgcfokjce
inodggcm 61 ? fsmqhrhm 98595 - 655044427 8691240 . yqvpa ...

synthetic_numeric: procedurally generated CSV rows, JSON logs, hex dumps, ISO timestamps, IPv4 addresses, and UUIDs.

57.12.140.125
{"k0":67.66994874229113,"k1":91161,"k2":23.266089339073957}
649.8844,236696312,805.8193,298362082,379.9273,"item711"
0x1e14ae531acfdd67d25aba1a8374eb35c0dc61e9b37ea22b8b695f27cd22a969...
58478be9-3761-a56d-0aa0-bc138227d1cd
Mon Dec 11 2020 01:37:30 UTC

merger_long_pieces: random alphanumeric runs of 100 to 500 bytes per piece. Hits the _byte_pair_merge_large heap path.

GRhxrnJgWQsNKVPwvrjFKqt6YrvfDliUWc7QSTR0YwDCXjD9T9M8Tra2JLIr6IzwgnS9
VPDi5TaO5deAlGQ18dD889UWuXeEys2Btiii5yvBqJiUFwcoEF1Q9Ci5wg74ja4z... (continues)

merger_repeated_pieces: ~10 unique pieces of 20 to 50 bytes, sampled with repetition.

uieahkwaxsqlfhgqyfazpgmefbcszuerrswx vjkjswbpvrgjxfyknsrpwqxqpbxffbqjqvyuvjbu
mxokvbjscokfktrpukpcrrquzszrhwrqdoioyjado nytzghujarzemafrxxwgcffslrhsfxqom
eyktdqsjntsjmhgcgtxwewy eyktdqsjntsjmhgcgtxwewy ...   (same piece, repeats)

D. Per-file throughput (single-thread `encode_ordinary`, o200k_base)

Median across 3 rounds × 5 iterations per round, alternating between vanilla and patched runs (to control for thermal drift on the CPU).

File	Vanilla (MiB/s)	Patched (MiB/s)	Δ
`wiki_zh` (1 MB)	18.5	21.1	+13.8%
`flores200` (1 MB)	23.9	25.7	+7.9%
`wiki_zh` (15 MB)	18.3	19.9	+8.6%
`flores200` (15 MB)	21.6	23.3	+7.8%
`wiki_en` (15 MB)	29.2	30.2	+3.4%
`code_python` (15 MB)	25.7	25.5	-0.8%
`synthetic_numeric` (1 MB)	16.3	16.1	-1.3%
`code_python` (1 MB)	25.9	25.1	-3.0%
Aggregate (size-weighted, 104 MiB)	22.8	24.2	+6.2%

Slight regression on dense numeric content is expected: most 2-byte digit pairs aren't in the encoder vocab (Rank::MAX in the table), so the table lookup avoids fewer useful hashmap calls. Code is dominated by byte_pair_encode_with_table cases that bypass the hot path (pieces matching tokens directly), so there is little for the table to speed up there either.

E. Construction-time overhead (per-encoding)

Measured separately, criterion 100 samples × 5 s on Apple M4:

Encoding	Vocab	Vanilla	+ pair table	Δ
`o200k_base`	~200k	48.7 ms	49.6 ms	+0.9 ms
`cl100k_base`	~100k	23.1 ms	23.4 ms	+0.3 ms
`p50k_base`	~50k	11.4 ms	11.3 ms	noise
`r50k_base`	~50k	11.2 ms	11.5 ms	+0.3 ms

Scales linearly with vocab size (one pass over the encoder entries). Sub-millisecond on the largest vocab.

Add 2-byte pair lookup table to skip hashmap for initial BPE merge scan

53c13ab

augustasio force-pushed the pair-table-optimization branch from 97876da to 529021b on May 22, 2026 20:02

Test: assert pair-table path produces identical output to hashmap path

4d95458

Covers both code paths: linear `_byte_pair_merge` (pieces < 100 bytes) and heap-based `_byte_pair_merge_large` (pieces >= 100 bytes), with edge cases straddling the 100-byte cutoff.

augustasio force-pushed the pair-table-optimization branch from 529021b to 4d95458 on May 22, 2026 20:05

augustasio wants to merge 2 commits into openai:main from augustasio:pair-table-optimization
