Add lexer fast path for o200k_base, cl100k_base, and gpt-2 pretokenization patterns; custom regexes fall back to fancy_regex by augustasio · Pull Request #552 · openai/tiktoken

augustasio · 2026-05-22T16:02:31Z

Summary

Replaces fancy_regex::Regex::find_iter with a hand-coded state-machine lexer for the three canonical OpenAI BPE pretokenization patterns. At CoreBPE::new the pattern string is matched against the known canonical strings; if it matches, the lexer is used. Otherwise the existing fancy_regex code path is used unchanged.

Token output is byte-identical to vanilla. The optimization changes only how pretokenization splits are produced, not which splits.

Patterns covered

| Pattern string | Used by |
| --- | --- |
| `PAT_STR_O200K_BASE` (7 alternatives) | GPT-4o, GPT-4.1, GPT-5, o1/o3/o4, gpt-oss-20b/120b |
| `PAT_STR_CL100K_BASE` (8 alternatives) | GPT-4, GPT-3.5-turbo, text-embedding-ada-002 / -3-* |
| `PAT_STR_GPT2` (7 alternatives) | GPT-2, davinci-002/003, Codex; also HuggingFace tokenizers' ByteLevel default |

Custom regexes passed to CoreBPE::new are routed to fancy_regex unchanged.

Headline result

End-to-end encode_ordinary throughput on a base Apple M4, single-threaded, comparing vanilla PyPI tiktoken 0.13.0 with this patched fork. The size-weighted aggregate across 104 MiB of corpus improves by 56%:

| File | Vanilla (MiB/s) | Lexer (MiB/s) | Speedup |
| --- | --- | --- | --- |
| `synthetic_numeric` (1 MB) | 16.4 | 66.9 | 4.08× |
| `code_python` (1 MB) | 27.8 | 89.6 | 3.22× |
| `code_python` (15 MB) | 26.0 | 81.7 | 3.14× |
| `wiki_en` (15 MB) | 31.1 | 76.9 | 2.47× |
| `flores200` (1 MB) | 24.4 | 33.4 | 1.37× |
| `flores200` (15 MB) | 23.2 | 31.5 | 1.36× |
| `wiki_zh` (1 MB) | 18.5 | 23.1 | 1.25× |
| `wiki_zh` (15 MB) | 18.3 | 22.7 | 1.24× |
| Aggregate (size-weighted) | 23.8 | 37.2 | 1.56× |

Each measurement uses 3 rounds × 5 iterations per round, alternating between vanilla and patched runs to control for CPU thermal drift, and is taken via Python enc.encode_ordinary(text) on o200k_base. Wins are larger on workloads where the pretokenization regex is the dominant cost (English text, code, numeric content) and smaller on CJK Wikipedia, where the BPE merger does more of the work.
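For readers who want to reproduce the alternating protocol, here is a minimal Rust sketch. It is not the harness behind the numbers above (those were measured from Python). `alternate_bench`, `median`, and the choice to alternate in blocks of `iters` runs per round are assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Runs `a` and `b` in alternating blocks: in each round, `a` is timed
/// `iters` times, then `b` is timed `iters` times. Interleaving the two
/// keeps slow thermal drift from landing entirely on one implementation.
fn alternate_bench<A, B>(
    rounds: usize,
    iters: usize,
    mut a: A,
    mut b: B,
) -> (Vec<Duration>, Vec<Duration>)
where
    A: FnMut(),
    B: FnMut(),
{
    let mut a_times = Vec::with_capacity(rounds * iters);
    let mut b_times = Vec::with_capacity(rounds * iters);
    for _ in 0..rounds {
        for _ in 0..iters {
            let start = Instant::now();
            a();
            a_times.push(start.elapsed());
        }
        for _ in 0..iters {
            let start = Instant::now();
            b();
            b_times.push(start.elapsed());
        }
    }
    (a_times, b_times)
}

/// Median of a non-empty sample set (sorts in place; panics if empty).
fn median(samples: &mut [Duration]) -> Duration {
    samples.sort();
    samples[samples.len() / 2]
}
```

With `rounds = 3` and `iters = 5`, each implementation contributes 15 samples. Reporting the median is one reasonable way to summarize them, though the PR does not state which statistic was used.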

How the lexer works

Each canonical pattern's lexer is a forward-scanning state machine with one alt-handler function per regex alternative. The alts are tried in regex order; the first one that matches consumes the input and the loop advances.
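The ordered-alternative loop can be illustrated with a toy lexer. This is a sketch under simplifying assumptions, not the code in src/lexer.rs: it covers a made-up four-alternative pattern (roughly ` ?letters+ | digits{1,3} | whitespace+ | any char`), and uses `char::is_alphabetic`, `char::is_numeric`, and `char::is_whitespace` as stand-ins for the real character classes. The handler names and `split` are hypothetical.

```rust
/// Upper bound on a digit run, mirroring a `{1,3}` quantifier.
const MAX_DIGIT_RUN: usize = 3;

/// An alt handler inspects the remaining input and returns the number of
/// bytes matched at its start, or `None` if the alternative does not apply.
type AltHandler = fn(&str) -> Option<usize>;

/// Toy alt 1: an optional leading space followed by one or more letters.
fn alt_word(rest: &str) -> Option<usize> {
    let skip = usize::from(rest.starts_with(' '));
    let len: usize = rest[skip..]
        .chars()
        .take_while(|c| c.is_alphabetic())
        .map(char::len_utf8)
        .sum();
    (len > 0).then_some(skip + len)
}

/// Toy alt 2: one to `MAX_DIGIT_RUN` numeric characters.
fn alt_digits(rest: &str) -> Option<usize> {
    let len: usize = rest
        .chars()
        .take_while(|c| c.is_numeric())
        .take(MAX_DIGIT_RUN)
        .map(char::len_utf8)
        .sum();
    (len > 0).then_some(len)
}

/// Toy alt 3: one or more whitespace characters.
fn alt_space(rest: &str) -> Option<usize> {
    let len: usize = rest
        .chars()
        .take_while(|c| c.is_whitespace())
        .map(char::len_utf8)
        .sum();
    (len > 0).then_some(len)
}

/// Toy alt 4: any single character, so the loop always advances.
fn alt_any(rest: &str) -> Option<usize> {
    rest.chars().next().map(char::len_utf8)
}

/// Handlers in the same order as the alternatives of the toy pattern.
const ALTS: [AltHandler; 4] = [alt_word, alt_digits, alt_space, alt_any];

/// Forward scan: at each position the first matching alternative wins.
fn split(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        let rest = &text[pos..];
        let len = ALTS
            .iter()
            .find_map(|alt| alt(rest))
            .expect("alt_any matches any non-empty input");
        spans.push((pos, pos + len));
        pos += len;
    }
    spans
}
```

On the input `"hi 12345 x"` this yields the byte spans `(0,2)`, `(2,3)`, `(3,6)`, `(6,8)`, `(8,10)`: the digit run is cut at three characters, and the final space is absorbed by the word alternative because it is tried first.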

Character class checks (\p{L}, \p{N}, \s, the case-aware ones) use a precomputed [u8; 0x110000] Unicode bitmap (~1.1 MB, built once via std::sync::LazyLock), so non-ASCII classification is one array index plus a bit test, same cost as ASCII. The ASCII portion is a separate 128-entry const filled at compile time.
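A table of this shape might look like the sketch below. Several details are assumptions: the flag layout (`IS_LETTER`, `IS_NUMBER`, `IS_SPACE`), the use of a heap-allocated boxed slice rather than a fixed-size array, and the standard-library predicates, which only approximate `\p{L}`, `\p{N}`, and `\s` (the PR itself uses the unicode-properties crate).

```rust
use std::sync::LazyLock;

// Hypothetical flag layout: one byte of class bits per code point.
const IS_LETTER: u8 = 1 << 0;
const IS_NUMBER: u8 = 1 << 1;
const IS_SPACE: u8 = 1 << 2;

/// Number of Unicode code points (U+0000 through U+10FFFF).
const CODE_POINTS: usize = 0x110000;

/// Built once on first access, then read-only and shared across threads.
static CLASS_TABLE: LazyLock<Box<[u8]>> = LazyLock::new(|| {
    let mut table = vec![0u8; CODE_POINTS].into_boxed_slice();
    for cp in 0..CODE_POINTS as u32 {
        // Surrogate code points are not valid `char`s; their entries stay 0.
        let Some(c) = char::from_u32(cp) else { continue };
        let mut bits = 0u8;
        if c.is_alphabetic() {
            bits |= IS_LETTER; // stand-in for \p{L}
        }
        if c.is_numeric() {
            bits |= IS_NUMBER; // stand-in for \p{N}
        }
        if c.is_whitespace() {
            bits |= IS_SPACE; // stand-in for \s
        }
        table[cp as usize] = bits;
    }
    table
});

/// One array index plus one bit test, for ASCII and non-ASCII alike.
#[inline]
fn has_class(c: char, mask: u8) -> bool {
    (CLASS_TABLE[c as usize] & mask) != 0
}
```

Because every `char` is at most U+10FFFF, the index is always in bounds, and the lookup cost does not depend on the script of the input.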

The o200k_base alt-1 handler does explicit backtracking on the greedy [Lu|Lt|Lm|Lo|M]*[Ll|Lm|Lo|M]+ body (the lowercase tail is mandatory; if the greedy uppercase scan consumed too much, the loop shrinks one char at a time). This mirrors fancy_regex's quantifier-then-backtrack semantics. The cl100k and gpt-2 patterns use possessive ++ quantifiers in their regex strings, so no backtracking is needed in their lexers.
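The shrink-on-failure loop can be sketched as follows. This is an illustration under stated assumptions rather than the PR's handler: it omits the optional leading non-letter character of the real alternative, and it approximates the two overlapping classes with standard-library predicates (uppercase or uncased letters for the head, lowercase or uncased letters for the tail). `match_cased_word` and the helper names are hypothetical.

```rust
/// Letters with no case (e.g. CJK ideographs) belong to both classes,
/// which is what makes backtracking necessary.
fn is_uncased_letter(c: char) -> bool {
    c.is_alphabetic() && !c.is_uppercase() && !c.is_lowercase()
}

/// Approximation of the head class (uppercase-like or uncased).
fn in_head(c: char) -> bool {
    c.is_uppercase() || is_uncased_letter(c)
}

/// Approximation of the tail class (lowercase-like or uncased).
fn in_tail(c: char) -> bool {
    c.is_lowercase() || is_uncased_letter(c)
}

/// Matches `head* tail+` at the start of `text` and returns the match
/// length in bytes, or `None` if no split of the input satisfies it.
fn match_cased_word(text: &str) -> Option<usize> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    // Byte offset of the i-th char, or the end of the text.
    let byte_end = |i: usize| chars.get(i).map_or(text.len(), |&(off, _)| off);

    // Greedy head scan.
    let mut head = 0;
    while head < chars.len() && in_head(chars[head].1) {
        head += 1;
    }

    loop {
        // Greedy tail scan starting where the head currently ends.
        let mut end = head;
        while end < chars.len() && in_tail(chars[end].1) {
            end += 1;
        }
        if end > head {
            return Some(byte_end(end)); // mandatory tail satisfied
        }
        if head == 0 {
            return None; // nothing left to give back
        }
        head -= 1; // the head consumed too much: shrink by one char and retry
    }
}
```

For `"A中"` the greedy head scan first takes both characters, leaving nothing for the mandatory tail. Giving back one character lets `中` serve as the tail, so the whole 4-byte string matches. For `"ABC"` no amount of shrinking helps and the function returns `None`, which in the real lexer would mean moving on to the next alternative.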

Dispatch: at CoreBPE::new_internal, the pattern: &str is matched against three constant strings (PAT_STR_O200K_BASE, PAT_STR_CL100K_BASE, PAT_STR_GPT2). A match stores Some(LexerKind::*) on the CoreBPE struct; unknown patterns keep lexer_kind: None. The internal pretok_splits helper then dispatches via a match, returning either the appropriate lexer's iterator or fancy_regex::find_iter mapped to (start, end) tuples.
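A minimal sketch of the exact-string dispatch is shown below. The three constants here are short placeholder strings, not the real regexes, and `select_lexer` and `CoreBpeSketch` are hypothetical names; only `LexerKind`, `lexer_kind`, and the `PAT_STR_*` names come from the description above.

```rust
// Placeholder values for illustration; the real constants hold the full
// canonical regex strings.
const PAT_STR_O200K_BASE: &str = "<o200k_base pattern>";
const PAT_STR_CL100K_BASE: &str = "<cl100k_base pattern>";
const PAT_STR_GPT2: &str = "<gpt-2 pattern>";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LexerKind {
    O200kBase,
    Cl100kBase,
    Gpt2,
}

/// Exact string comparison: anything that is not byte-for-byte one of the
/// canonical patterns gets `None` and stays on the regex path.
fn select_lexer(pattern: &str) -> Option<LexerKind> {
    match pattern {
        PAT_STR_O200K_BASE => Some(LexerKind::O200kBase),
        PAT_STR_CL100K_BASE => Some(LexerKind::Cl100kBase),
        PAT_STR_GPT2 => Some(LexerKind::Gpt2),
        _ => None,
    }
}

/// Hypothetical stand-in for the relevant part of `CoreBPE`.
struct CoreBpeSketch {
    lexer_kind: Option<LexerKind>,
}

impl CoreBpeSketch {
    fn new(pattern: &str) -> Self {
        Self { lexer_kind: select_lexer(pattern) }
    }

    /// Name of the pretokenization path this instance would take.
    fn path(&self) -> &'static str {
        match self.lexer_kind {
            Some(_) => "lexer",
            None => "fancy_regex",
        }
    }
}
```

Because the comparison is exact, even a one-character variation of a canonical pattern selects the regex path, which is the fall-through behavior the design relies on.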

Correctness

Three layers of byte-equality verification:

1. Rust path-equivalence tests (added in this PR, in src/lexer.rs lexer_regex_equivalence module): for each pattern, assert lexer::split*(text) produces the same (start, end) tuples as fancy_regex::Regex::find_iter across a curated fixture set (whitespace edges, contractions case-sensitive vs case-insensitive, numeric {1,3} cap, greedy backtracking on case boundaries, mixed scripts, emoji / non-BMP, code-like punctuation, apostrophes that aren't contractions). 3 new tests, all pass alongside the 2 pre-existing tests.
2. Python pytest suite: pytest tests/ passes 33/33 (hypothesis-based property tests on r50k_base + cl100k_base, roundtrip tests, batch encoding, catastrophic-repetition stress, special tokens, pickling).
3. Full-corpus byte-equality: across all four built-in encodings (r50k_base, p50k_base, cl100k_base, o200k_base) on ~238 MiB of text (32 files: multilingual Wikipedia, code samples, FLORES-200, procedural numeric / repetition / long-piece stress). Token sequence sha256 matches between vanilla PyPI tiktoken and this patched fork on all 128 (encoding × file) pairs.

Disclosures

- ~1.1 MB static Unicode classification bitmap, built once per process at first lex call (lazy via LazyLock), shared across all tokenizer instances and threads. Read-only after construction. The cost is negligible for typical server-side use.
- ~256-byte ASCII classification table, built at compile time as a const. No runtime cost.
- Custom regexes still use fancy_regex. The lexer is selected only by exact string match on the pattern passed to CoreBPE::new. Unknown patterns are handled unchanged.
- Pattern lock-in: if upstream changes a canonical regex string, the lexer stops matching and falls through to fancy_regex. This is correct by default (no silent divergence), but it means a new pattern release would need a lexer port to keep the speedup. The byte-equality tests catch any drift between the lexer and the current canonical pattern in CI.

Backward compatibility

- No public API changes. CoreBPE::new signature and behavior are unchanged for users.
- Internal CoreBPE gains an optional lexer_kind field (private to module).
- One new dependency: unicode-properties = "0.1". Used only while constructing the LazyLock bitmap (once per process, at runtime, not at compile time); pure-Rust, no transitive deps.

Testing

- Rust: cargo test passes 5/5 (2 pre-existing + 3 new lexer_regex_equivalence tests covering both regex behaviors per pattern).
- Python: pytest tests/ passes 33/33.
- Byte-equality: ~238 MiB curated corpus across 4 encodings yields identical token sequences vs vanilla PyPI tiktoken (sha256 hash match on all 128 pairs).

Notes on the perf delta

Per-corpus single-thread speedup numbers for cl100k_base / gpt-2 (16 corpora × 3 patterns) are not shown here; the headline focuses on o200k_base since that's where most current production traffic lives. Happy to provide the full per-corpus / per-pattern / multi-thread breakdown if useful (we ran an N=5 scaling sweep at 1/2/4/8/10 threads across all three patterns).

Multi-thread scaling

encode_ordinary_batch (rayon over the same 8-file batch), same Apple M4 base, same protocol of alternating between vanilla and patched runs:

| Threads | Vanilla (MiB/s) | Lexer (MiB/s) | Speedup |
| --- | --- | --- | --- |
| 1 | 19.7 | 35.4 | 1.80× |
| 2 | 33.4 | 66.0 | 1.98× (peak) |
| 4 | 46.2 | 74.2 | 1.61× |
| 8 | 44.1 | 74.7 | 1.69× |
| 10 | 46.5 | 74.0 | 1.59× |

Both implementations plateau around 4 threads. Once pretokenization is no longer the bottleneck, the merger (sequential per piece by construction; HashMap-bound for 3+ byte spans) becomes the new ceiling, and each implementation flattens at its own level (~75 MiB/s lexer, ~45 MiB/s vanilla). The lexer holds no shared mutable state, so it adds no contention, but it cannot widen the gap further once both are in the merger-bound regime.

Two notes on this plateau:

- The base Apple M4 used here has 4 P-cores + 6 E-cores. Going past 4 threads adds E-cores, which run at roughly half the per-thread throughput of P-cores. Some of the plateau is likely this heterogeneous-core artifact; a homogeneous Linux server would probably show different multi-thread scaling.
- The lexer's larger structural win applies to the pretokenization layer specifically: it avoids fancy_regex's internal scratch-buffer contention, which is discussed in this crate's source comments and in issue #530 (reproducible multi-second tail stalls in encode_ordinary_batch on a 32-core box with o200k_base, num_threads=8). We measured that layer separately in tiktoken-rs's scaling sweep (5.7× single-thread to 13.9× at 10 threads); for end-to-end encode_ordinary_batch on this codebase, the merger ceiling caps the visible advantage at ~1.6-2× across thread counts.

Restores the original encode() behaviour of returning Err(EncodeError) on internal regex errors, which our previous refactor through pretok_splits had turned into a panic. encode_ordinary keeps the panic-on-error semantics of the original upstream mat.unwrap().
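The two error policies can be contrasted with a small sketch. Everything here is a hypothetical stand-in: `RegexError`, the toy `pretok_splits` (which splits on spaces and treats a NUL byte as an arbitrary failure trigger), and the `*_like` functions only mimic the shape of the real methods.

```rust
/// Hypothetical error type standing in for an internal regex failure.
#[derive(Debug, PartialEq)]
struct RegexError(&'static str);

type Split = Result<(usize, usize), RegexError>;

/// Toy splitter: yields the byte span of each space-terminated piece and
/// reports an error for any piece containing a NUL byte.
fn pretok_splits(text: &str) -> impl Iterator<Item = Split> + '_ {
    text.split_inclusive(' ').scan(0usize, |pos, piece| {
        let start = *pos;
        *pos += piece.len();
        Some(if piece.contains('\0') {
            Err(RegexError("simulated regex failure"))
        } else {
            Ok((start, *pos))
        })
    })
}

/// `encode`-style policy: the first error is returned to the caller.
fn encode_like(text: &str) -> Result<Vec<(usize, usize)>, RegexError> {
    pretok_splits(text).collect()
}

/// `encode_ordinary`-style policy: an error panics, as with `unwrap()`.
fn encode_ordinary_like(text: &str) -> Vec<(usize, usize)> {
    pretok_splits(text).map(|split| split.unwrap()).collect()
}
```

Collecting an iterator of `Result` values into `Result<Vec<_>, _>` stops at the first `Err`, which gives the propagate-to-caller behavior without an explicit loop.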

augustasm added 3 commits May 22, 2026 19:01

Add lexer fast path for o200k_base, cl100k_base, and gpt-2 pretokeniz…

6530ce9

…ation patterns; custom regexes fall back to fancy_regex

Lexer: name the two magic numbers (ASCII_BOUNDARY, MAX_DIGIT_RUN)

71a0481

augustasio wants to merge 3 commits into openai:main from augustasio:lexer-pretokenization
