Add lexer fast path for o200k_base, cl100k_base, and gpt-2 pretokenization patterns; custom regexes fall back to fancy_regex #552

Open
augustasio wants to merge 3 commits into openai:main from augustasio:lexer-pretokenization

Conversation

augustasio commented May 22, 2026

Summary

Replaces fancy_regex::Regex::find_iter with a hand-coded state-machine lexer for the three canonical OpenAI BPE pretokenization patterns. At CoreBPE::new the pattern string is matched against the known canonical strings; if it matches, the lexer is used. Otherwise the existing fancy_regex code path is used unchanged.

Token output is byte-identical to vanilla. The optimization changes only how pretokenization splits are produced, not which splits.

Patterns covered

| Pattern string | Alternatives | Used by |
|---|---|---|
| PAT_STR_O200K_BASE | 7 | GPT-4o, GPT-4.1, GPT-5, o1/o3/o4, gpt-oss-20b/120b |
| PAT_STR_CL100K_BASE | 8 | GPT-4, GPT-3.5-turbo, text-embedding-ada-002 / -3-* |
| PAT_STR_GPT2 | 7 | GPT-2, text-davinci-002/003, Codex; also HuggingFace tokenizers' ByteLevel default |

Custom regexes passed to CoreBPE::new are routed to fancy_regex unchanged.

Headline result

End-to-end encode_ordinary throughput on Apple M4 base, single-thread, vanilla PyPI tiktoken 0.13.0 vs this patched fork. +56% size-weighted aggregate across 104 MiB of corpus:

| File | Vanilla (MiB/s) | Lexer (MiB/s) | Speedup |
|---|---|---|---|
| synthetic_numeric (1 MB) | 16.4 | 66.9 | 4.08× |
| code_python (1 MB) | 27.8 | 89.6 | 3.22× |
| code_python (15 MB) | 26.0 | 81.7 | 3.14× |
| wiki_en (15 MB) | 31.1 | 76.9 | 2.47× |
| flores200 (1 MB) | 24.4 | 33.4 | 1.37× |
| flores200 (15 MB) | 23.2 | 31.5 | 1.36× |
| wiki_zh (1 MB) | 18.5 | 23.1 | 1.25× |
| wiki_zh (15 MB) | 18.3 | 22.7 | 1.24× |
| Aggregate (size-weighted) | 23.8 | 37.2 | 1.56× |

3 rounds × 5 iterations per round, alternating between vanilla and patched runs (to control for thermal drift on the CPU). Measured via Python enc.encode_ordinary(text) on o200k_base. Larger wins on workloads where the pretokenization regex is the dominant cost (English text, code, numeric content); smaller wins on CJK Wikipedia where the merger does more of the work.

How the lexer works

Each canonical pattern's lexer is a forward-scanning state machine with one alt-handler function per regex alternative. The alts are tried in regex order; the first one that matches consumes the input and the loop advances.
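A minimal sketch of that loop shape, for illustration only. It does not implement any of the three canonical patterns: it uses a hypothetical ASCII-only toy pattern, ` ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+`, and all names (`AltHandler`, `ALTS`, `split`, the `alt_*` functions) are assumptions rather than identifiers from this PR.

```rust
/// Byte span `[start, end)` of one pretoken.
type Span = (usize, usize);

/// One handler per regex alternative: returns the end offset of the match
/// starting at `start`, or `None` if this alternative does not match here.
type AltHandler = fn(&[u8], usize) -> Option<usize>;

/// Advance `i` over a run of bytes satisfying `pred`; return the new offset.
fn run(s: &[u8], mut i: usize, pred: fn(&u8) -> bool) -> usize {
    while i < s.len() && pred(&s[i]) {
        i += 1;
    }
    i
}

/// Shared shape of alts 1-3: an optional leading space, then 1+ `pred` bytes.
fn opt_space_then(s: &[u8], start: usize, pred: fn(&u8) -> bool) -> Option<usize> {
    let body = if s.get(start) == Some(&b' ') { start + 1 } else { start };
    let end = run(s, body, pred);
    (end > body).then_some(end)
}

fn is_other(b: &u8) -> bool {
    !b.is_ascii_alphanumeric() && !b.is_ascii_whitespace()
}

fn alt_letters(s: &[u8], i: usize) -> Option<usize> {
    opt_space_then(s, i, u8::is_ascii_alphabetic)
}
fn alt_digits(s: &[u8], i: usize) -> Option<usize> {
    opt_space_then(s, i, u8::is_ascii_digit)
}
fn alt_other(s: &[u8], i: usize) -> Option<usize> {
    opt_space_then(s, i, is_other)
}
fn alt_space(s: &[u8], i: usize) -> Option<usize> {
    let end = run(s, i, u8::is_ascii_whitespace);
    (end > i).then_some(end)
}

/// Handlers listed in the same order as the alternatives in the toy regex.
const ALTS: [AltHandler; 4] = [alt_letters, alt_digits, alt_other, alt_space];

/// Forward scan: at each position the first matching alternative wins.
pub fn split(text: &str) -> Vec<Span> {
    let s = text.as_bytes();
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < s.len() {
        let end = ALTS
            .iter()
            .find_map(|alt| alt(s, pos))
            .expect("the four toy alternatives cover every byte");
        out.push((pos, end));
        pos = end;
    }
    out
}
```

For example, `split("hi 42 !!  x")` yields the spans `(0, 2)`, `(2, 5)`, `(5, 8)`, `(8, 10)`, `(10, 11)`. The real patterns add lookahead (`\s+(?!\S)`) and Unicode classes, which this toy deliberately omits.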

Character class checks (\p{L}, \p{N}, \s, the case-aware ones) use a precomputed [u8; 0x110000] Unicode bitmap (~1.1 MB, built once via std::sync::LazyLock), so non-ASCII classification is one array index plus a bit test, same cost as ASCII. The ASCII portion is a separate 128-entry const filled at compile time.
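A rough sketch of such a table, under stated assumptions: it uses `char::is_alphabetic`, `char::is_numeric`, and `char::is_whitespace` from the standard library as stand-ins for the classes the PR derives via unicode-properties (`is_alphabetic` is close to, but not identical to, `\p{L}`), and it heap-allocates the table as a boxed slice instead of a `[u8; 0x110000]` array. The flag names and `has_class` are hypothetical.

```rust
use std::sync::LazyLock;

// One flag bit per character class (names are illustrative).
const LETTER: u8 = 1 << 0;
const NUMBER: u8 = 1 << 1;
const SPACE: u8 = 1 << 2;

/// ASCII portion, filled at compile time.
const ASCII_CLASS: [u8; 128] = {
    let mut t = [0u8; 128];
    let mut i = 0;
    while i < 128 {
        let b = i as u8;
        if b.is_ascii_alphabetic() {
            t[i] |= LETTER;
        }
        if b.is_ascii_digit() {
            t[i] |= NUMBER;
        }
        // ASCII members of regex `\s`: space, \t, \n, \x0B, \x0C, \r.
        if matches!(b, b' ' | b'\t' | b'\n' | 0x0B | 0x0C | b'\r') {
            t[i] |= SPACE;
        }
        i += 1;
    }
    t
};

/// One byte of flags per code point (0x110000 entries, about 1.1 MB),
/// built on first use and read-only afterwards.
static UNICODE_CLASS: LazyLock<Box<[u8]>> = LazyLock::new(|| {
    let mut t = vec![0u8; 0x110000];
    for cp in 0..0x110000u32 {
        // `char::from_u32` returns None for surrogates; those entries stay 0.
        if let Some(c) = char::from_u32(cp) {
            let mut flags = 0u8;
            if c.is_alphabetic() {
                flags |= LETTER; // stand-in for \p{L}
            }
            if c.is_numeric() {
                flags |= NUMBER; // stand-in for \p{N}
            }
            if c.is_whitespace() {
                flags |= SPACE; // stand-in for \s
            }
            t[cp as usize] = flags;
        }
    }
    t.into_boxed_slice()
});

/// Classification is one array index plus a bit test on either path.
#[inline]
pub fn has_class(c: char, mask: u8) -> bool {
    let cp = c as usize;
    let flags = if cp < 128 { ASCII_CLASS[cp] } else { UNICODE_CLASS[cp] };
    flags & mask != 0
}
```

With this layout, `has_class('中', LETTER)` and `has_class('7', NUMBER)` cost the same lookup; only the first non-ASCII query pays for building the table.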

The o200k_base alt-1 handler backtracks explicitly on the greedy [\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+ body: the lowercase tail is mandatory, so if the greedy uppercase scan consumed too much, the loop gives back one char at a time until the tail matches. This mirrors fancy_regex's greedy-then-backtrack quantifier semantics. The cl100k and gpt-2 patterns use possessive ++ quantifiers in their regex strings, so no backtracking is needed in their lexers.
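The give-back loop can be sketched as follows. This is an approximation, not the PR's handler: it ignores the optional leading non-letter and the contraction suffix of the real alternative, and it uses hypothetical stand-in classes (`in_upper_set`, `in_lower_set`) built from std case predicates, so caseless letters such as CJK ideographs fall in both sets, which is what makes backtracking necessary.

```rust
/// Stand-in for the first class: any letter that is not lowercase.
/// (Assumption: approximates [\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}].)
fn in_upper_set(c: char) -> bool {
    c.is_alphabetic() && !c.is_lowercase()
}

/// Stand-in for the second class: any letter that is not uppercase.
/// (Assumption: approximates [\p{Ll}\p{Lm}\p{Lo}\p{M}].)
fn in_lower_set(c: char) -> bool {
    c.is_alphabetic() && !c.is_uppercase()
}

/// Match `upper* lower+` at byte offset `start` (must be a char boundary).
/// Returns the end offset of the match, or `None` if no split point works.
pub fn match_upper_star_lower_plus(text: &str, start: usize) -> Option<usize> {
    // Greedy pass: remember every offset at which the head could end.
    // A production lexer would step backwards instead of allocating.
    let mut head_ends = vec![start];
    for (off, c) in text[start..].char_indices() {
        if !in_upper_set(c) {
            break;
        }
        head_ends.push(start + off + c.len_utf8());
    }
    // Try the longest head first, then give back one char at a time
    // until the mandatory tail matches at least one char.
    for &head_end in head_ends.iter().rev() {
        let tail_len: usize = text[head_end..]
            .chars()
            .take_while(|&c| in_lower_set(c))
            .map(char::len_utf8)
            .sum();
        if tail_len > 0 {
            return Some(head_end + tail_len);
        }
    }
    None
}
```

On `"HTTPServer"` the greedy head stops before `e` and no give-back is needed. On `"中文 ok"` the head first swallows both ideographs, the tail finds nothing, and one char is given back so the match ends at byte 6. On `"ABC"` every split point fails, so this alternative does not match and the next one gets its turn.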

Dispatch: at CoreBPE::new_internal, the pattern: &str is matched against three constant strings (PAT_STR_O200K_BASE, PAT_STR_CL100K_BASE, PAT_STR_GPT2). A match stores Some(LexerKind::*) on the CoreBPE struct; unknown patterns keep lexer_kind: None. The internal pretok_splits helper then dispatches via a match, returning either the appropriate lexer's iterator or fancy_regex::find_iter mapped to (start, end) tuples.
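A sketch of the selection step, assuming the shape described above. The constant values below are placeholders, not the real pattern strings, and `select_lexer` and `pretok_path` are hypothetical names; only `LexerKind`, `lexer_kind`, and the three constant names come from this description.

```rust
// Placeholder values: the real constants hold the full canonical regex strings.
const PAT_STR_O200K_BASE: &str = "<o200k_base pattern>";
const PAT_STR_CL100K_BASE: &str = "<cl100k_base pattern>";
const PAT_STR_GPT2: &str = "<gpt-2 pattern>";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LexerKind {
    O200kBase,
    Cl100kBase,
    Gpt2,
}

/// Exact string comparison: any difference, even one byte, means "unknown",
/// so a changed or custom pattern can never be routed to the wrong lexer.
pub fn select_lexer(pattern: &str) -> Option<LexerKind> {
    match pattern {
        PAT_STR_O200K_BASE => Some(LexerKind::O200kBase),
        PAT_STR_CL100K_BASE => Some(LexerKind::Cl100kBase),
        PAT_STR_GPT2 => Some(LexerKind::Gpt2),
        _ => None,
    }
}

/// Which code path a stored `lexer_kind` leads to at split time.
pub fn pretok_path(lexer_kind: Option<LexerKind>) -> &'static str {
    match lexer_kind {
        Some(_) => "lexer",
        None => "fancy_regex",
    }
}
```

Matching on the whole string rather than on the encoding name keeps the decision tied to the regex semantics the lexer was written against.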

Correctness

Three layers of byte-equality verification:

  1. Rust path-equivalence tests (added in this PR, in src/lexer.rs lexer_regex_equivalence module): for each pattern, assert lexer::split*(text) produces the same (start, end) tuples as fancy_regex::Regex::find_iter across a curated fixture set (whitespace edges, contractions case-sensitive vs case-insensitive, numeric {1,3} cap, greedy backtracking on case boundaries, mixed scripts, emoji / non-BMP, code-like punctuation, apostrophes that aren't contractions). 3 new tests, all pass alongside the 2 pre-existing tests.
  2. Python pytest suite: pytest tests/ passes 33/33 (hypothesis-based property tests on r50k_base + cl100k_base, roundtrip tests, batch encoding, catastrophic-repetition stress, special tokens, pickling).
  3. Full-corpus byte-equality: all four built-in encodings (r50k_base, p50k_base, cl100k_base, o200k_base) run over ~238 MiB of text (32 files: multilingual Wikipedia, code samples, FLORES-200, procedural numeric / repetition / long-piece stress). The sha256 of the token sequence matches between vanilla PyPI tiktoken and this patched fork on all 128 (encoding × file) pairs.
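The span comparison in layer 1 can be pictured with a small helper like the one below. It is a hypothetical sketch, not the test code in src/lexer.rs: `first_divergence` is an invented name, and the two inputs stand for the lexer's spans and the `(start, end)` pairs obtained from fancy_regex matches.

```rust
/// Byte span `[start, end)` of one pretoken.
type Span = (usize, usize);

/// Walk two span sequences in lockstep and report the first position where
/// they differ: the index plus what each side produced there (`None` means
/// that side ended early). Returns `None` when the sequences are identical.
pub fn first_divergence(
    lexer: impl IntoIterator<Item = Span>,
    reference: impl IntoIterator<Item = Span>,
) -> Option<(usize, Option<Span>, Option<Span>)> {
    let mut a = lexer.into_iter();
    let mut b = reference.into_iter();
    let mut index = 0;
    loop {
        match (a.next(), b.next()) {
            (None, None) => return None,
            (x, y) if x == y => index += 1,
            (x, y) => return Some((index, x, y)),
        }
    }
}
```

A fixture test would then assert `first_divergence(lexer_spans, regex_spans).is_none()` for every input; reporting the index of the first mismatch makes a failing fixture much easier to localize than a bare inequality.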

Disclosures

  • ~1.1 MB static Unicode classification bitmap, built once per process at first lex call (lazy via LazyLock), shared across all tokenizer instances and threads. Read-only after construction. The one-time cost is negligible for typical server-side use.
  • ~256-byte ASCII classification table, built at compile time as a const. No runtime cost.
  • Custom regexes still use fancy_regex. The lexer is opt-in via exact string match on the pattern passed to CoreBPE::new. Unknown patterns are handled unchanged.
  • Pattern lock-in: if upstream changes a canonical regex string, the lexer stops matching and falls through to fancy_regex. This is correct-by-default (no silent divergence), but means a new pattern release would need a lexer port to keep the speedup. The byte-equality tests catch any drift between lexer and current canonical pattern in CI.

Backward compatibility

  • No public API changes. CoreBPE::new signature and behavior are unchanged for users.
  • Internal CoreBPE gains an optional lexer_kind field (private to module).
  • One new dependency: unicode-properties = "0.1". Used only while the LazyLock bitmap is being constructed (at runtime, on first use); pure-Rust, no transitive deps.

Testing

  • Rust: cargo test passes 5/5 (2 pre-existing + 3 new lexer_regex_equivalence tests covering both regex behaviors per pattern).
  • Python: pytest tests/ passes 33/33.
  • Byte-equality: ~238 MiB curated corpus across 4 encodings yields identical token sequences vs vanilla PyPI tiktoken (sha256 hash match on all 128 pairs).

Notes on the perf delta

Per-corpus single-thread speedup numbers for cl100k_base / gpt-2 (16 corpora × 3 patterns) are not shown here; the headline focuses on o200k_base since that's where most current production traffic lives. Happy to provide the full per-corpus / per-pattern / multi-thread breakdown if useful (we ran an N=5 scaling sweep at 1/2/4/8/10 threads across all three patterns).

Multi-thread scaling

encode_ordinary_batch (rayon over the same 8-file batch), same Apple M4 base, same protocol of alternating between vanilla and patched runs:

| Threads | Vanilla (MiB/s) | Lexer (MiB/s) | Speedup |
|---|---|---|---|
| 1 | 19.7 | 35.4 | 1.80× |
| 2 | 33.4 | 66.0 | 1.98× (peak) |
| 4 | 46.2 | 74.2 | 1.61× |
| 8 | 44.1 | 74.7 | 1.69× |
| 10 | 46.5 | 74.0 | 1.59× |

Both implementations plateau around 4 threads. Once pretokenization is no longer the bottleneck, the merger (sequential per piece by construction; HashMap-bound for 3+ byte spans) becomes the new ceiling, and each implementation flattens there (~75 MiB/s lexer, ~45 MiB/s vanilla). The lexer holds no shared mutable state, so it adds no contention, but it cannot widen the gap further once throughput is merger-bound.

Two notes on this plateau:

  • The Apple M4 base used here has 4 P-cores + 6 E-cores. Going past 4 threads adds E-cores, which run at roughly half the per-thread throughput of P-cores. Some of the plateau is likely this heterogeneous-core artifact; a homogeneous Linux server would probably show different multi-thread scaling.
  • The lexer's larger structural win applies to the pretokenization layer specifically: it avoids the fancy_regex internal scratch-buffer contention discussed in this crate's source comments and in #530 ("encode_ordinary_batch: reproducible multi-second tail stalls on 32-core box (o200k_base, num_threads=8)"). We measured that layer separately in tiktoken-rs's scaling sweep (5.7× single-thread to 13.9× at 10 threads); for end-to-end encode_ordinary_batch on this codebase, the merger ceiling caps the visible advantage at ~1.6-2× across thread counts.

augustasm added 3 commits May 22, 2026 19:01
…ation patterns; custom regexes fall back to fancy_regex
Restores the original encode() behaviour of returning Err(EncodeError) on
internal regex errors, which our previous refactor through pretok_splits had
turned into a panic. encode_ordinary keeps the panic-on-error semantics of
the original upstream mat.unwrap().