Add lexer fast path for o200k_base, cl100k_base, and gpt-2 pretokenization patterns; custom regexes fall back to fancy_regex #552
Open
augustasio wants to merge 3 commits into
added 3 commits
May 22, 2026 19:01
…ation patterns; custom regexes fall back to fancy_regex
Restores the original encode() behaviour of returning Err(EncodeError) on internal regex errors, which our previous refactor through pretok_splits had turned into a panic. encode_ordinary keeps the panic-on-error semantics of the original upstream mat.unwrap().
Summary
Replaces `fancy_regex::Regex::find_iter` with a hand-coded state-machine lexer for the three canonical OpenAI BPE pretokenization patterns. At `CoreBPE::new` the pattern string is compared against the known canonical strings; if it matches one, the lexer is used. Otherwise the existing `fancy_regex` code path is used unchanged.

Token output is byte-identical to vanilla. The optimization changes only how pretokenization splits are produced, not which splits.
Patterns covered
- `PAT_STR_O200K_BASE` (7 alternatives)
- `PAT_STR_CL100K_BASE` (8 alternatives)
- `PAT_STR_GPT2` (7 alternatives)

Custom regexes passed to `CoreBPE::new` are routed to `fancy_regex` unchanged.

Headline result
End-to-end `encode_ordinary` throughput on Apple M4 base, single-thread, vanilla PyPI tiktoken 0.13.0 vs this patched fork. +56% size-weighted aggregate across 104 MiB of corpus. Corpora listed in the results table:

- `synthetic_numeric` (1 MB)
- `code_python` (1 MB)
- `code_python` (15 MB)
- `wiki_en` (15 MB)
- `flores200` (1 MB)
- `flores200` (15 MB)
- `wiki_zh` (1 MB)
- `wiki_zh` (15 MB)

3 rounds × 5 iterations per round, alternating between vanilla and patched runs (to control for thermal drift on the CPU). Measured via Python `enc.encode_ordinary(text)` on o200k_base. Larger wins on workloads where the pretokenization regex is the dominant cost (English text, code, numeric content); smaller wins on CJK Wikipedia where the merger does more of the work.

How the lexer works
Each canonical pattern's lexer is a forward-scanning state machine with one alt-handler function per regex alternative. The alts are tried in regex order; the first one that matches consumes the input and the loop advances.
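To make the loop shape concrete, here is a minimal sketch of a forward-scanning lexer with one handler per alternative, tried in order. It is not the code from this PR: the four toy alternatives (letters, up to three digits, whitespace, any single char) and the names `AltHandler`, `scan_while`, `ALTS`, and `split` are assumptions chosen for illustration, and `char::is_alphabetic` / `char::is_numeric` only approximate `\p{L}` / `\p{N}`.

```rust
/// One handler per alternative (hypothetical signature): given the input and a
/// byte offset, return the end offset of the match, or None if it fails.
type AltHandler = fn(&str, usize) -> Option<usize>;

/// Consume at most `max` chars from `start` while `pred` holds.
/// Returns None when nothing was consumed.
fn scan_while(text: &str, start: usize, max: usize, pred: fn(char) -> bool) -> Option<usize> {
    let mut end = start;
    for (n, c) in text[start..].chars().enumerate() {
        if n == max || !pred(c) {
            break;
        }
        end += c.len_utf8();
    }
    if end > start { Some(end) } else { None }
}

// Toy alternatives, loosely analogous to `\p{L}+`, `\p{N}{1,3}`, `\s+`, and `.`.
fn alt_letters(text: &str, start: usize) -> Option<usize> {
    scan_while(text, start, usize::MAX, char::is_alphabetic)
}

fn alt_digits(text: &str, start: usize) -> Option<usize> {
    scan_while(text, start, 3, char::is_numeric)
}

fn alt_whitespace(text: &str, start: usize) -> Option<usize> {
    scan_while(text, start, usize::MAX, char::is_whitespace)
}

fn alt_any_char(text: &str, start: usize) -> Option<usize> {
    text[start..].chars().next().map(|c| start + c.len_utf8())
}

/// Alternatives in "regex order"; the catch-all goes last.
const ALTS: [AltHandler; 4] = [alt_letters, alt_digits, alt_whitespace, alt_any_char];

/// Forward scan: at each position the first matching alternative consumes
/// its span, then the loop advances to the end of that span.
fn split(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        let end = ALTS
            .iter()
            .find_map(|alt| alt(text, pos))
            .expect("catch-all alternative matches any remaining char");
        spans.push((pos, end));
        pos = end;
    }
    spans
}
```

Calling `split("ab 1234!")` on this toy grammar yields byte spans for `ab`, the space, `123`, `4`, and `!`. Returning `(start, end)` byte offsets rather than string slices keeps the output directly comparable with the tuples a regex `find_iter` would produce.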
Character class checks (`\p{L}`, `\p{N}`, `\s`, the case-aware ones) use a precomputed `[u8; 0x110000]` Unicode bitmap (~1.1 MB, built once via `std::sync::LazyLock`), so non-ASCII classification is one array index plus a bit test, same cost as ASCII. The ASCII portion is a separate 128-entry `const` filled at compile time.

The `o200k_base` alt-1 handler does explicit backtracking on the greedy `[Lu|Lt|Lm|Lo|M]*[Ll|Lm|Lo|M]+` body (the lowercase tail is mandatory; if the greedy uppercase scan consumed too much, the loop shrinks one char at a time). This mirrors fancy_regex's quantifier-then-backtrack semantics. The `cl100k` and `gpt-2` patterns use possessive `++` quantifiers in their regex strings, so no backtracking is needed in their lexers.

Dispatch: at `CoreBPE::new_internal`, the `pattern: &str` is matched against three constant strings (`PAT_STR_O200K_BASE`, `PAT_STR_CL100K_BASE`, `PAT_STR_GPT2`). A match stores `Some(LexerKind::*)` on the `CoreBPE` struct; unknown patterns keep `lexer_kind: None`. The internal `pretok_splits` helper then dispatches via a `match`, returning either the appropriate lexer's iterator or `fancy_regex::find_iter` mapped to `(start, end)` tuples.

Correctness
Three layers of byte-equality verification:
- […] `src/lexer.rs` `lexer_regex_equivalence` module): for each pattern, assert `lexer::split*(text)` produces the same `(start, end)` tuples as `fancy_regex::Regex::find_iter` across a curated fixture set (whitespace edges, contractions case-sensitive vs case-insensitive, numeric `{1,3}` cap, greedy backtracking on case boundaries, mixed scripts, emoji / non-BMP, code-like punctuation, apostrophes that aren't contractions). 3 new tests, all pass alongside the 2 pre-existing tests.
- `pytest tests/` passes 33/33 (hypothesis-based property tests on r50k_base + cl100k_base, roundtrip tests, batch encoding, catastrophic-repetition stress, special tokens, pickling).
- […] `r50k_base`, `p50k_base`, `cl100k_base`, `o200k_base`) on ~238 MiB of text (32 files: multilingual Wikipedia, code samples, FLORES-200, procedural numeric / repetition / long-piece stress). Token sequence sha256 matches between vanilla PyPI tiktoken and this patched fork on all 128 (encoding × file) pairs.

Disclosures
- […] `LazyLock`), shared across all tokenizer instances and threads. Read-only after construction. Invisible cost for typical server-side use.
- […] `const`. No runtime cost.
- […] `fancy_regex`. The lexer is opt-in via exact string match on the pattern passed to `CoreBPE::new`. Unknown patterns are handled unchanged.
- […] `fancy_regex`. This is correct-by-default (no silent divergence), but means a new pattern release would need a lexer port to keep the speedup. The byte-equality tests catch any drift between lexer and current canonical pattern in CI.

Backward compatibility
- `CoreBPE::new` signature and behavior are unchanged for users.
- `CoreBPE` gains an optional `lexer_kind` field (private to module).
- `unicode-properties = "0.1"`. Used only at build time of the `LazyLock` bitmap; pure-Rust, no transitive deps.

Testing
- `cargo test` passes 5/5 (2 pre-existing + 3 new `lexer_regex_equivalence` tests covering both regex behaviors per pattern).
- `pytest tests/` passes 33/33.

Notes on the perf delta
Per-corpus single-thread speedup numbers for cl100k_base / gpt-2 (16 corpora × 3 patterns) are not shown here; the headline focuses on o200k_base since that's where most current production traffic lives. Happy to provide the full per-corpus / per-pattern / multi-thread breakdown if useful (we ran an N=5 scaling sweep at 1/2/4/8/10 threads across all three patterns).
Multi-thread scaling
`encode_ordinary_batch` (rayon over the same 8-file batch), same Apple M4 base, same protocol of alternating between vanilla and patched runs.

Both implementations plateau around 4 threads. Once pretokenization is no longer the bottleneck, the merger (sequential per piece by construction; HashMap-bound for 3+ byte spans) becomes the new ceiling, and both implementations flatten out against it (~75 MiB/s lexer, ~45 MiB/s vanilla). The lexer holds no shared mutable state, so it adds no contention, but it cannot widen the gap further once the run is merger-bound.
Two notes on this plateau:
- `fancy_regex`'s internal scratch-buffer contention (discussed in this crate's source comments and in "encode_ordinary_batch — reproducible multi-second tail stalls on 32-core box (o200k_base, num_threads=8)" #530) applies to the pretokenization layer specifically. We measured that layer separately in tiktoken-rs's scaling sweep (5.7× single-thread to 13.9× at 10 threads); for end-to-end `encode_ordinary_batch` on this codebase, the merger ceiling caps the visible advantage at ~1.6-2× across thread counts.