Add 2-byte pair lookup table to skip hashmap for initial BPE merge scan #551
Open
augustasio wants to merge 2 commits into
Covers both code paths: linear `_byte_pair_merge` (pieces < 100 bytes) and heap-based `_byte_pair_merge_large` (pieces >= 100 bytes), with edge cases straddling the 100-byte cutoff.
Summary
Adds a precomputed 2-byte-pair-to-rank table (`Box<[Rank; 65536]>`, ~256 KB) on `CoreBPE`, used by the encoding methods to skip the hashmap lookup for the hot initial adjacent-pair scan in both `_byte_pair_merge` (linear, pieces < 100 bytes) and `_byte_pair_merge_large` (heap-based, pieces ≥ 100 bytes). Subsequent merges (3+ byte spans) still go through the encoder hashmap; those keys don't fit in a `u16` anyway.
Token output is byte-identical to vanilla. The optimization changes only how the initial 2-byte pair scan is implemented.
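As a rough illustration of the idea in the summary, the sketch below builds such a table from an encoder map and uses it for the initial adjacent-pair scan. It is not the code in this PR: the names `build_pair_table`, `pair_rank`, `initial_ranks`, and `NO_PAIR` are hypothetical, `Rank` is assumed to be `u32` (which matches the ~256 KB figure), and the index layout (first byte in the high 8 bits) is an assumption.

```rust
use std::collections::HashMap;
use std::convert::TryInto;

// Assumption: ranks are 32-bit, so 65,536 entries occupy 256 KiB.
type Rank = u32;

// Sentinel for "this 2-byte pair is not in the vocab" (hypothetical name).
const NO_PAIR: Rank = Rank::MAX;

/// Build a 65,536-entry table indexed by (first_byte << 8) | second_byte.
/// One pass over the encoder; only keys of exactly two bytes are recorded.
fn build_pair_table(encoder: &HashMap<Vec<u8>, Rank>) -> Box<[Rank; 65536]> {
    // Allocate through a Vec so the array is created on the heap, not the stack.
    let mut table: Box<[Rank; 65536]> = vec![NO_PAIR; 65536]
        .into_boxed_slice()
        .try_into()
        .expect("vector length is exactly 65536");
    for (bytes, &rank) in encoder {
        if let [a, b] = bytes.as_slice() {
            table[((*a as usize) << 8) | (*b as usize)] = rank;
        }
    }
    table
}

/// Rank of the pair (a, b), or NO_PAIR when the pair is not a token.
fn pair_rank(table: &[Rank; 65536], a: u8, b: u8) -> Rank {
    table[((a as usize) << 8) | (b as usize)]
}

/// Initial scan: one rank per adjacent byte pair, with no hashmap lookups.
fn initial_ranks(table: &[Rank; 65536], piece: &[u8]) -> Vec<Rank> {
    piece
        .windows(2)
        .map(|w| pair_rank(table, w[0], w[1]))
        .collect()
}
```

Under these assumptions, `pair_rank` returns the same value as `encoder.get(&[a, b][..]).copied().unwrap_or(Rank::MAX)` for every byte pair, which is why swapping the lookup should leave the merge order, and so the token output, unchanged.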
Headline result
+6.2% size-weighted aggregate `encode_ordinary` throughput across 104 MiB of corpus on this codebase. Biggest wins on CJK Wikipedia (+8.6% to +13.8%) and FLORES multilingual (+7.8% to +7.9%), where the merger dominates encode time. Slight regression on dense numeric content (-1.3%) and flat to slightly negative on Python source (-0.8% to -3.0%).
Apple M4 base (4P + 6E cores), single-thread `encode_ordinary` via the Python wrapper, 3 rounds × 5 iterations per round, alternating between vanilla (PyPI tiktoken 0.13.0) and this patched branch to control for thermal drift.
Full per-corpus table in the appendix.
Correctness
Two layers of byte-equality verification:
- Unit tests (`pair_table_equivalence` module): assert `byte_pair_encode` and `byte_pair_encode_with_table` produce identical output for piece lengths spanning the 100-byte linear-vs-heap dispatch cutoff (1, 2, 50, 98, 99, 100, 101, 200, 500, 1000), plus repeated-pair stress tests. 6 new tests, all pass alongside the 2 pre-existing tests.
- Corpus check: four encodings (`r50k_base`, `p50k_base`, `cl100k_base`, `o200k_base`) on ~238 MiB of text (32 files: multilingual Wikipedia, code samples, FLORES-200, procedural numeric / repetition / long-piece stress). Token sequence sha256 matches between vanilla PyPI tiktoken and this patched fork on all 128 (encoding × file) pairs.
Disclosures
- Memory: the table adds ~256 KB per `CoreBPE` instance, built once at construction. Read-only after init, so Linux fork-based serving keeps it copy-on-write shared. For typical use (one tokenizer per process) the cost is invisible; systems that hold many tokenizers per process should be aware of it.
Backward compatibility
- The `pair_table` field on `CoreBPE` is private to the module. External code constructs via `CoreBPE::new` as before; no API change visible to users.
- `_byte_pair_merge` and `_byte_pair_merge_large` gain an optional `pair_table: Option<&[Rank; 65536]>` parameter. Both are private functions; the signature change is internal only.
- `byte_pair_encode` and `byte_pair_split` (public) keep their existing signatures and pass `None` internally, preserving back-compat for external Rust users.
- `byte_pair_encode_with_table` is added (parallel to `byte_pair_encode`) for `CoreBPE` methods to use without changing the public function.
Testing
- `cargo test` passes 8/8 tests (2 pre-existing + 6 new path-equivalence tests covering both the linear and heap merge paths).
- `pytest tests/` passes 33/33 (hypothesis-based property tests on `r50k_base` + `cl100k_base`, roundtrip tests, batch encoding, catastrophic-repetition stress, special tokens, pickling).
Appendix
A. Methodology
Apple M4 base (4P + 6E cores). Single-thread `encode_ordinary` via tiktoken's Python wrapper (a thin PyO3 binding around the Rust `CoreBPE::encode_ordinary`). 3 rounds × 5 iterations per round per file, alternating between vanilla and patched runs to control for thermal drift. Median across iterations within a round, then median across rounds; aggregate is `sum(bytes) / sum(median_times)`.
Vanilla side: PyPI `tiktoken==0.13.0` (installed from the package index in a clean venv).
Patched side: editable install of this branch (`pip install -e .`).
Byte-equality validated by encoding the full ~238 MiB curated corpus with both implementations and verifying token sequence sha256 hashes match across all (encoding × file) pairs.
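The aggregation described above (median within a round, median across rounds, then `sum(bytes) / sum(median_times)`) can be sketched as follows. This is only an illustration of the arithmetic, not the benchmark harness used for this PR; the function names and the `(bytes, rounds)` input shape are hypothetical.

```rust
/// Median of a non-empty slice of timings (seconds).
fn median(values: &[f64]) -> f64 {
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    if v.len() % 2 == 1 {
        v[mid]
    } else {
        (v[mid - 1] + v[mid]) / 2.0
    }
}

/// `rounds[r][i]` is the time of iteration `i` in round `r`.
/// Take the median within each round, then the median across rounds.
fn file_median_time(rounds: &[Vec<f64>]) -> f64 {
    let per_round: Vec<f64> = rounds.iter().map(|r| median(r)).collect();
    median(&per_round)
}

/// Size-weighted aggregate throughput in bytes per second:
/// total bytes divided by the sum of per-file median times.
fn aggregate_throughput(files: &[(u64, Vec<Vec<f64>>)]) -> f64 {
    let total_bytes: u64 = files.iter().map(|(bytes, _)| *bytes).sum();
    let total_time: f64 = files
        .iter()
        .map(|(_, rounds)| file_median_time(rounds))
        .sum();
    total_bytes as f64 / total_time
}
```

Dividing summed bytes by summed times weights each file by its size, so the 15 MB files influence the aggregate more than the 1 MB files; averaging per-file percentages instead would weight every file equally.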
B. Corpora used in this PR's measurements
8 files for the perf bench (selected to cover diverse content), drawn from these corpora:
- `wiki_zh`
- `flores200`
- `code_python`
- `wiki_en`
- `synthetic_numeric`
The byte-equality run used a larger 32-file corpus: 16 files at 1 MB each (the per-file content listed above, plus `code_go`/`javascript`/`rust`, FLORES, `wiki_ar`/`hi`/`ja`/`ru`, `wikitext_103`, plus 4 procedural stress corpora: `stress_doc`, `synthetic_numeric`, `merger_long_pieces`, `merger_repeated_pieces`) and the 15 MB equivalents.
C. Procedural corpora (detail + samples)
The four procedural corpora are generated locally. Each targets a specific algorithmic regime.
- `stress_doc`: mixed-content document (random words, numbers, punctuation) with no paragraph breaks. Probes the long-input regime where `fancy_regex` can hit catastrophic-backtracking patterns.
- `synthetic_numeric`: procedurally generated CSV rows, JSON logs, hex dumps, ISO timestamps, IPv4 addresses, and UUIDs.
- `merger_long_pieces`: random alphanumeric runs of 100 to 500 bytes per piece. Hits the `_byte_pair_merge_large` heap path.
- `merger_repeated_pieces`: ~10 unique pieces of 20 to 50 bytes, sampled with repetition.
D. Per-file throughput (single-thread `encode_ordinary`, `o200k_base`)
Median across 3 rounds × 5 iterations per round, alternating between vanilla and patched runs to control for thermal drift.
Files measured:
- `wiki_zh` (1 MB)
- `flores200` (1 MB)
- `wiki_zh` (15 MB)
- `flores200` (15 MB)
- `wiki_en` (15 MB)
- `code_python` (15 MB)
- `synthetic_numeric` (1 MB)
- `code_python` (1 MB)
Slight regression on dense numeric content is expected: most 2-byte digit pairs aren't in the encoder vocab (`Rank::MAX` in the table), so the table lookup avoids fewer useful hashmap calls. Code is dominated by `byte_pair_encode_with_table` cases that bypass the hot path (pieces matching tokens directly), so the win is small there too.
E. Construction-time overhead (per-encoding)
Measured separately with criterion (100 samples × 5 s) on Apple M4.
Encodings measured: `o200k_base`, `cl100k_base`, `p50k_base`, `r50k_base`.
Scales linearly with vocab size (one pass over the encoder entries). Sub-millisecond on the largest vocab.