perf: speed up batch tokenization and decoding hot paths #561
Open
eonr wants to merge 10 commits into
Summary
This PR moves the expensive Python coordination paths for common LLM tokenization workloads into native Rust-backed paths:
- `encode_batch` and `encode_ordinary_batch`
- `decode_batch` and `decode_bytes_batch`
- `decode_tokens_bytes` and `decode_with_offsets`
- `CoreBPE` and lazily materializing public `mergeable_ranks`

The main pattern is the same across the changes: keep the public Python API and fallback behavior intact, but avoid per-item Python futures, repeated dict materialization, and repeated regex construction in hot paths.
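The shape of that pattern can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's implementation: `native_batch` and `encode_one` are hypothetical stand-ins (one token per UTF-8 byte) for the Rust-backed batch call and the existing per-string encoder, and the guard conditions are simplified.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def encode_one(text: str) -> List[int]:
    """Hypothetical per-string encoder: one token per UTF-8 byte."""
    try:
        data = text.encode("utf-8")
    except UnicodeEncodeError:
        # Lone surrogates are not valid UTF-8; replace them with U+FFFD first.
        fixed = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
        data = fixed.encode("utf-8")
    return list(data)


def native_batch(texts: List[str]) -> List[List[int]]:
    """Hypothetical stand-in for one native call that handles the whole batch.

    Assumption: like a strict str -> UTF-8 conversion at a native boundary,
    it raises UnicodeEncodeError on surrogate-containing input.
    """
    return [list(t.encode("utf-8")) for t in texts]


def encode_batch(texts: Iterable[str], num_threads: int = 8) -> List[List[int]]:
    """Same public signature either way; only the coordination path differs."""
    # Guard: take the fast path only for a plain list of str.
    if isinstance(texts, list) and all(isinstance(t, str) for t in texts):
        try:
            return native_batch(texts)  # one call, no per-item futures
        except (UnicodeEncodeError, TypeError):
            pass  # fall through to the per-string path below
    # Fallback: per-item work spread over a thread pool.
    items = list(texts)
    with ThreadPoolExecutor(num_threads) as pool:
        return list(pool.map(encode_one, items))


if __name__ == "__main__":
    print(encode_batch(["hi", "a"]))     # fast path
    print(encode_batch(iter(["hi"])))    # unsupported iterable: fallback
    print(encode_batch(["a\ud800"]))     # surrogate input: fallback
```

Callers see the same results from both paths; the guard only decides how much Python-level coordination happens per batch.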
Benchmarks
Local macOS arm64 workstation, Python 3.14.2. Each row compares current `main` against this PR using best-of-7 runs unless noted.

[Benchmark table: rows for `encode_batch`, `encode_ordinary_batch`, `decode_batch`, `decode_bytes_batch`, and `decode_with_offsets`; timings not preserved.]

Cold public encoding construction also improves:

[Benchmark table: `get_encoding` for `cl100k_base`, `o200k_base`, and `o200k_harmony`; timings not preserved.]

Benchmark commands:
Validation
    python -m py_compile scripts/benchmark_batch_encoding.py scripts/benchmark_batch_decoding.py scripts/benchmark_special_encoding.py scripts/benchmark_token_decoding.py
    cargo fmt --check
    git diff --check
    cargo test -q
    TIKTOKEN_MAX_EXAMPLES=1000 python -m pytest tests --import-mode=append -q
    check-manifest -v
    python -m build --sdist --wheel

Local results:

- `check-manifest`: version-control and sdist file lists match
- `python -m build --sdist --wheel`: built `tiktoken-0.13.0.tar.gz` and a local macOS arm64 wheel
- `site-packages`, `get_encoding`, `encode_batch`, `decode_batch`, and `decode_with_offsets` passed

Compatibility notes
The native paths are guarded and fall back to the existing per-string Python behavior for unsupported iterables, surrogate-containing inputs, or unexpected type-conversion failures. Public `Encoding._mergeable_ranks` still behaves like a dict when accessed; public encodings just defer materializing that dict until a caller actually needs it.
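Deferred materialization of this kind can be sketched with a cached property. The class below is a hypothetical illustration, not the PR's `Encoding` code: `LazyRanksEncoding`, `load_ranks`, and `materialize_count` are names invented here, and the 256-entry table is a toy stand-in for a real rank table.

```python
from typing import Callable, Dict, Optional


class LazyRanksEncoding:
    """Hypothetical stand-in for an encoding object with a lazy rank table."""

    def __init__(self, load_ranks: Callable[[], Dict[bytes, int]]) -> None:
        self._load_ranks = load_ranks
        self._ranks: Optional[Dict[bytes, int]] = None
        self.materialize_count = 0  # illustration only: counts dict builds

    @property
    def mergeable_ranks(self) -> Dict[bytes, int]:
        # Build the dict on first access, then reuse the same object.
        if self._ranks is None:
            self._ranks = self._load_ranks()
            self.materialize_count += 1
        return self._ranks


def build_toy_ranks() -> Dict[bytes, int]:
    """Toy rank table: every single byte mapped to its own value."""
    return {bytes([i]): i for i in range(256)}


if __name__ == "__main__":
    enc = LazyRanksEncoding(build_toy_ranks)
    print(enc.materialize_count)        # nothing built at construction time
    print(enc.mergeable_ranks[b"a"])    # first access builds the dict
    print(len(enc.mergeable_ranks))     # later accesses reuse it
    print(enc.materialize_count)
```

Construction stays cheap for callers that only encode and decode, while callers that read the attribute still get an ordinary dict.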