Repair surrogates in encode_with_unstable (Fixes #541)#553
Open
jbbqqf wants to merge 1 commit into
Open
Conversation
encode() and encode_ordinary() already wrap their Rust call in try/except UnicodeEncodeError and retry against the UTF-16 surrogatepass-repaired text. encode_with_unstable() did not, so it surfaced a raw UnicodeEncodeError on unmatched surrogate pairs and lone surrogates that the other two methods accept. This change mirrors the existing fallback in core.py:128-136 for encode_with_unstable, and adds a regression test next to test_encode_surrogate_pairs that exercises both code paths.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #541.
encode_with_unstable()historically surfaced a rawUnicodeEncodeErrorfrom the Rust boundary on inputs containingunmatched surrogate pairs or lone surrogates, while
encode()andencode_ordinary()already accepted the same inputs by repairing themvia UTF-16
surrogatepass. The three methods now share the samecontract.
The change is a small
try/exceptwrapping the existingself._core_bpe.encode_with_unstable(...)call intiktoken/core.py,exactly mirroring the fallback already present at
tiktoken/core.py:128-136for
encode().A regression test is added next to
test_encode_surrogate_pairsthatexercises both a split surrogate pair (
"👍") and a lonesurrogate (
"\ud83d"). It fails onorigin/mainwithUnicodeEncodeErrorand passes on this branch.Reproduce BEFORE/AFTER yourself (copy-paste)
What I ran locally
Verified the new test fails on
origin/main(before thetiktoken/core.pypatch is applied) with
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d', and passes after the patch.Edge cases
encode(existing)encode_with_unstable(after)"👍"(well-formed)"👍"(split surrogate pair → 👍 after repair)"\ud83d"(lone high surrogate →�after repair)��"\udc4d"(lone low surrogate →�after repair)��The fallback only fires inside the
except UnicodeEncodeErrorarm, sothe common ASCII / well-formed UTF-8 hot path is untouched.
PR drafted with assistance from Claude Code (Anthropic). The change was
reviewed manually against tiktoken's source (the new arm mirrors the
existing one at
tiktoken/core.py:128-136). The reproducer block aboveis the one I used during development; reviewers can paste it verbatim.