Repair surrogates in encode_with_unstable (Fixes #541) #553

Open
jbbqqf wants to merge 1 commit into openai:main from jbbqqf:fix/541-encode-with-unstable-surrogates

@jbbqqf jbbqqf commented May 23, 2026

Summary

Fixes #541. Until now, encode_with_unstable() let a raw
UnicodeEncodeError escape from the Rust boundary on inputs containing
unmatched surrogate pairs or lone surrogates, while encode() and
encode_ordinary() already accepted the same inputs by repairing them
through a UTF-16 round trip with the surrogatepass error handler. With
this change, the three methods share the same contract.

The change is a small try/except wrapping the existing
self._core_bpe.encode_with_unstable(...) call in tiktoken/core.py,
exactly mirroring the fallback already present at tiktoken/core.py:128-136
for encode().
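
The shape of that fallback can be sketched without tiktoken installed. This is a minimal illustration, not the patch itself: strict_utf8_core and call_with_surrogate_fallback are hypothetical names, and the stand-in core simply UTF-8-encodes the text so that it rejects surrogates the way the Rust boundary is described as doing above.

```python
from typing import Callable, List


def strict_utf8_core(text: str) -> List[int]:
    """Hypothetical stand-in for the Rust call: rejects surrogate code points."""
    # str.encode("utf-8") raises UnicodeEncodeError on any surrogate.
    return list(text.encode("utf-8"))


def call_with_surrogate_fallback(
    core: Callable[[str], List[int]], text: str
) -> List[int]:
    """Try the core call as-is; on UnicodeEncodeError, repair and retry once."""
    try:
        # Hot path: well-formed text goes straight through.
        return core(text)
    except UnicodeEncodeError:
        # Round-trip through UTF-16: surrogatepass lets surrogates be encoded,
        # and decoding joins a split pair into one code point while
        # errors="replace" turns a lone surrogate into U+FFFD.
        repaired = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
        return core(repaired)


thumbs_up = call_with_surrogate_fallback(strict_utf8_core, "\ud83d\udc4d")
lone = call_with_surrogate_fallback(strict_utf8_core, "\ud83d")
```

Here the split pair ends up as the UTF-8 bytes of U+1F44D, and the lone surrogate as the bytes of U+FFFD.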

A regression test is added next to test_encode_surrogate_pairs that
exercises both a split surrogate pair ("\ud83d\udc4d") and a lone
surrogate ("\ud83d"). It fails on origin/main with
UnicodeEncodeError and passes on this branch.

Reproduce BEFORE/AFTER yourself (copy-paste)

git clone https://github.com/openai/tiktoken.git /tmp/tt-541 && cd /tmp/tt-541
pip install -e . pytest hypothesis

# BEFORE — on origin/main, encode_with_unstable raises UnicodeEncodeError
git checkout main
python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
print('encode lone surrogate ok:', enc.encode('\\ud83d'))
try:
    print(enc.encode_with_unstable('\\ud83d'))
except UnicodeEncodeError as e:
    print('encode_with_unstable FAILED with:', e)
"
# Expected: encode succeeds, encode_with_unstable raises UnicodeEncodeError

# AFTER — on this branch, encode_with_unstable accepts the same input
git fetch https://github.com/jbbqqf/tiktoken.git fix/541-encode-with-unstable-surrogates
git checkout FETCH_HEAD
pip install -e .
python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
print('encode lone surrogate ok:', enc.encode('\\ud83d'))
print('encode_with_unstable ok:', enc.encode_with_unstable('\\ud83d'))
"
# Expected: both calls succeed; encode_with_unstable returns (stable, completions)

What I ran locally

$ pytest tests/test_encoding.py::test_encode_with_unstable_surrogate_pairs -v
tests/test_encoding.py::test_encode_with_unstable_surrogate_pairs PASSED

$ pytest
============== 34 passed in 1.83s ==============

Verified the new test fails on origin/main (before the tiktoken/core.py
patch is applied) with UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d', and passes after the patch.

Edge cases

Input encode (existing) encode_with_unstable (after)
"👍" (well-formed) tokens stable + completions
"👍" (split surrogate pair → 👍 after repair) same as 👍 same as 👍
"\ud83d" (lone high surrogate → after repair) same as same as
"\udc4d" (lone low surrogate → after repair) same as same as
Plain ASCII unchanged unchanged (no fallback path taken)

The fallback only fires inside the except UnicodeEncodeError arm, so
the common ASCII / well-formed UTF-8 hot path is untouched.
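
For reviewers who want to see what the repair does to each kind of input, here is a small standard-library sketch. The repair helper and the cases dictionary are illustrative names, not part of tiktoken. The sketch applies the UTF-16 round trip directly, so it also shows that the repair leaves well-formed text unchanged, even though the fallback would never run on such text.

```python
def repair(text: str) -> str:
    """Round-trip through UTF-16 so surrogates are joined or replaced."""
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


# Illustrative inputs; the labels are only for display.
cases = {
    "well-formed": "\U0001F44D",
    "split pair": "\ud83d\udc4d",
    "lone high": "\ud83d",
    "ascii": "hello",
}

repaired = {name: repair(text) for name, text in cases.items()}

for name, text in repaired.items():
    # Print code points rather than the characters to avoid terminal issues.
    print(name, [hex(ord(ch)) for ch in text])
```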


PR drafted with assistance from Claude Code (Anthropic). The change was
reviewed manually against tiktoken's source (the new arm mirrors the
existing one at tiktoken/core.py:128-136). The reproducer block above
is the one I used during development; reviewers can paste it verbatim.

encode() and encode_ordinary() already wrap their Rust call in
try/except UnicodeEncodeError and retry against the UTF-16
surrogatepass-repaired text. encode_with_unstable() did not, so it
surfaced a raw UnicodeEncodeError on unmatched surrogate pairs and
lone surrogates that the other two methods accept.

This change mirrors the existing fallback in core.py:128-136 for
encode_with_unstable, and adds a regression test next to
test_encode_surrogate_pairs that exercises both code paths.

Successfully merging this pull request may close these issues.

encode_with_unstable does not handle surrogate pairs like encode
