Repair surrogates in encode_with_unstable (Fixes #541) by jbbqqf · Pull Request #553 · openai/tiktoken

jbbqqf · 2026-05-23T10:03:11Z

Summary

Fixes #541. encode_with_unstable() previously raised a raw
UnicodeEncodeError from the Rust boundary on inputs containing
split surrogate pairs or lone surrogates, while encode() and
encode_ordinary() already accepted the same inputs by repairing them
with a UTF-16 surrogatepass round trip. The three methods now share
the same contract.

The change is a small try/except wrapping the existing
self._core_bpe.encode_with_unstable(...) call in tiktoken/core.py,
exactly mirroring the fallback already present at tiktoken/core.py:128-136
for encode().
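
For reviewers who want the shape of the fallback without opening core.py, here is a minimal, self-contained sketch. It is not the patch itself: fake_core_encode_with_unstable is a hypothetical stand-in for the Rust call (it only mimics the fact that the boundary needs valid UTF-8), and repair_surrogates is an illustrative name for the UTF-16 surrogatepass round trip described above.

```python
def repair_surrogates(text: str) -> str:
    # "surrogatepass" lets surrogate code points through on encode;
    # decoding with "replace" then merges a high+low pair into one code
    # point and turns any surrogate left unpaired into U+FFFD.
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


def fake_core_encode_with_unstable(text: str):
    # Hypothetical stand-in for the Rust call. Like the real boundary it
    # requires valid UTF-8, so surrogates raise UnicodeEncodeError.
    data = text.encode("utf-8")
    return list(data), []  # (stable "tokens", completions) placeholder


def encode_with_unstable(text: str):
    try:
        # Hot path: well-formed text never reaches the except arm.
        return fake_core_encode_with_unstable(text)
    except UnicodeEncodeError:
        # Fallback: retry once against the repaired text.
        return fake_core_encode_with_unstable(repair_surrogates(text))
```

With this sketch, encode_with_unstable("\ud83d") returns the same value as encode_with_unstable("\ufffd") instead of raising, which is the contract the PR aligns across the three methods.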

A regression test is added next to test_encode_surrogate_pairs that
exercises both a split surrogate pair ("\ud83d\udc4d") and a lone
surrogate ("\ud83d"). It fails on origin/main with
UnicodeEncodeError and passes on this branch.

Reproduce BEFORE/AFTER yourself (copy-paste)

git clone https://github.com/openai/tiktoken.git /tmp/tt-541 && cd /tmp/tt-541
pip install -e . pytest hypothesis

# BEFORE — on origin/main, encode_with_unstable raises UnicodeEncodeError
git checkout main
python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
print('encode lone surrogate ok:', enc.encode('\\ud83d'))
try:
    print(enc.encode_with_unstable('\\ud83d'))
except UnicodeEncodeError as e:
    print('encode_with_unstable FAILED with:', e)
"
# Expected: encode succeeds, encode_with_unstable raises UnicodeEncodeError

# AFTER — on this branch, encode_with_unstable accepts the same input
git fetch https://github.com/jbbqqf/tiktoken.git fix/541-encode-with-unstable-surrogates
git checkout FETCH_HEAD
pip install -e .
python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
print('encode lone surrogate ok:', enc.encode('\\ud83d'))
print('encode_with_unstable ok:', enc.encode_with_unstable('\\ud83d'))
"
# Expected: both calls succeed; encode_with_unstable returns (stable, completions)

What I ran locally

$ pytest tests/test_encoding.py::test_encode_with_unstable_surrogate_pairs -v
tests/test_encoding.py::test_encode_with_unstable_surrogate_pairs PASSED

$ pytest
============== 34 passed in 1.83s ==============

Verified the new test fails on origin/main (before the tiktoken/core.py
patch is applied) with UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d', and passes after the patch.

Edge cases

Input	`encode` (existing)	`encode_with_unstable` (after)
`"👍"` (well-formed)	tokens	stable + completions
`"\ud83d\udc4d"` (split surrogate pair → 👍 after repair)	same as 👍	same as 👍
`"\ud83d"` (lone high surrogate → `�` after repair)	same as `�`	same as `�`
`"\udc4d"` (lone low surrogate → `�` after repair)	same as `�`	same as `�`
Plain ASCII	unchanged	unchanged (no fallback path taken)

The fallback only fires inside the except UnicodeEncodeError arm, so
the common ASCII / well-formed UTF-8 hot path is untouched.
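
The rows of the table can be checked without tiktoken installed. The sketch below is only an illustration of the repair step and of when the fallback would be needed; repair and needs_fallback are hypothetical helper names, not functions in tiktoken.

```python
def repair(text: str) -> str:
    # UTF-16 round trip: pairs are merged, unpaired surrogates become U+FFFD.
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


def needs_fallback(text: str) -> bool:
    # The except arm is only reached when the text is not encodable as UTF-8.
    try:
        text.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True


# input -> (text after repair, whether the fallback path would be taken)
cases = {
    "\U0001F44D": ("\U0001F44D", False),   # well-formed emoji
    "\ud83d\udc4d": ("\U0001F44D", True),  # split surrogate pair
    "\ud83d": ("\ufffd", True),            # lone high surrogate
    "\udc4d": ("\ufffd", True),            # lone low surrogate
    "hello": ("hello", False),             # plain ASCII
}

for raw, (expected_text, expected_fallback) in cases.items():
    assert repair(raw) == expected_text
    assert needs_fallback(raw) == expected_fallback
```

The False entries correspond to the "no fallback path taken" rows: for those inputs the first call succeeds and the repair is never run.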

PR drafted with assistance from Claude Code (Anthropic). The change was
reviewed manually against tiktoken's source (the new arm mirrors the
existing one at tiktoken/core.py:128-136). The reproducer block above
is the one I used during development; reviewers can paste it verbatim.

encode() and encode_ordinary() already wrap their Rust call in try/except UnicodeEncodeError and retry against the UTF-16 surrogatepass-repaired text. encode_with_unstable() did not, so it surfaced a raw UnicodeEncodeError on unmatched surrogate pairs and lone surrogates that the other two methods accept. This change mirrors the existing fallback in core.py:128-136 for encode_with_unstable, and adds a regression test next to test_encode_surrogate_pairs that exercises both code paths.

jbbqqf wants to merge 1 commit into openai:main from jbbqqf:fix/541-encode-with-unstable-surrogates
