fix(py): make encode_with_unstable handle surrogates like encode (#541) #556

Open
LeSingh1 wants to merge 1 commit into openai:main from LeSingh1:fix/encode-with-unstable-surrogates

@LeSingh1

Closes #541.

Encoding.encode and Encoding.encode_ordinary already catch UnicodeEncodeError raised by the Rust BPE when the input string contains surrogate pairs or lone surrogates, then retry after a UTF-16 surrogatepass + replace round-trip. Encoding.encode_with_unstable did not, so the same inputs that worked through encode raised UnicodeEncodeError here.

This PR mirrors the same try / except / repair pattern so the three encode paths agree on what inputs they accept.
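
For readers unfamiliar with the pattern, here is a minimal sketch of the try / except / repair flow described above. It is not the code from tiktoken/core.py: `strict_backend` is a hypothetical stand-in for the Rust BPE call (it only requires that the text be encodable as UTF-8), and `repair_surrogates` and `encode_with_repair` are illustrative names.

```python
from typing import Callable, List


def repair_surrogates(text: str) -> str:
    # UTF-16 round-trip: "surrogatepass" lets surrogates through on encode,
    # and decoding joins a valid pair into one code point. "replace" turns
    # anything that cannot be decoded (e.g. a lone surrogate) into U+FFFD.
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


def strict_backend(text: str) -> List[int]:
    # Hypothetical stand-in for the Rust BPE: raises UnicodeEncodeError
    # when the str contains surrogates, because they are not valid UTF-8.
    return list(text.encode("utf-8"))


def encode_with_repair(backend: Callable[[str], List[int]], text: str) -> List[int]:
    # Try the fast path first; only repair the text if the backend rejects it.
    try:
        return backend(text)
    except UnicodeEncodeError:
        return backend(repair_surrogates(text))


if __name__ == "__main__":
    paired = "\ud83d\udc4d"  # surrogate pair spelling of U+1F44D
    print(encode_with_repair(strict_backend, paired))
    print(repr(repair_surrogates("\ud83d")))  # lone surrogate
```

The retry only runs on the error path, so well-formed input pays no extra cost.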

Repro (against current main)

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

enc.encode("\ud83d\udc4d")               # works, returns [9468, 239, 235]
enc.encode_with_unstable("\ud83d\udc4d") # raises UnicodeEncodeError

Verification

Added test_encode_with_unstable_surrogate_pairs in tests/test_encoding.py. The test fails on main with the exact UnicodeEncodeError quoted in #541 and passes on this branch. I confirmed locally:

without fix: FAILED tests/test_encoding.py::test_encode_with_unstable_surrogate_pairs - UnicodeEncodeError
with fix:    PASSED tests/test_encoding.py::test_encode_with_unstable_surrogate_pairs

The new test cross-checks the stable prefix against enc.encode(...) to ensure both paths agree on the resulting tokens, not just that no exception is raised.
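
The test body is not reproduced in this description. As a rough sketch of what a cross-check of this kind could look like, the helper below compares the stable tokens from encode_with_unstable against the start of the encode result. `check_stable_prefix` and `ByteEncoder` are hypothetical names; `ByteEncoder` is a toy byte-level stand-in so the sketch runs without tiktoken, and it assumes only that encode_with_unstable returns a (stable tokens, completions) pair.

```python
from __future__ import annotations


def check_stable_prefix(enc, text: str) -> list[int]:
    # Both paths must accept the input, and the stable tokens must match
    # the beginning of the full encoding.
    stable, _completions = enc.encode_with_unstable(text)
    full = enc.encode(text)
    assert full[: len(stable)] == stable, (stable, full)
    return stable


class ByteEncoder:
    # Hypothetical stand-in, NOT tiktoken: one token per UTF-8 byte,
    # with the last byte treated as the unstable tail.
    @staticmethod
    def _repair(text: str) -> str:
        return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")

    def encode(self, text: str) -> list[int]:
        return list(self._repair(text).encode("utf-8"))

    def encode_with_unstable(self, text: str) -> tuple[list[int], list[list[int]]]:
        tokens = self.encode(text)
        return tokens[:-1], [tokens[-1:]]


if __name__ == "__main__":
    print(check_stable_prefix(ByteEncoder(), "\ud83d\udc4d"))
    # With tiktoken installed, the same helper could be called as:
    # check_stable_prefix(tiktoken.get_encoding("cl100k_base"), "\ud83d\udc4d")
```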

Scope

Single-file change in tiktoken/core.py (8-line try / except wrap mirroring encode's) plus one regression test.

…nai#541)

encode and encode_ordinary already catch UnicodeEncodeError raised by
the Rust BPE when the input string contains surrogate pairs / lone
surrogates, then retry after a UTF-16 "surrogatepass" + "replace"
round-trip. encode_with_unstable did not, so the same inputs that
worked through encode raised UnicodeEncodeError here.

Mirror the same try / except / repair pattern so the three encode
paths agree on what inputs they accept.

Development

Successfully merging this pull request may close these issues.

encode_with_unstable does not handle surrogate pairs like encode
