gh-136595: Normalize surrogate pairs in REPL input to fix UnicodeEnco… #136639

vedant713 · 2025-07-14T01:26:03Z

The new REPL implementation (_pyrepl) crashes on Windows when the user enters Unicode characters outside the Basic Multilingual Plane (U+10000 and above), such as emoji (e.g. 🐍). This happens because the Windows input layer delivers each such character as a surrogate pair (two UTF-16 code units), which _pyrepl attempts to process and tokenize directly, so the two halves are handled as unpaired surrogates rather than as one character.
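As a rough illustration of the failure mode, the snippet below builds a string that holds the two UTF-16 code units of 🐍 as separate elements and tries to encode it. The string literal and the choice of UTF-8 as the target codec are assumptions made for the demonstration; the exact call that fails inside _pyrepl is not shown in this PR description.

```python
# Assumed input shape: the two UTF-16 code units of U+1F40D stored as
# two separate str elements instead of one code point.
pair = "\ud83d\udc0d"

error = None
try:
    # Strict codecs refuse to encode surrogate code points.
    pair.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc

print(len(pair), type(error).__name__)
```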

This commit introduces a normalize_surrogates() helper in Reader to explicitly normalize surrogate pairs by encoding to UTF-16 with 'surrogatepass' and decoding back. The get_unicode() method is patched to use this normalization so that any code consuming REPL input (e.g. syntax highlighting via tokenize) receives valid Unicode text.
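The round trip described above can be sketched as a standalone function. This is a minimal sketch and not the code from the PR: the free-function form, the type hints, and the fallback for strings containing a lone surrogate are assumptions added for illustration.

```python
def normalize_surrogates(text: str) -> str:
    """Combine UTF-16 surrogate pairs in `text` into single code points.

    Illustrative sketch only; the helper in the PR lives on Reader and
    may differ in detail.
    """
    try:
        # 'surrogatepass' lets the encoder emit surrogate code points as
        # raw UTF-16 code units; decoding then reads each adjacent
        # high/low pair as one non-BMP character.
        return text.encode("utf-16", "surrogatepass").decode("utf-16")
    except UnicodeDecodeError:
        # Assumption: a lone surrogate cannot be paired, so the text is
        # returned unchanged rather than raising.
        return text


snake = normalize_surrogates("\ud83d\udc0d")
print(len(snake), snake == "\U0001F40D")
```

Text without surrogates passes through the round trip unchanged, so applying the helper to all input should be safe.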

This resolves UnicodeEncodeError crashes in the REPL when typing emoji or other non-BMP characters on Windows.

Fixes #136595

Issue: Unicode characters ≥ 0x10000 cannot be inputted/behaves unusually at the REPL terminal. #136595

…deEncodeError on Windows

bedevere-app · 2025-07-14T01:26:08Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

bedevere-app bot mentioned this pull request Jul 14, 2025

Unicode characters ≥ 0x10000 cannot be inputted/behaves unusually at the REPL terminal. #136595

Open

blurb-it bot and others added 2 commits July 14, 2025 01:27

📜🤖 Added by blurb_it.

a567845

Update reader.py

7a31a1f
