
fix(fetch): detect encoding for non-UTF-8 pages using charset-normalizer #3880

Open
olegsa wants to merge 4 commits into modelcontextprotocol:main from olegsa:fix/fetch-encoding-detection

Conversation


@olegsa olegsa commented Apr 9, 2026

Summary

  • Add automatic character encoding detection to mcp-server-fetch using charset-normalizer for pages that don't declare a charset in the HTTP Content-Type header
  • Introduces a get_response_text() helper that checks response.charset_encoding first and falls back to statistical byte analysis via charset-normalizer
  • Fixes garbled text when fetching pages served in non-UTF-8 encodings (e.g. windows-1251, windows-1255, windows-1256, euc-kr) that lack a charset declaration in the HTTP header

Motivation

Many websites (especially non-English ones) serve content in legacy encodings like windows-1255 (Hebrew), windows-1251 (Cyrillic), windows-1256 (Arabic), or euc-kr (Korean) without declaring the charset in the HTTP Content-Type header. The current code uses response.text, which defaults to UTF-8 and produces garbled mojibake output for these pages.

Changes

  • server.py: Added get_response_text(), which uses charset-normalizer for encoding detection when HTTP headers lack charset info; replaced response.text with get_response_text(response) in fetch_url(); also promoted httpx from a function-local import to a top-level import.
  • pyproject.toml: Added charset-normalizer>=3.0.0 as an explicit dependency (already a transitive dep via requests).
  • tests/test_server.py: Added TestGetResponseText class with 5 tests covering UTF-8 passthrough, Ukrainian (windows-1251), Hebrew (windows-1255), Arabic (windows-1256), and Korean (euc-kr) encoding detection.
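
The helper described above can be sketched roughly as follows. The name get_response_text() and the header-first/detect-second logic come from this PR's description, but the exact body below is illustrative, not the merged code. It assumes an httpx.Response-like object exposing `.content` (raw bytes) and `.charset_encoding` (the charset parsed from the Content-Type header, or None), which matches httpx's actual API:

```python
from charset_normalizer import from_bytes


def get_response_text(response) -> str:
    """Decode an httpx response, detecting the charset when the header omits it."""
    # 1. Trust an explicit charset from the Content-Type header when present.
    if response.charset_encoding:
        return response.content.decode(response.charset_encoding, errors="replace")
    # 2. No declared charset: run charset-normalizer's statistical detection
    #    over the raw bytes and take the best-ranked candidate.
    best = from_bytes(response.content).best()
    if best is not None:
        return str(best)  # CharsetMatch.__str__ returns the decoded text
    # 3. Nothing detected (e.g. an empty body): fall back to UTF-8 with
    #    replacement characters rather than raising.
    return response.content.decode("utf-8", errors="replace")
```

Decoding with errors="replace" keeps the helper from raising on a mislabeled header, at the cost of replacement characters instead of an exception.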

Test plan

  • All existing tests pass
  • New encoding detection tests pass for Ukrainian, Hebrew, Arabic, Korean
  • UTF-8 pages with charset in HTTP header still use the standard path
  • Non-HTML content (JSON, etc.) is unaffected

Made with Cursor

olegsa added 4 commits April 9, 2026 10:32
…tests

Add charset-normalizer as an explicit dependency in pyproject.toml and
add tests verifying correct decoding of non-UTF-8 pages (Ukrainian
windows-1251, Hebrew windows-1255, Arabic windows-1256, Korean euc-kr).

Made-with: Cursor
