Fix mojibake on Cyrillic subtitles (CP1251 + Wyzie broken UTF-8)#89
Closed
nevatas wants to merge 2 commits into
Embed players (Videasy, VidSrc, 2Embed) fetch subtitle files directly
from third-party CDNs. Two problems made Cyrillic subtitles unreadable:
1. Some CDNs serve files in legacy single-byte encodings (CP1251 for
Russian) without a charset header. Browsers fall back to UTF-8 /
Latin-1 and render mojibake.
2. sub.wyzie.io converts CP1251 to "UTF-8" with a broken pipeline that
mixes CP1252/CP1256 char tables. The result is valid UTF-8 but reads
as Arabic + Latin-1 supplement gibberish instead of Cyrillic.
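The first problem (raw CP1251 bytes with no charset) can be illustrated with a small sketch. This is not the code from src/ipc/subtitleProxy.js; `decodeSubtitleBytes` is a hypothetical helper, and it assumes a Node build with full ICU so that `TextDecoder` knows the `windows-1251` label.

```javascript
// Hypothetical helper (not the PR's actual code): try strict UTF-8 first,
// and fall back to a legacy single-byte encoding when the bytes are not valid UTF-8.
function decodeSubtitleBytes(bytes, fallback = "windows-1251") {
  try {
    // fatal: true makes the decoder throw on malformed input
    // instead of silently emitting U+FFFD replacement characters.
    const text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return { text, encoding: "utf-8" };
  } catch {
    // Assumption: a file that is not valid UTF-8 is treated as CP1251.
    const text = new TextDecoder(fallback).decode(bytes);
    return { text, encoding: fallback };
  }
}

// "Привет" encoded as CP1251; this byte sequence is not valid UTF-8.
const cp1251Bytes = Uint8Array.from([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);
console.log(decodeSubtitleBytes(cp1251Bytes).text); // prints "Привет"
```

The fallback encoding is a parameter because the sketch cannot tell CP1251 apart from other single-byte encodings; it only detects that the input is not UTF-8.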
Add a small loopback HTTP server (src/ipc/subtitleProxy.js) that fetches
each subtitle file, detects/repairs the encoding (incl. reversing the
Wyzie mangle by mapping each char back through CP1252/CP1256 reverse
tables and decoding the resulting byte stream as CP1251), and re-serves
the file as UTF-8. main.js redirects subtitle requests in persist:player
through the proxy via webRequest, which can only redirect — not rewrite
response bodies.
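Because webRequest can only redirect, the original subtitle URL has to travel inside the redirect target. The follow-up commit message says the target is reconstructed from a base64 query param; the sketch below shows one way that round trip could look. The `/sub` path, the `u` parameter name, and the choice of the URL-safe base64 variant are assumptions for illustration, not details taken from the PR.

```javascript
// Hypothetical names: the "/sub" path and the "u" query parameter are assumptions.
function toProxyUrl(targetUrl, proxyPort) {
  // base64url avoids "+", "/" and "=" so the value needs no extra escaping.
  const encoded = Buffer.from(targetUrl, "utf8").toString("base64url");
  return `http://127.0.0.1:${proxyPort}/sub?u=${encoded}`;
}

// Returns the decoded target as a URL object, or null if the parameter
// is missing or does not decode to an absolute URL.
function fromProxyUrl(requestUrl) {
  const param = new URL(requestUrl, "http://127.0.0.1").searchParams.get("u");
  if (!param) return null;
  try {
    return new URL(Buffer.from(param, "base64url").toString("utf8"));
  } catch {
    return null;
  }
}

const proxied = toProxyUrl("https://sub.wyzie.io/c/abc/id/1", 4321);
console.log(fromProxyUrl(proxied).href); // prints "https://sub.wyzie.io/c/abc/id/1"
```

Parsing the decoded value with `new URL` before using it gives the server a single place to validate the target.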
The Wyzie fix triggers only on the unique mojibake signature (Arabic AND
Latin-1 supplement chars in the same file, no real Cyrillic) and is
validated post-recovery (≥30% Cyrillic chars), so legitimate Arabic,
French, etc. subtitles pass through unchanged.
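A minimal sketch of that two-stage guard, under stated assumptions: the function names are hypothetical, the Unicode ranges are the basic Arabic, Latin-1 Supplement, and Cyrillic blocks, and the 30% ratio is computed over non-whitespace characters (the commit message does not say what the denominator is).

```javascript
// Non-global regexes, so .test() carries no lastIndex state between calls.
const ARABIC = /[\u0600-\u06FF]/;
const LATIN1_SUPPLEMENT = /[\u00A0-\u00FF]/;
const CYRILLIC = /[\u0400-\u04FF]/;

// Signature check: Arabic AND Latin-1 supplement present, no real Cyrillic.
function looksLikeWyzieMangle(text) {
  return ARABIC.test(text) && LATIN1_SUPPLEMENT.test(text) && !CYRILLIC.test(text);
}

// Share of Cyrillic characters among non-whitespace characters (assumed denominator).
function cyrillicRatio(text) {
  const chars = [...text].filter((ch) => /\S/.test(ch));
  if (chars.length === 0) return 0;
  return chars.filter((ch) => CYRILLIC.test(ch)).length / chars.length;
}

// Post-recovery validation: keep the repaired text only if it is mostly Cyrillic.
function acceptRecovery(recovered, threshold = 0.3) {
  return cyrillicRatio(recovered) >= threshold;
}

console.log(looksLikeWyzieMangle("مرحبا بالعالم")); // real Arabic only: false
console.log(acceptRecovery("Где же ты?"));
```

Requiring both scripts at once is what lets plain Arabic or plain French files skip the repair path entirely.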
Covers all CP1251-encoded Cyrillic languages: Russian, Ukrainian,
Belarusian, Bulgarian, Serbian (Cyrillic), Macedonian.
Also widens the playerSession URL filter to catch .srt files and Wyzie's
/c/<hash>/id/<id> path (no extension), and narrows the Wyzie pattern to
/c/* so the /search endpoint isn't dragged into the media-only branch
(which would cancel it as if it were an ad).
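The widened filter could be sketched roughly as follows. `isSubtitleRequest` is a hypothetical name, and the matching rules are inferred from the patterns listed in the PR description (*.vtt, *.srt, and wyzie.{io,ru}/c/*); the real filter in the player session may be structured differently.

```javascript
// Hypothetical predicate: true for requests that should go through the subtitle proxy.
function isSubtitleRequest(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not an absolute URL
  }
  const path = url.pathname.toLowerCase();
  if (path.endsWith(".vtt") || path.endsWith(".srt")) return true;

  // Wyzie subtitle files have no extension, so match on host + /c/ prefix only.
  // /search stays outside the pattern on purpose.
  const isWyzie = /(^|\.)wyzie\.(io|ru)$/.test(url.hostname);
  return isWyzie && path.startsWith("/c/");
}

console.log(isSubtitleRequest("https://sub.wyzie.io/c/abc/id/123")); // true
console.log(isSubtitleRequest("https://sub.wyzie.io/search?id=123")); // false
```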
CodeQL flagged the fetch() call as a server-side request forgery sink (github-advanced-security truelockmc#13) because the target URL is reconstructed from a base64 query param. The proxy server binds to 127.0.0.1 only, but any code that can reach it (any embed iframe in the player session) could pivot through our Node process to scan the user's loopback / LAN or hit cloud-metadata services.
Add isPrivateHostname() that rejects loopback / link-local / RFC1918 IPv4, IPv6 ULA + link-local + IPv4-mapped loopback, plus *.local / *.internal / *.localhost hostnames. Apply it both to the initial target URL and to every redirect hop (manual redirect handling, so a public CDN can't 30x us into a private address either).
Validated locally:
✓ 400 ← http://127.0.0.1:22/
✓ 400 ← http://localhost:8080/
✓ 400 ← http://169.254.169.254/ (cloud metadata)
✓ 400 ← http://192.168.1.1/
✓ 200 ← http://example.com/ (public, still works)
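For readers who want a feel for what such a check involves, here is a minimal sketch. It reuses the name isPrivateHostname() from the commit message but is not the PR's implementation; the helper `isPrivateIPv4` and the exact regexes are assumptions. It works on the string returned by `URL.hostname`, which wraps IPv6 literals in brackets and serializes IPv4-mapped addresses in hex form.

```javascript
// Sketch only; the PR's real isPrivateHostname() may differ.
function isPrivateIPv4(host) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 127 || // loopback 127.0.0.0/8
    a === 10 || // RFC1918 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // RFC1918 172.16.0.0/12
    (a === 192 && b === 168) || // RFC1918 192.168.0.0/16
    (a === 169 && b === 254) // link-local 169.254.0.0/16
  );
}

function isPrivateHostname(hostname) {
  let host = hostname.toLowerCase().replace(/\.$/, "");
  if (host.startsWith("[") && host.endsWith("]")) host = host.slice(1, -1);

  if (host === "localhost" || /\.(local|internal|localhost)$/.test(host)) return true;
  if (isPrivateIPv4(host)) return true;

  if (host.includes(":")) {
    if (host === "::1") return true; // IPv6 loopback
    if (/^f[cd][0-9a-f]{2}:/.test(host)) return true; // ULA fc00::/7
    if (/^fe[89ab][0-9a-f]:/.test(host)) return true; // link-local fe80::/10
    const mapped = /^::ffff:(.+)$/.exec(host); // IPv4-mapped
    if (mapped) {
      if (mapped[1].includes(".")) return isPrivateIPv4(mapped[1]);
      const hex = /^([0-9a-f]{1,4}):([0-9a-f]{1,4})$/.exec(mapped[1]);
      if (hex) {
        const hi = parseInt(hex[1], 16);
        const lo = parseInt(hex[2], 16);
        return isPrivateIPv4(`${hi >> 8}.${hi & 255}.${lo >> 8}.${lo & 255}`);
      }
    }
  }
  return false;
}

console.log(isPrivateHostname(new URL("http://169.254.169.254/").hostname)); // true
console.log(isPrivateHostname(new URL("http://example.com/").hostname)); // false
```

A check like this looks only at the hostname string, so a public name that resolves to a private address would still pass; the sketch does not attempt to cover that case.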
Author
@truelockmc, quick heads-up on the CodeQL SSRF alert above: Commit a01d356 addresses it. The original review now shows as Outdated because it pointed at the old
The PR is currently blocked on "1 workflow awaiting approval". Could you approve the workflow run so CodeQL can re-scan? It should clear the alert against
Comment on lines +72 to +79
const res = await fetch(current, {
  redirect: "manual",
  headers: {
    "User-Agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    ...(referer ? { Referer: referer } : {}),
  },
});
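The snippet above is one iteration of the manual redirect handling described in the commit message. A rough sketch of the surrounding loop is shown below, assuming a hop limit of 5 and a caller-supplied `isPrivate` predicate; the function name, the `fetchImpl` injection point, and the error messages are all hypothetical.

```javascript
// Sketch: follow redirects by hand so every hop can be checked before it is fetched.
// isPrivate is a predicate such as the isPrivateHostname() named in the commit message.
async function fetchFollowingSafeRedirects(startUrl, { isPrivate, fetchImpl = fetch, maxHops = 5 }) {
  let current = new URL(startUrl);
  for (let hop = 0; hop <= maxHops; hop++) {
    // Check the initial URL and each redirect target alike.
    if (!/^https?:$/.test(current.protocol) || isPrivate(current.hostname)) {
      throw new Error(`blocked target: ${current.href}`);
    }
    const res = await fetchImpl(current, { redirect: "manual" });
    const location = res.headers.get("location");
    const isRedirect = res.status >= 300 && res.status < 400 && location;
    if (!isRedirect) return res;
    // Location may be relative, so resolve it against the current URL.
    current = new URL(location, current);
  }
  throw new Error("too many redirects");
}
```

Usage would look like `await fetchFollowingSafeRedirects(target, { isPrivate: isPrivateHostname })`, with the proxy answering 400 when the call throws.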
vanthanhnguyen260696-prog
approved these changes
May 22, 2026
Owner
This is kind of suspicious. And the PR has security problems?! I think I will close this for now.
What you'll see today
Pick any movie/episode that has Russian (or another Cyrillic) subtitles in Videasy / VidSrc / 2Embed. The subtitle overlay renders as garbage instead of Cyrillic: either Latin-1 mojibake (âïˆâûˆ ïî_âèëàü) or a mix of Arabic + Latin (انهٌü يهٍ), depending on which mangling the source CDN applied.
Root cause
Two distinct, but overlapping, issues in the third-party subtitle CDNs the embed players load files from:
1. Legacy single-byte encodings without a charset. Several CDNs still serve Russian subtitles as raw CP1251 bytes with Content-Type: text/plain (no charset=). Browsers fall back to UTF-8 / Latin-1 and render mojibake.
2. sub.wyzie.io has a broken CP1251 → UTF-8 converter. Its encoding=UTF-8 pipeline interprets each source byte using a mix of CP1252 (for the 0xA0-0xFF range) and CP1256 (which “wins” for bytes that overlap with the Arabic block in 0xC0-0xDF). The output is valid UTF-8 but reads as Arabic + Latin-1-supplement gibberish instead of Cyrillic. Re-requesting with encoding=cp1251 / encoding=windows-1251 / no encoding / encoding=raw / encoding=original all return the exact same broken bytes; there is no upstream parameter we can flip to bypass it.
Verified by fetching a real Russian sub URL and reversing the mangling byte-by-byte:
That recovery is exactly what this PR does at runtime.
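As an illustration of that recovery (not the code from src/ipc/subtitleProxy.js), the sketch below builds the CP1252 and CP1256 reverse tables from `TextDecoder` and maps each mangled character back to a byte before decoding as CP1251. Function names are hypothetical, the lookup order (CP1252 first, then CP1256) is an assumption, and a Node build with full ICU is required for the legacy encoding labels.

```javascript
// Build a char -> byte map for the high half (0x80-0xFF) of a single-byte encoding.
function buildReverseTable(label) {
  const decoder = new TextDecoder(label);
  const table = new Map();
  for (let byte = 0x80; byte <= 0xff; byte++) {
    table.set(decoder.decode(Uint8Array.of(byte)), byte);
  }
  return table;
}

const CP1252_REVERSE = buildReverseTable("windows-1252");
const CP1256_REVERSE = buildReverseTable("windows-1256");

// Hypothetical helper: returns the recovered text, or null if some character
// cannot be mapped back to a byte (i.e. the input is not this kind of mojibake).
function demangleToCyrillic(mangled) {
  const bytes = [];
  for (const ch of mangled) {
    const code = ch.codePointAt(0);
    if (code < 0x80) {
      bytes.push(code); // ASCII passes through unchanged
      continue;
    }
    // Latin characters resolve via CP1252; Arabic ones only exist in CP1256.
    const byte = CP1252_REVERSE.get(ch) ?? CP1256_REVERSE.get(ch);
    if (byte === undefined) return null;
    bytes.push(byte);
  }
  return new TextDecoder("windows-1251").decode(Uint8Array.from(bytes));
}

// Arabic + Latin-1 mix, written with escapes to keep the code points unambiguous.
const sample = "\u0627\u0646\u0647\u064C\u00FC \u064A\u0647\u064D";
console.log(demangleToCyrillic(sample)); // prints "Здесь нет"
```

Returning null on an unmappable character keeps the helper conservative: text that was never mangled this way is left for the caller to pass through untouched.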
What this PR does
webRequest.onBeforeRequest in Electron can only redirect or cancel a request; it cannot rewrite the response body. To get UTF-8 bytes into the embed iframe, the PR introduces a tiny loopback HTTP server (same pattern as the existing _playerServer in src/ipc/allmanga.js) that:
1. Fetches each subtitle file.
2. Detects and repairs the encoding, including the Wyzie de-mangle.
3. Re-serves the file as UTF-8, with a Content-Type and a forced charset=utf-8.
main.js registers the proxy on app.whenReady() and redirects *.vtt, *.srt, and wyzie.{io,ru}/c/* requests in persist:player through it.
Coverage
Works for all CP1251-encoded Cyrillic languages: Russian, Ukrainian, Belarusian, Bulgarian, Serbian (Cyrillic), Macedonian.
Files changed
src/ipc/subtitleProxy.js (new, ~210 lines): the loopback transcoder + Wyzie de-mangle.
index.js: register the proxy at startup; widen the player session URL filter to include .srt and the Wyzie /c/* path; redirect detected subtitle requests through the proxy. The Wyzie pattern is intentionally narrowed to /c/* so the unrelated /search endpoint isn't dragged into the media-only branch (where it would be cancelled as if it were an ad).
How to verify
Before this PR: subtitle overlay shows _ُيَآè û Ĵ_ ٯو_ُ àçñôëàé! or âïˆâûˆ ïî_âèëàü.
After this PR: subtitle overlay shows the actual Russian translation (Здесь нет., Где же ты?, etc.).
Regression-tested against the same Naruto S1E1 episode for Russian (broken→fixed), English, Portuguese, French (with diacritics), and Arabic (real Arabic file). Only the Russian file gets touched; everything else passes through bit-for-bit.
Notes
http and built-in TextDecoder/fetch.