
Fix mojibake on Cyrillic subtitles (CP1251 + Wyzie broken UTF-8) #89

Closed
nevatas wants to merge 2 commits into truelockmc:main from nevatas:fix/cyrillic-subtitle-encoding


@nevatas nevatas commented May 19, 2026

What you'll see today

Pick any movie/episode that has Russian (or another Cyrillic) subtitles in Videasy / VidSrc / 2Embed. The subtitle overlay renders as garbage instead of Cyrillic — either Latin-1 mojibake (âïˆâûˆ ïî_âèëàü) or a mix of Arabic + Latin (انهٌü يهٍ), depending on which mangling the source CDN applied.

Root cause

Two distinct, but overlapping, issues in the third-party subtitle CDNs the embed players load files from:

  1. Legacy single-byte encodings without a charset. Several CDNs still serve Russian subtitles as raw CP1251 bytes with Content-Type: text/plain (no charset=). Browsers fall back to UTF-8 / Latin-1 and render mojibake.

  2. sub.wyzie.io has a broken CP1251 → UTF-8 converter. Its encoding=UTF-8 pipeline interprets each source byte using a mix of CP1252 (for the 0xA0-0xFF range) and CP1256 (which “wins” for bytes that overlap with the Arabic block in 0xC0-0xDF). The output is valid UTF-8 but reads as Arabic + Latin-1-supplement gibberish instead of Cyrillic. Re-requesting with encoding=cp1251, encoding=windows-1251, encoding=raw, encoding=original, or no encoding parameter at all returns the exact same broken bytes, so there is no upstream parameter we can flip to bypass it.

Verified by fetching a real Russian sub URL and reversing the mangling byte-by-byte:

انهٌü يهٍ.                  →   Здесь нет.
أنه وه ٍû?                  →   Где же ты?
حàêîيهِ-ٍî ىû نîلًàëèٌü ٌ‏نà!  →   Наконец-то мы добрались сюда!
رàٌêه!                      →   Саске!

That recovery is exactly what this PR does at runtime.
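
A minimal sketch of that byte-by-byte reversal, not the PR's actual code. The helper names `reverseTable` and `demangle` are hypothetical, and it assumes a Node build with full ICU so that `TextDecoder` accepts the `windows-1251`, `windows-1252`, and `windows-1256` labels. Following the description above, the CP1256 table is tried first and CP1252 is the fallback.

```javascript
// Build a char -> byte lookup for a single-byte encoding by decoding every
// high byte once. Assumes TextDecoder knows the label (full-ICU Node build).
function reverseTable(label) {
  const decoder = new TextDecoder(label);
  const table = new Map();
  for (let b = 0x80; b <= 0xff; b++) {
    const ch = decoder.decode(Uint8Array.of(b));
    if (!table.has(ch)) table.set(ch, b);
  }
  return table;
}

const CP1256_REVERSE = reverseTable("windows-1256");
const CP1252_REVERSE = reverseTable("windows-1252");
const cp1251Decoder = new TextDecoder("windows-1251");

// Hypothetical helper: map each mangled character back to the byte it came
// from (CP1256 first, CP1252 as fallback), then decode those bytes as CP1251.
// Returns null when a character cannot be mapped back to a single byte.
function demangle(text) {
  const bytes = [];
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    if (codePoint < 0x80) {
      bytes.push(codePoint); // ASCII is identical in all three code pages
      continue;
    }
    const byte = CP1256_REVERSE.get(ch) ?? CP1252_REVERSE.get(ch);
    if (byte === undefined) return null;
    bytes.push(byte);
  }
  return cp1251Decoder.decode(Uint8Array.from(bytes));
}
```

Calling `demangle` on the first sample line above should give back the Russian text; input containing characters outside both tables (CJK, for instance) yields `null` and can be passed through untouched.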

What this PR does

webRequest.onBeforeRequest in Electron can only redirect or cancel a request — it cannot rewrite the response body. To get UTF-8 bytes into the embed iframe, the PR introduces a tiny loopback HTTP server (same pattern as the existing _playerServer in src/ipc/allmanga.js) that:

  • accepts the upstream URL as a base64url query param,
  • fetches the file (preserving referer for CDNs that need it),
  • detects the encoding (declared charset → UTF-8 BOM → strict UTF-8 → CP1251/CP1252 heuristic),
  • when the strict UTF-8 step succeeds but the text shows the unique Wyzie-mangle signature (Arabic + Latin-1 supplement chars together, no real Cyrillic), reverses the mangling by mapping each character back through CP1252→byte / CP1256→byte tables and decoding the resulting byte stream as CP1251,
  • accepts the recovery only when it produces ≥30% Cyrillic chars (so legitimate Arabic, French, etc. subtitles pass through unchanged),
  • re-serves the file with the original Content-Type and a forced charset=utf-8.
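
The detection order in the list above could be sketched roughly as follows. This is an illustration under assumptions, not the code in src/ipc/subtitleProxy.js: `detectEncoding` and `toUtf8Text` are hypothetical names, and the final CP1251/CP1252 heuristic (share of 0xC0-0xFF bytes among letter-like bytes, threshold 0.5) is a stand-in chosen for this example.

```javascript
// Hypothetical detector following the order described above:
// declared charset -> UTF-8 BOM -> strict UTF-8 -> CP1251/CP1252 heuristic.
function detectEncoding(bytes, contentType = "") {
  const declared = /charset\s*=\s*"?([\w.:-]+)/i.exec(contentType);
  if (declared) {
    try {
      new TextDecoder(declared[1]); // throws RangeError for unknown labels
      return declared[1].toLowerCase();
    } catch {
      // Unknown label: ignore the header and keep sniffing.
    }
  }

  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return "utf-8";

  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return "utf-8";
  } catch {
    // Not valid UTF-8: fall through to the single-byte heuristic.
  }

  // Assumed heuristic: CP1251 keeps the basic Cyrillic letters in 0xC0-0xFF,
  // so text whose letters mostly sit there is probably Cyrillic, while
  // accented Latin text uses that range only occasionally.
  let letters = 0;
  let upperRange = 0;
  for (const b of bytes) {
    const asciiLetter = (b >= 0x41 && b <= 0x5a) || (b >= 0x61 && b <= 0x7a);
    if (b >= 0xc0) upperRange++;
    if (asciiLetter || b >= 0xc0) letters++;
  }
  return letters > 0 && upperRange / letters >= 0.5 ? "windows-1251" : "windows-1252";
}

// Decode with whatever was detected; the result is an ordinary JS string
// that can be re-encoded as UTF-8 when the file is re-served.
function toUtf8Text(bytes, contentType) {
  return new TextDecoder(detectEncoding(bytes, contentType)).decode(bytes);
}
```

Checking strict UTF-8 before the single-byte heuristic matters because valid UTF-8 Cyrillic would otherwise be misread as CP1251 and double-decoded.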

main.js registers the proxy on app.whenReady() and redirects *.vtt, *.srt, and wyzie.{io,ru}/c/* requests in persist:player through it.
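
A rough sketch of how such a redirect could be wired up. Everything named here is an assumption for illustration: the helper names, the `/sub` path, the `u` query parameter, and the catch-all URL filter are not taken from the PR. `playerSession` is expected to be an Electron `Session` (for example the one behind `persist:player`), so the only Electron API used is `webRequest.onBeforeRequest` with a `redirectURL` response.

```javascript
// Hypothetical matcher for the subtitle requests described above:
// *.vtt, *.srt, and wyzie.{io,ru}/c/* (but not Wyzie's /search endpoint).
function isSubtitleUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false;
  }
  if (!/^https?:$/.test(url.protocol)) return false;
  if (url.hostname === "127.0.0.1") return false; // never re-proxy the proxy itself
  if (/\.(vtt|srt)$/i.test(url.pathname)) return true;
  const isWyzie = /(^|\.)wyzie\.(io|ru)$/i.test(url.hostname);
  return isWyzie && url.pathname.startsWith("/c/");
}

// The upstream URL travels as a base64url query param so it needs no extra
// escaping. The "/sub" path and "u" parameter name are assumptions.
function toProxyUrl(upstreamUrl, port) {
  const encoded = Buffer.from(upstreamUrl, "utf8").toString("base64url");
  return `http://127.0.0.1:${port}/sub?u=${encoded}`;
}

// webRequest can only redirect or cancel, so matching requests are pointed
// at the loopback proxy and everything else is left alone.
function registerSubtitleRedirect(playerSession, port) {
  playerSession.webRequest.onBeforeRequest({ urls: ["*://*/*"] }, (details, callback) => {
    if (isSubtitleUrl(details.url)) {
      callback({ redirectURL: toProxyUrl(details.url, port) });
    } else {
      callback({});
    }
  });
}
```

On the proxy side, the original URL can be recovered with `Buffer.from(param, "base64url").toString("utf8")`.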

Coverage

Works for all CP1251-encoded Cyrillic languages: Russian, Ukrainian, Belarusian, Bulgarian, Serbian (Cyrillic), Macedonian.

Files changed

  • src/ipc/subtitleProxy.js (new, ~210 lines) — the loopback transcoder + Wyzie de-mangle.
  • main.js — register the proxy at startup; widen the player session URL filter to include .srt and the Wyzie /c/* path; redirect detected subtitle requests through the proxy. The Wyzie pattern is intentionally narrowed to /c/* so the unrelated /search endpoint isn't dragged into the media-only branch (where it would be cancelled as if it were an ad).

How to verify

  1. Pick any movie / TV episode and switch the source to Videasy (also reproducible on VidSrc / 2Embed when Cyrillic subs are available).
  2. Enable Russian subtitles.

Before this PR: subtitle overlay shows _ُيَآè û Ĵ_ ٯو_ُ àçñôëàé! or âïˆâûˆ ïî_âèëàü.
After this PR: subtitle overlay shows the actual Russian translation (Здесь нет., Где же ты?, etc.).

Regression-tested against the same Naruto S1E1 episode for Russian (broken→fixed), English, Portuguese, French (with diacritics), and Arabic (real Arabic file) — only the Russian file gets touched; everything else passes through bit-for-bit.

Notes

  • The fix is a workaround for an upstream bug in Wyzie's converter — once Wyzie stops mangling files, the de-mangle path will simply stop triggering (signature requires Arabic + Latin-1 supplement chars together; properly-encoded UTF-8 Cyrillic doesn't match it).
  • No new dependencies. Uses http and built-in TextDecoder/fetch.
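
The trigger and the post-recovery check described in these notes might look something like the sketch below. The function names are hypothetical, and two details are assumptions: the Unicode ranges used for "Arabic", "Latin-1 supplement", and "Cyrillic", and measuring the 30% ratio against non-whitespace characters (the PR text does not say what the denominator is).

```javascript
const countMatches = (text, re) => (text.match(re) || []).length;

// Signature of the mangle: Arabic AND Latin-1 supplement characters in the
// same text, with no real Cyrillic anywhere.
function looksWyzieMangled(text) {
  const arabic = countMatches(text, /[\u0600-\u06FF]/g);
  const latin1Supplement = countMatches(text, /[\u00A0-\u00FF]/g);
  const cyrillic = countMatches(text, /[\u0400-\u04FF]/g);
  return arabic > 0 && latin1Supplement > 0 && cyrillic === 0;
}

// Share of Cyrillic characters among non-whitespace characters (assumed
// denominator).
function cyrillicRatio(text) {
  const visible = countMatches(text, /\S/gu);
  return visible === 0 ? 0 : countMatches(text, /[\u0400-\u04FF]/g) / visible;
}

// Accept a recovery only when the input matched the signature and the
// output is at least 30% Cyrillic; otherwise the caller keeps the original.
function acceptRecovery(original, recovered) {
  return (
    looksWyzieMangled(original) &&
    typeof recovered === "string" &&
    cyrillicRatio(recovered) >= 0.3
  );
}
```

Under these assumptions, properly encoded Cyrillic fails the first test (it contains real Cyrillic), plain Arabic fails it for lack of Latin-1 supplement characters, and French fails it for lack of Arabic, which matches the pass-through behaviour described above.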

Embed players (Videasy, VidSrc, 2Embed) fetch subtitle files directly
from third-party CDNs. Two problems made Cyrillic subtitles unreadable:

  1. Some CDNs serve files in legacy single-byte encodings (CP1251 for
     Russian) without a charset header. Browsers fall back to UTF-8 /
     Latin-1 and render mojibake.

  2. sub.wyzie.io converts CP1251 to "UTF-8" with a broken pipeline that
     mixes CP1252/CP1256 char tables. The result is valid UTF-8 but reads
     as Arabic + Latin-1 supplement gibberish instead of Cyrillic.

Add a small loopback HTTP server (src/ipc/subtitleProxy.js) that fetches
each subtitle file, detects/repairs the encoding (incl. reversing the
Wyzie mangle by mapping each char back through CP1252/CP1256 reverse
tables and decoding the resulting byte stream as CP1251), and re-serves
the file as UTF-8. main.js redirects subtitle requests in persist:player
through the proxy via webRequest, which can only redirect — not rewrite
response bodies.

The Wyzie fix triggers only on the unique mojibake signature (Arabic AND
Latin-1 supplement chars in the same file, no real Cyrillic) and is
validated post-recovery (≥30% Cyrillic chars), so legitimate Arabic,
French, etc. subtitles pass through unchanged.

Covers all CP1251-encoded Cyrillic languages: Russian, Ukrainian,
Belarusian, Bulgarian, Serbian (Cyrillic), Macedonian.

Also widens the playerSession URL filter to catch .srt files and Wyzie's
/c/<hash>/id/<id> path (no extension), and narrows the Wyzie pattern to
/c/* so the /search endpoint isn't dragged into the media-only branch
(which would cancel it as if it were an ad).
Comment thread src/ipc/subtitleProxy.js Fixed
CodeQL flagged the fetch() call as a server-side request forgery sink
(github-advanced-security truelockmc#13) because the target URL is reconstructed
from a base64 query param. The proxy server binds to 127.0.0.1 only,
but any code that can reach it (any embed iframe in the player session)
could pivot through our Node process to scan the user's loopback / LAN
or hit cloud-metadata services.

Add isPrivateHostname() that rejects loopback / link-local / RFC1918
IPv4, IPv6 ULA + link-local + IPv4-mapped loopback, plus *.local /
*.internal / *.localhost hostnames. Apply it both to the initial
target URL and to every redirect hop (manual redirect handling, so a
public CDN can't 30x us into a private address either).

Validated locally:

  ✓ 400 ← http://127.0.0.1:22/
  ✓ 400 ← http://localhost:8080/
  ✓ 400 ← http://169.254.169.254/  (cloud metadata)
  ✓ 400 ← http://192.168.1.1/
  ✓ 200 ← http://example.com/      (public, still works)
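
For illustration, a guard along these lines could be written as below. It is a sketch and not the committed isPrivateHostname(): the helper `isPrivateIPv4` is hypothetical, the check works on the literal hostname only (no DNS resolution), and it assumes the hostname comes from `new URL(...).hostname`, which brackets IPv6 literals and rewrites IPv4-mapped addresses into hex form.

```javascript
// Hypothetical helper: true for loopback, link-local, RFC1918, "this network",
// and multicast/reserved dotted-quad literals.
function isPrivateIPv4(host) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 || a === 10 || a === 127 ||   // "this" network, RFC1918, loopback
    (a === 169 && b === 254) ||           // link-local, incl. cloud metadata
    (a === 172 && b >= 16 && b <= 31) ||  // RFC1918
    (a === 192 && b === 168) ||           // RFC1918
    a >= 224                              // multicast / reserved
  );
}

function isPrivateHostname(hostname) {
  const host = hostname.toLowerCase().replace(/^\[|\]$/g, "").replace(/\.$/, "");
  if (host === "localhost" || /\.(local|internal|localhost)$/.test(host)) return true;
  if (isPrivateIPv4(host)) return true;
  if (!host.includes(":")) return false; // not an IPv6 literal

  if (host === "::1" || host === "::") return true; // loopback / unspecified
  if (/^f[cd]/.test(host)) return true;             // unique-local fc00::/7
  if (/^fe[89ab]/.test(host)) return true;          // link-local fe80::/10
  if (/^ff/.test(host)) return true;                // multicast

  // IPv4-mapped addresses, in dotted or hex form (::ffff:7f00:1).
  const dotted = /^::ffff:(\d+\.\d+\.\d+\.\d+)$/.exec(host);
  if (dotted) return isPrivateIPv4(dotted[1]);
  const hex = /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/.exec(host);
  if (hex) {
    const hi = parseInt(hex[1], 16);
    const lo = parseInt(hex[2], 16);
    return isPrivateIPv4(`${hi >> 8}.${hi & 255}.${lo >> 8}.${lo & 255}`);
  }
  return false;
}
```

A literal check like this does not stop a public hostname whose DNS record points at a private address; covering that case would need a resolution step, which this sketch leaves out.
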
Author

nevatas commented May 20, 2026

@truelockmc — quick heads-up on the CodeQL SSRF alert above:

Commit a01d356 addresses it. The original review now shows as Outdated because it pointed at the old fetch(target, { redirect: "follow" }) call, which has been replaced with an isPrivateHostname()-guarded wrapper:

  • The target URL parsed from the base64 query param is now rejected if its host resolves to loopback (127.0.0.0/8, ::1), link-local (169.254/16, fe80::/10), RFC1918 (10/8, 172.16/12, 192.168/16), unique-local (fc00::/7), multicast / reserved, or hostnames like localhost, *.local, *.internal, *.localhost.
  • Redirects are now followed manually (max 5 hops) and each Location is re-checked through the same guard, so a public CDN can't 30x us into a private address.
  • Validated locally:
    ✓ 400 ← http://127.0.0.1:22/
    ✓ 400 ← http://localhost:8080/
    ✓ 400 ← http://169.254.169.254/   (cloud metadata)
    ✓ 400 ← http://192.168.1.1/
    ✓ 200 ← http://example.com/        (public, still works)
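
The manual redirect handling described above can be sketched as a small wrapper. This is an assumption-laden illustration rather than the code in a01d356: `guardedFetch` is a hypothetical name, the guard is passed in as `isBlocked` (any `hostname => boolean` function such as isPrivateHostname), and `fetchImpl` exists only so the loop can be exercised without network access.

```javascript
// Hypothetical wrapper: follow redirects by hand and re-check every hop
// through the same guard before issuing the next request.
async function guardedFetch(target, isBlocked, options = {}) {
  const { maxHops = 5, fetchImpl = fetch, headers = {} } = options;
  let current = new URL(target);

  for (let hop = 0; hop <= maxHops; hop++) {
    if (!/^https?:$/.test(current.protocol) || isBlocked(current.hostname)) {
      throw new Error(`Blocked target: ${current.href}`);
    }
    const res = await fetchImpl(current, { redirect: "manual", headers });
    const location = res.headers.get("location");
    const isRedirect = res.status >= 300 && res.status < 400 && location;
    if (!isRedirect) return res;
    current = new URL(location, current); // resolve relative Location values
  }
  throw new Error("Too many redirects");
}
```

The proxy would turn a thrown "Blocked target" error into the 400 responses listed above. With `maxHops = 5` the loop issues at most six requests: the initial one plus five redirect hops.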
    

The PR is currently blocked on "1 workflow awaiting approval" — could you approve the workflow run so CodeQL can re-scan? It should clear the alert against a01d356 automatically once it runs.

Comment thread src/ipc/subtitleProxy.js
Comment on lines +72 to +79
const res = await fetch(current, {
  redirect: "manual",
  headers: {
    "User-Agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    ...(referer ? { Referer: referer } : {}),
  },
});
@truelockmc
Owner

This is kind of suspicious.
A newly created account and the only thing he did is approve this PR?!

And the PR has security problems?!

I think I will close this for now

@truelockmc truelockmc closed this May 22, 2026