Fix GH-21734: WHATWG URL parser accepts overlong UTF-8 and invalid continuation bytes #34

Closed
iliaal wants to merge 103 commits into master from fix/gh-21734-lexbor-utf8-validation

@iliaal iliaal commented Apr 12, 2026

Fixes php#21734.

lxb_encoding_decode_valid_utf_8_single() skipped all UTF-8 validation (continuation byte range, overlong sequences, surrogates), trusting the caller to pass valid input. The URL parser calls it on untrusted user input at 7 sites, and the IDNA code calls it on percent-decoded hostname bytes at 2 more.

Overlong encodings of ASCII characters in hostnames passed through IDNA processing as their target code points, producing valid domains from byte sequences that look nothing like the canonical form. For example, %C1%A5%C1%B6%C1%A9%C1%AC.com resolved to evil.com. Chrome, Firefox, and Safari reject overlong sequences at the UTF-8 decode step.
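To make the example concrete, here is a minimal sketch of what an unvalidated 2-byte decode does with those bytes. The helpers `naive_decode_2byte` and `naive_decode_pairs` are hypothetical names written for illustration; they are not lexbor code, only the same bit arithmetic with every check left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not lexbor code: decode a 2-byte UTF-8 sequence
 * with no validation at all. The lead byte contributes its low 5 bits,
 * the continuation byte its low 6 bits.
 */
uint32_t naive_decode_2byte(const unsigned char *p)
{
    return ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
}

/*
 * Decode n_pairs consecutive 2-byte sequences into out.
 * out must have room for n_pairs + 1 bytes (NUL terminator included).
 * Assumes every decoded value fits in one char, which holds for the
 * overlong-ASCII case shown here.
 */
void naive_decode_pairs(const unsigned char *in, size_t n_pairs, char *out)
{
    for (size_t i = 0; i < n_pairs; i++) {
        out[i] = (char)naive_decode_2byte(in + 2 * i);
    }
    out[n_pairs] = '\0';
}
```

Feeding the percent-decoded bytes C1 A5 C1 B6 C1 A9 C1 AC through `naive_decode_pairs` yields the string "evil": 0xC1 contributes 0x40 and each continuation byte supplies the low six bits of the ASCII letter. A conforming decoder rejects 0xC1 as a lead byte because every value it could encode already has a 1-byte form.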

The fix adds the missing validation to lxb_encoding_decode_valid_utf_8_single():

  • 2-byte: reject lead bytes < 0xC2 (overlong), validate the continuation byte range
  • 3-byte: validate continuations, reject lead 0xE0 followed by a second byte < 0xA0 (overlong), reject lead 0xED followed by a second byte > 0x9F (surrogates)
  • 4-byte: reject lead bytes > 0xF4, validate continuations, reject lead 0xF0 followed by a second byte < 0x90 (overlong), reject lead 0xF4 followed by a second byte > 0x8F (> U+10FFFF)
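
The rules in the list above can be sketched as a single strict decode step. This is an illustrative reconstruction under stated assumptions, not the patched lexbor function: the name `utf8_decode_strict`, its signature, and the "return 0 on error" convention are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch, not the lexbor implementation.
 * Decodes one code point from [p, end). Returns the number of bytes
 * consumed (1..4) and stores the code point in *cp, or returns 0 if
 * the sequence is invalid or truncated.
 */
size_t utf8_decode_strict(const unsigned char *p, const unsigned char *end,
                          uint32_t *cp)
{
    size_t len, i;
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the 2nd byte */
    uint32_t c;

    if (p >= end) {
        return 0;
    }

    if (p[0] < 0x80) {                          /* ASCII */
        *cp = p[0];
        return 1;
    } else if (p[0] >= 0xC2 && p[0] <= 0xDF) {  /* 2-byte; 0xC0/0xC1 are overlong */
        len = 2;
        c = p[0] & 0x1F;
    } else if (p[0] >= 0xE0 && p[0] <= 0xEF) {  /* 3-byte */
        len = 3;
        c = p[0] & 0x0F;
        if (p[0] == 0xE0) lo = 0xA0;            /* below 0xA0 would be overlong */
        if (p[0] == 0xED) hi = 0x9F;            /* above 0x9F would be a surrogate */
    } else if (p[0] >= 0xF0 && p[0] <= 0xF4) {  /* 4-byte; leads above 0xF4 rejected */
        len = 4;
        c = p[0] & 0x07;
        if (p[0] == 0xF0) lo = 0x90;            /* below 0x90 would be overlong */
        if (p[0] == 0xF4) hi = 0x8F;            /* above 0x8F exceeds U+10FFFF */
    } else {
        return 0;                               /* stray continuation or invalid lead */
    }

    if ((size_t)(end - p) < len) {
        return 0;                               /* truncated sequence */
    }
    if (p[1] < lo || p[1] > hi) {
        return 0;                               /* second byte outside the allowed range */
    }
    c = (c << 6) | (uint32_t)(p[1] & 0x3F);

    for (i = 2; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) {
            return 0;                           /* not a continuation byte */
        }
        c = (c << 6) | (uint32_t)(p[i] & 0x3F);
    }

    *cp = c;
    return len;
}
```

Narrowing the allowed range of the second byte per lead byte covers the overlong, surrogate, and out-of-range cases with one comparison, so no check on the decoded value is needed afterwards. With this sketch, C1 A5 and ED A0 80 are rejected, while C3 A9 decodes to U+00E9.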

On error, the decoder advances by 1 byte (not the full sequence length) so the next byte gets its own decode attempt, matching browser behavior.
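
A small sketch of why the 1-byte advance matters. The names `seq_len_checked` and `scan_utf8` are hypothetical. To stay short, the stand-in validator checks only the lead byte and the continuation-byte pattern; it deliberately omits the overlong and surrogate checks from the list above.

```c
#include <stddef.h>

/*
 * Simplified, hypothetical stand-in validator: returns the sequence
 * length (1..4) if the lead byte is acceptable and all following bytes
 * are continuation bytes, otherwise 0. Overlong and surrogate checks
 * are intentionally left out of this sketch.
 */
size_t seq_len_checked(const unsigned char *p, const unsigned char *end)
{
    size_t len, i;

    if (p[0] < 0x80) return 1;
    else if (p[0] >= 0xC2 && p[0] <= 0xDF) len = 2;
    else if (p[0] >= 0xE0 && p[0] <= 0xEF) len = 3;
    else if (p[0] >= 0xF0 && p[0] <= 0xF4) len = 4;
    else return 0;

    if ((size_t)(end - p) < len) return 0;
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
    }
    return len;
}

/*
 * Walk the buffer, counting valid code points and decode errors.
 * On error the cursor moves forward by exactly one byte, so the byte
 * that broke the sequence gets its own decode attempt.
 */
void scan_utf8(const unsigned char *p, size_t n, size_t *valid, size_t *errors)
{
    const unsigned char *end = p + n;

    *valid = 0;
    *errors = 0;
    while (p < end) {
        size_t len = seq_len_checked(p, end);
        if (len == 0) {
            (*errors)++;
            p += 1;     /* resynchronize on the very next byte */
        } else {
            (*valid)++;
            p += len;
        }
    }
}
```

For the input E2 41 42, the lead byte 0xE2 announces a 3-byte sequence, but 0x41 is not a continuation byte. Advancing by one byte reports one error and then decodes "A" and "B" normally. Skipping the full announced length would swallow both ASCII characters along with the bad lead byte.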

Test results: 309/309 ext/uri tests pass, 846/846 ext/dom tests pass (19 skipped, same as baseline). No regressions.

iluuu1994 and others added 30 commits April 12, 2026 00:20
stream_socket_accept($server, 3) would frequently run into a race-condition
where the call would timeout and return false, triggering an exception when
calling fclose(false) and terminating the process. This would break the
phpt_notify() call in the main process due to a broken pipe. Increase the
timeout to solve this.

Furthermore, remove the proxy in the test that is not necessary to trigger the
original bug solved in 7782b88.

Closes phpGH-21692
* PHP-8.4:
  Simplify gh21031.phpt and solve flakiness
* PHP-8.5:
  Simplify gh21031.phpt and solve flakiness
ramsey and others added 29 commits April 11, 2026 21:20
… continuation bytes

lxb_encoding_decode_valid_utf_8_single() skipped all UTF-8 validation
(continuation byte range, overlong sequences, surrogates), trusting
the caller to pass valid input. The URL parser calls it on untrusted
user input at 7 sites, and the IDNA code calls it on percent-decoded
hostname bytes at 2 more.

Overlong ASCII characters in hostnames passed through IDNA processing
as their target codepoints, producing valid domains from byte sequences
that look nothing like the canonical form (e.g., %C1%A5%C1%B6... →
"evil.com"). Chrome, Firefox, and Safari reject these at the UTF-8
decode step.

Add the missing validation to decode_valid_utf_8_single:
- 2-byte: reject lead bytes < 0xC2 (overlong), validate continuation
- 3-byte: validate continuations, reject 0xE0 + < 0xA0 (overlong),
  reject 0xED + > 0x9F (surrogates)
- 4-byte: reject lead > 0xF4, validate continuations, reject
  0xF0 + < 0x90 (overlong), reject 0xF4 + > 0x8F (> U+10FFFF)

On error, advance by 1 byte (not the full sequence length) so the
next byte gets its own decode attempt, matching browser behavior.

Closes phpGH-21734
@iliaal iliaal closed this Apr 12, 2026