Arbitrary HTML present after sanitization because of unicode normalization
High severity
GitHub Reviewed
Published May 5, 2024 in matthiask/html-sanitizer • Updated May 6, 2024
Published to the GitHub Advisory Database: May 6, 2024
Reviewed: May 6, 2024
Last updated: May 6, 2024
Description
Impact
If using keep_typographic_whitespace=False (which is the default), the sanitizer normalizes Unicode to the NFKC form at the end. Some Unicode characters normalize to chevrons (< and >); this allows specially crafted HTML to escape sanitization.
Patches
The problem has been fixed in 2.4.2.
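The normalization behavior described under Impact can be reproduced with Python's standard library alone. The sketch below does not call html-sanitizer. It only shows that NFKC maps some non-ASCII lookalike characters to ASCII chevrons, so text containing no chevrons before normalization can contain markup afterwards. The helper name nfkc and the sample payload are illustrative assumptions, not taken from the advisory.

```python
import unicodedata

# Characters whose NFKC compatibility mapping is an ASCII chevron.
LOOKALIKES = {
    "\uFE64": "<",  # SMALL LESS-THAN SIGN
    "\uFE65": ">",  # SMALL GREATER-THAN SIGN
    "\uFF1C": "<",  # FULLWIDTH LESS-THAN SIGN
    "\uFF1E": ">",  # FULLWIDTH GREATER-THAN SIGN
}


def nfkc(text: str) -> str:
    """Hypothetical helper: apply NFKC normalization to a string."""
    return unicodedata.normalize("NFKC", text)


# Each lookalike collapses to the ASCII chevron it resembles.
for char, chevron in LOOKALIKES.items():
    assert nfkc(char) == chevron

# Illustrative payload: contains no ASCII chevrons before normalization.
payload = "\uFF1Cb\uFF1Ehi\uFF1C/b\uFF1E"
assert "<" not in payload and ">" not in payload

# After normalization the same text is ordinary HTML markup.
normalized = nfkc(payload)
print(normalized)  # <b>hi</b>
```

A filter that inspects the text before this normalization step sees no tags at all, which is why the order of sanitizing and normalizing matters here.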
Workarounds
Set keep_typographic_whitespace=True explicitly, or normalize the input to NFKC yourself before passing it to the sanitizer.
References