@Crozzers Crozzers commented Sep 22, 2024

This PR fixes #601 by making the HTML tokenisation regex more lenient.

Currently, the regex matches HTML tags and their attributes, using [\w-] for the attribute name. However, the HTML spec says:

Attribute names must consist of one or more characters other than controls, U+0020 SPACE, U+0022 ("), U+0027 ('), U+003E (>), U+002F (/), U+003D (=), and noncharacters

The current character class does not cover all of these possibilities, so I've updated it to exclude the banned characters instead: [^<>"'=/].
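
To make the difference concrete, here is a minimal sketch comparing the two character classes on a few attribute names. The pattern and helper names (`OLD_ATTR_NAME`, `NEW_ATTR_NAME`, `is_whole_match`) are made up for this illustration, and the classes are tested in isolation rather than inside the library's full tokenisation regex.

```python
import re

# Hypothetical stand-alone patterns, for illustration only.
# These are just the two character classes discussed above,
# not the full regex used by the library.
OLD_ATTR_NAME = re.compile(r'[\w-]+')
NEW_ATTR_NAME = re.compile(r'''[^<>"'=/]+''')


def is_whole_match(pattern, name):
    """Return True if the whole string is accepted as an attribute name."""
    return pattern.fullmatch(name) is not None


if __name__ == '__main__':
    for name in ('onerror', 'data-id', '@click', 'xlink:href', 'a=b'):
        print(
            f'{name!r}: old={is_whole_match(OLD_ATTR_NAME, name)} '
            f'new={is_whole_match(NEW_ATTR_NAME, name)}'
        )
```

Names such as `@click` or `xlink:href` are rejected by the old class but accepted by the new one, while a string containing `=` or a quote is still rejected.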

I also tweaked how attribute values are matched. The current regex only allows quoted values, which lets src=# onerror=alert() slip past. I've added another clause to the regex that matches unquoted attribute values, as long as they don't contain a space (because a space would start the next attribute).
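
As a rough sketch of the unquoted-value case, the simplified patterns below show a quoted-only tag matcher failing to recognise `<img src=# onerror=alert()>` as a tag, and a variant with an extra unquoted-value alternative recognising it. `QUOTED_ONLY`, `WITH_UNQUOTED` and `is_tag` are hypothetical names, and these patterns are much simpler than the real tokenisation regex; they are assumptions made for the example, not the code in this PR.

```python
import re

# Simplified, hypothetical tag patterns for illustration only.
# Attribute values: double-quoted or single-quoted.
QUOTED_ONLY = re.compile(
    r'''<\w+(?:\s+[\w-]+=(?:"[^"]*"|'[^']*'))*\s*/?>'''
)
# Same pattern plus a third alternative for unquoted values, which
# may not contain whitespace, quotes or '>'.
WITH_UNQUOTED = re.compile(
    r'''<\w+(?:\s+[\w-]+=(?:"[^"]*"|'[^']*'|[^\s"'>]+))*\s*/?>'''
)


def is_tag(pattern, text):
    """Return True if the whole string is recognised as a single tag."""
    return pattern.fullmatch(text) is not None


if __name__ == '__main__':
    payload = '<img src=# onerror=alert()>'
    quoted = '<img src="a.png" alt=\'x\'>'
    for text in (payload, quoted):
        print(
            f'{text}: quoted_only={is_tag(QUOTED_ONLY, text)} '
            f'with_unquoted={is_tag(WITH_UNQUOTED, text)}'
        )
```

A tokeniser that cannot see the payload as a tag cannot hand it to whatever escaping or filtering is applied to tags, which is the gap described above.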

@nicholasserra

Thank you!

@nicholasserra nicholasserra merged commit cc432bf into trentm:master Sep 23, 2024
15 checks passed
Development

Successfully merging this pull request may close these issues.

Safe mode XSS

2 participants