HTML Entities cause split after the next character #511
Comments
Interestingly, with just one small change, the string splits at slightly different boundaries.
- let html_string = r#"&quot;ABC&quot;&gt;DE"#;
+ let html_string = r#"ABC&quot;&gt;DE"#;
Output:
This is very unusual. My best explanation is that I can replicate the same behavior with
I think I see why this is happening. When processing named entities in html5ever/html5ever/src/tokenizer/char_ref/mod.rs Lines 275 to 279 in 1ae2de3
Then in
I made a few attempts at fixing this, but each one caused other tests to fail, so I'm afraid I'm not familiar enough with the code to draft a proper fix. I see that @untitaker has most recently interacted with this code, so perhaps they have some ideas?
It's not technically wrong for the tokenizer to produce multiple character tokens, and I don't think html5ever takes particular care in general to merge consecutive character tokens into a single token. The behavior is purely oriented around what is most convenient and most efficient for the tokenizer itself. In the HTML spec, a character token can only contain one character; the fact that a character token in html5ever can contain multiple characters is just a performance optimization for when a large buffer needs to be forwarded as-is. I believe there are more cases where character tokens are split up arbitrarily, and they might be harder to predict. I think it's better if users of the tokenizer handle this transparently.
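To illustrate what handling this on the consumer side could look like, here is a minimal sketch that merges runs of adjacent character tokens. The `Token` enum and the `coalesce_characters` function are hypothetical stand-ins made up for this example; they are not html5ever's actual types or API, and a real sink would more likely do the merging incrementally as tokens arrive.

```rust
/// Hypothetical, simplified token type for illustration only.
/// This is NOT html5ever's real `Token` type.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    /// A run of text, possibly more than one character long.
    Characters(String),
    /// Any non-text token; only the tag name is kept here.
    Tag(String),
}

/// Merge consecutive `Characters` tokens into a single token,
/// leaving every other token untouched and in order.
fn coalesce_characters(tokens: impl IntoIterator<Item = Token>) -> Vec<Token> {
    let mut out: Vec<Token> = Vec::new();
    for token in tokens {
        if let Token::Characters(next) = &token {
            // If the previous output token is also text, append to it
            // instead of pushing a new token.
            if let Some(Token::Characters(prev)) = out.last_mut() {
                prev.push_str(next);
                continue;
            }
        }
        // First token, a non-text token, or text following a non-text token.
        out.push(token);
    }
    out
}
```

Fed the seven character tokens reported in this issue, the sketch yields a single `Characters` token containing `"ABC">DE`. A non-text token between two runs of text keeps those runs separate.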
I'm generally in agreement RE: correctness. To me this is more of a performance concern. I see two potential sources of overhead:
I'm going to play around with the code some more and see if I can get all tests to pass while avoiding extra splitting of the input buffer.
I noticed this behaviour in https://github.com/rusterlium/html5ever_elixir, but it seems to be part of this excellent project.
What I see is that HTML entities cause a split after the next regular character. So
"&quot;ABC&quot;&gt;DE"
will return the tokens "\"", "A", "BC", "\"", ">", "D", "E".
I expected one character token
"\"ABC\">DE"
instead.
I wrote a program (with ChatGPT, as I am not very familiar with Rust) to demonstrate. If there is an error in there, please let me know.
gives the output
If you need any more information, please let me know.