CDATA element fails to parse when element contains £ symbol #122

Open
tomtaylor opened this issue Sep 18, 2023 · 6 comments


We have an XML file which is failing to parse since we switched from calling File.stream!(path, [:compressed, :trim_bom]) to File.stream!(path, [:compressed, :trim_bom], 32_768). It throws the following error:

{:error, %Saxy.ParseError{reason: {:token, :"]]"}, binary: <<10, 60, 100, 101, 115, 99, 114, 105, 112, 116, 105, 111, 110, 62, 60, 33, 91, 67, 68, 65, 84, 65, 91, 60, 112, 62, 60, 115, 116, 114, 111, 110, 103, 62, 83, 65, 80, 32, 124, 32, 46, 78, 69, 84, 32, 124, ...>>, position: 92}}

The file is littered with empty CDATA elements. I wonder if one of those is aligning with the start/end of a buffer? I can provide the full XML file if useful - it's 114MB and I'd prefer not to provide it publicly.

tomtaylor commented Nov 20, 2023

I've managed to pull together a minimal Elixir script and sample XML that reproduces this error: https://gist.github.com/tomtaylor/2220e932140611e44318921040be18fe

I don't think it's specifically related to streaming, but it does seem to be about a chunk aligning with a CDATA tag.

The test case uses Saxy.Partial and throws the following error:

{:error, %Saxy.ParseError{reason: {:token, :"]]"}, binary: <<10, 60, 115, 97, 108, 97, 114, 121, 84, 111, 62, 60, 33, 91, 67, 68, 65, 84, 65, 91, 194>>, position: 20}}

I believe the file is valid. I've run xmllint --valid --noout sample.xml against it and it looks fine, apart from the missing DTD, which I don't think Saxy cares about.

Let me know if I can give you any more information. Thanks!

tomtaylor commented Nov 21, 2023

Looking again with fresh eyes this morning, I can see that it fails when the chunk passed to Saxy.Partial.parse doesn't contain the full CDATA closing delimiter (]]>), only its first character (]). In my example, it received the following chunks:

  • ed><![CD
  • ATA[2023
  • -11-15 0
  • 2:16:59] <- this blows up

The CDATA delimiters might be among the few multi-character tokens in XML, so I imagine the streaming parser is getting tripped up when it sees only part of one. Does that seem plausible?
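The chunk boundaries listed above can be reproduced without Saxy. The following is a minimal sketch: ChunkDemo and chunk_bytes/2 are hypothetical names, and the sample string is an assumption reconstructed from the chunks in this comment (the trailing "]>" is guessed to complete the CDATA close). It splits a binary into fixed 8-byte pieces with no regard for token or character boundaries, which is what a byte-sized stream does.

```elixir
defmodule ChunkDemo do
  # Hypothetical helper: split a binary into chunks of at most `size` bytes,
  # ignoring token and character boundaries, as a byte-based stream would.
  def chunk_bytes(binary, size) when is_binary(binary) and size > 0 do
    do_chunk(binary, size, [])
  end

  defp do_chunk(<<>>, _size, acc), do: Enum.reverse(acc)

  defp do_chunk(binary, size, acc) when byte_size(binary) <= size do
    Enum.reverse([binary | acc])
  end

  defp do_chunk(binary, size, acc) do
    <<chunk::binary-size(size), rest::binary>> = binary
    do_chunk(rest, size, [chunk | acc])
  end
end

# Assumed sample input, rebuilt from the chunks quoted in the comment.
sample = "ed><![CDATA[2023-11-15 02:16:59]]>"
chunks = ChunkDemo.chunk_bytes(sample, 8)
IO.inspect(chunks)
```

The fourth chunk ends in a single "]", so the closing delimiter is split across two calls to the parser.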

@tomtaylor

@qcam any thoughts on this? You should have a full reproducible example linked above, but let me know if I can provide any more context.

tomtaylor commented Jul 22, 2024

Hi again @qcam - is there anything else we can do to help with this issue? I've poked around the code base to see if there's an obvious place to fix, but it's eluding me. There's a minimal reproducible example in the post above: https://gist.github.com/tomtaylor/2220e932140611e44318921040be18fe

tomtaylor changed the title from "CDATA element fails to parse when file is streamed as fixed bytes, not lines" to "CDATA element fails to parse when partial ]]> token is available in chunk" on Jul 22, 2024
tomtaylor changed the title from "CDATA element fails to parse when partial ]]> token is available in chunk" to "CDATA element fails to parse when ]]> token is cut off by chunk" on Jul 22, 2024

tomtaylor commented Aug 4, 2024

I've made a bit of progress on this. It fails when Saxy.Parser.Builder.element_cdata receives the byte 194 (0xC2), which is the first byte of a UTF-8 encoded non-breaking space (<<0xC2, 0xA0>>). On its own, this byte is matched neither by the is_ascii guard nor by <<codepoint::utf8>>.

e.g.

buffer = <<194>>

case buffer do
  <<codepoint::utf8>> <> rest -> dbg(codepoint)
end

This will raise a CaseClauseError, because a lone 0xC2 byte is an incomplete UTF-8 sequence and so matches no clause.
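To tell a truncated sequence apart from genuinely invalid bytes, one option is Erlang's :unicode.characters_to_binary/1, which reports an incomplete tail separately from an error. This is only a sketch for exploring the behaviour: Utf8Probe and classify/1 are hypothetical names and are not part of Saxy.

```elixir
defmodule Utf8Probe do
  # Hypothetical helper: describe what the start of a binary looks like
  # from a UTF-8 point of view.
  def classify(<<>>), do: :empty

  # A complete codepoint at the front matches the ::utf8 segment.
  def classify(<<codepoint::utf8, _rest::binary>>), do: {:codepoint, codepoint}

  def classify(binary) when is_binary(binary) do
    case :unicode.characters_to_binary(binary) do
      # The data ends part-way through a multi-byte character.
      {:incomplete, _converted, _rest} -> :incomplete
      # Anything else here is not decodable as UTF-8.
      _other -> :invalid
    end
  end
end

IO.inspect(Utf8Probe.classify(<<194>>))       # lone lead byte 0xC2
IO.inspect(Utf8Probe.classify(<<194, 163>>))  # 0xC2 0xA3
IO.inspect(Utf8Probe.classify(<<255>>))       # never valid in UTF-8
```

The first call reports :incomplete rather than :invalid, which suggests the byte needs more input rather than rejection.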

tomtaylor changed the title from "CDATA element fails to parse when ]]> token is cut off by chunk" to "CDATA element fails to parse when element contains non-breaking space" on Aug 4, 2024
tomtaylor changed the title from "CDATA element fails to parse when element contains non-breaking space" to "CDATA element fails to parse when element contains £ symbol" on Aug 5, 2024
@tomtaylor

OK, I think I fully understand this now. The £ symbol is encoded in UTF-8 as <<0xC2, 0xA3>>. When parsing the file, if a chunk of data cuts off right after 0xC2, the parser chokes on it, because that trailing byte is neither an ASCII character (below 128) nor a complete UTF-8 codepoint. I've added a PR in #133 which I think fixes this.
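Until a fix lands, a caller-side workaround could be to re-chunk the stream so that no chunk ends in the middle of a character. The sketch below is an assumption about how that might look and is not the change made in #133: Utf8Chunks and realign/1 are hypothetical names. It holds back any incomplete trailing bytes and prepends them to the next chunk, using :unicode.characters_to_binary/1 to find the split point.

```elixir
defmodule Utf8Chunks do
  # Hypothetical helper, not part of Saxy: re-chunk an enumerable of binaries
  # so that no emitted chunk ends in the middle of a UTF-8 sequence.
  def realign(chunks) do
    chunks
    # Append a sentinel so the reducer can check for leftovers at the end.
    |> Stream.concat([:eof])
    |> Stream.transform(<<>>, fn
      :eof, <<>> ->
        {[], <<>>}

      :eof, pending ->
        raise ArgumentError, "input ended mid-character: #{inspect(pending)}"

      chunk, pending ->
        split_complete(pending <> chunk)
    end)
  end

  # Returns {chunks_to_emit, bytes_to_hold_back}.
  defp split_complete(data) do
    case :unicode.characters_to_binary(data) do
      complete when is_binary(complete) -> {emit(complete), <<>>}
      {:incomplete, complete, rest} -> {emit(complete), rest}
      {:error, _complete, _rest} -> raise ArgumentError, "invalid UTF-8 in input"
    end
  end

  # Avoid emitting empty chunks.
  defp emit(<<>>), do: []
  defp emit(binary), do: [binary]
end

# "a£b" cut between the two bytes of "£".
realigned = [<<97, 194>>, <<163, 98>>] |> Utf8Chunks.realign() |> Enum.to_list()
IO.inspect(realigned)
```

A stream such as File.stream!(path, [:compressed, :trim_bom], 32_768) could be piped through Utf8Chunks.realign/1 before reaching the parser. This assumes the file is UTF-8 encoded, and it validates every chunk, which adds some overhead on a 114MB file.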
