CDATA element fails to parse when element contains £ symbol #122
Comments
I've managed to pull together a minimal Elixir script and sample XML that reproduces this error: https://gist.github.com/tomtaylor/2220e932140611e44318921040be18fe

I don't think it's specifically related to streaming, but it does seem to be about a chunk aligning with a CDATA tag. The test case uses

{:error, %Saxy.ParseError{reason: {:token, :"]]"}, binary: <<10, 60, 115, 97, 108, 97, 114, 121, 84, 111, 62, 60, 33, 91, 67, 68, 65, 84, 65, 91, 194>>, position: 20}}

I believe the file is valid. I've run

Let me know if I can give you any more information. Thanks!
Looking again with fresh eyes this morning I can see that it fails when the chunk passed to
CDATA sections might be one of the few constructs in XML that are delimited by a multi-character token, so I imagine the streaming parser is getting tripped up when it only sees part of a token. Does that seem plausible?
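To make the hypothesis above concrete, here is a minimal sketch of how a fixed-size chunk boundary can land inside the `]]>` token. `ChunkDemo` and the sample XML are made up for illustration and are not part of Saxy; the chunk size of 16 is chosen only so that the boundary falls inside the token.

```elixir
defmodule ChunkDemo do
  # Hypothetical helper: splits a binary into chunks of at most `size` bytes,
  # roughly what a byte-based File.stream!/3 hands to the parser.
  def chunks(binary, size) when byte_size(binary) <= size, do: [binary]

  def chunks(binary, size) do
    <<head::binary-size(size), rest::binary>> = binary
    [head | chunks(rest, size)]
  end
end

xml = "<a><![CDATA[hi]]></a>"
chunks = ChunkDemo.chunks(xml, 16)

# The first chunk ends with "]]" and the second starts with ">",
# so neither chunk contains the complete "]]>" token on its own.
IO.inspect(chunks)
IO.inspect(Enum.any?(chunks, &String.contains?(&1, "]]>")))
```

Whether this is what actually triggers the error is a separate question; the sketch only shows that such a split is possible.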
@qcam any thoughts on this? You should have a full reproducible example linked above, but let me know if I can provide any more context.
Hi again @qcam - is there anything else we can do to help with this issue? I've poked around the code base to see if there's an obvious place to fix, but it's eluding me. There's a minimal reproducible example in the post above: https://gist.github.com/tomtaylor/2220e932140611e44318921040be18fe
]]> token is available in chunk
]]> token is cut off by chunk
I've made a bit of progress on this. This is failing when e.g.
buffer = <<194>>
case buffer do
  <<codepoint::utf8>> <> rest -> dbg(codepoint)
end
This will throw a
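A small sketch of why the lone byte is a problem: `<<194>>` is only the first byte of a two-byte UTF-8 sequence, so the `utf8` segment cannot match it, and a `case` with no matching clause raises `CaseClauseError`. `Utf8Probe` is a hypothetical helper written for this illustration, not Saxy code; it adds a fallback clause so both outcomes can be compared.

```elixir
defmodule Utf8Probe do
  # Hypothetical helper, not part of Saxy: tries to read one complete
  # UTF-8 codepoint from the head of a buffer.
  def first_codepoint(buffer) do
    case buffer do
      <<codepoint::utf8, rest::binary>> -> {:ok, codepoint, rest}
      # Reached for an empty buffer, a truncated sequence, or invalid bytes.
      _ -> :no_complete_codepoint
    end
  end
end

# A lone lead byte does not match the utf8 segment.
IO.inspect(Utf8Probe.first_codepoint(<<194>>))

# With the continuation byte present, the codepoint decodes.
IO.inspect(Utf8Probe.first_codepoint(<<194, 163>>))
```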
]]> token is cut off by chunk
OK, I think I fully understand this now. The £ symbol is encoded as
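For reference, in UTF-8 the £ character (U+00A3) is the two bytes `<<194, 163>>`, so a byte-sized chunk boundary can fall between them and leave `<<194>>` at the end of one chunk. As a possible caller-side workaround (not the fix applied in Saxy, and not something the library provides), the sketch below re-chunks a stream so that no chunk ends in the middle of a UTF-8 sequence. `Utf8Chunks` and its functions are hypothetical names, and the code assumes the input is valid UTF-8 overall.

```elixir
defmodule Utf8Chunks do
  # Hypothetical helper: carries an incomplete trailing UTF-8 sequence over
  # to the next chunk so every emitted chunk ends on a codepoint boundary.
  def realign(stream) do
    Stream.chunk_while(
      stream,
      <<>>,
      fn chunk, carry ->
        case split_incomplete_tail(carry <> chunk) do
          {<<>>, rest} -> {:cont, rest}
          {complete, rest} -> {:cont, complete, rest}
        end
      end,
      fn
        <<>> -> {:cont, <<>>}
        # Flush whatever is left when the input ends.
        carry -> {:cont, carry, <<>>}
      end
    )
  end

  # Splits data into {complete_prefix, incomplete_tail}.
  def split_incomplete_tail(data) do
    size = byte_size(data)
    tail = incomplete_tail_size(data, size, 1)
    {binary_part(data, 0, size - tail), binary_part(data, size - tail, tail)}
  end

  # Walks back at most 3 bytes from the end looking for a lead byte.
  defp incomplete_tail_size(_data, size, k) when k > size or k > 3, do: 0

  defp incomplete_tail_size(data, size, k) do
    case :binary.at(data, size - k) do
      # Continuation byte (10xxxxxx): keep walking back.
      byte when byte >= 0x80 and byte < 0xC0 -> incomplete_tail_size(data, size, k + 1)
      # Lead byte: incomplete if the sequence needs more than the k bytes present.
      byte when byte >= 0xC0 -> if sequence_length(byte) > k, do: k, else: 0
      # ASCII byte: the data already ends on a boundary.
      _ -> 0
    end
  end

  defp sequence_length(byte) when byte >= 0xF0, do: 4
  defp sequence_length(byte) when byte >= 0xE0, do: 3
  defp sequence_length(_byte), do: 2
end

# "a£b" split between the two bytes of £.
IO.inspect(Enum.to_list(Utf8Chunks.realign([<<"a", 194>>, <<163, "b">>])))
```

Under these assumptions, the realigned stream could be piped between `File.stream!` and the parser. It only addresses split codepoints; it does nothing about a multi-character token such as `]]>` being split across chunks.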
We have an XML file which is failing to parse since we switched from calling
File.stream!(path, [:compressed, :trim_bom])
to
File.stream!(path, [:compressed, :trim_bom], 32_768)
It throws the following error:

The file is littered with empty CDATA elements. I wonder if one of those is aligning with the start/end of a buffer? I can provide the full XML file if useful - it's 114MB and I'd prefer not to provide it publicly.