Error-prone HTTP-header parsing in ARCRecord #41

Open
tokee opened this issue Feb 18, 2015 · 4 comments

tokee commented Feb 18, 2015

ARCRecord encapsulates streaming of record content, hardening against parsing mistakes. Unfortunately, HTTP headers are processed on the raw ARC stream, which allows the parsing of a problematic header to read into other records, as demonstrated in issue #40. This results in aborted iteration of ARC records.

A more conservative approach would be to wrap the record block (HTTP headers + content) as soon as possible, i.e. right after the ARC entry header line has been parsed. Extraction of HTTP headers would then take place within the hardened stream, isolating any parsing or data problems to the current record.
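To make concrete what I mean by "wrapping", here is a minimal sketch of a length-capped stream (a hypothetical LimitedInputStream, not the existing ARCRecord code; the real change would have to hook into ARCReader/ARCRecord):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical sketch: caps the number of bytes that can be read from the
 * underlying ARC stream to the declared record length, so a broken
 * HTTP-header parse cannot run into the next record.
 */
public class LimitedInputStream extends FilterInputStream {
    private long remaining;

    public LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1;                       // pretend EOF at the record boundary
        }
        int b = super.read();
        if (b != -1) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n != -1) {
            remaining -= n;
        }
        return n;
    }

    // Note: close() would also need overriding so that closing a record
    // does not close the underlying ARC stream.
}
```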

@anjackson
Member

I assume you mean 'wrapping' in the sense of putting it into a type of InputStream that limits reading beyond the expected length? Are there other 'bad data' cases where the information you'd use to make this judgement would be wrong?

tokee commented Feb 18, 2015

Yes, by wrapping I mean forcibly limiting the amount of bytes that can be read.

As far as I can see, the unprotected stream is only used in header parsing and HTTP-response parsing. We cannot avoid doing the header parse unprotected, as that provides us with the length.
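Roughly the order of operations I have in mind, reusing the hypothetical LimitedInputStream from my sketch above (the header-line parsing here is simplified and is not the actual ARCReader code):

```java
import java.io.IOException;
import java.io.InputStream;

public class ArcRecordWrapSketch {

    /**
     * Hypothetical flow: parse the ARC entry header line on the raw stream
     * (unavoidable, since it carries the length), then hand back a stream
     * that is hard-limited to the declared block.
     */
    public static InputStream openRecordBlock(InputStream rawStream) throws IOException {
        String headerLine = readLine(rawStream);
        String[] fields = headerLine.trim().split(" ");
        long blockLength = Long.parseLong(fields[fields.length - 1]);
        // Everything after this point (HTTP headers + payload) is read through
        // the limited stream, so a broken parse can at worst exhaust this
        // record, never leak into the next one.
        return new LimitedInputStream(rawStream, blockLength);
    }

    /** Reads a single LF-terminated line, byte by byte, so we never over-read. */
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            sb.append((char) b);
        }
        return sb.toString();
    }
}
```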

@nclarkekb
Contributor

Actually, the IA WARC readers used in NAS use the JWAT HttpPayload parser/wrapper for exactly that reason.

tokee commented Feb 18, 2015

We tested 95 ARC files with known problems against the code in issue #40, and a single one failed, at this record:

http://192.168.16.2/robots.txt 192.168.16.2 20111211155955 text/html 111
HTTP/1.1 404 Not found
Server: eHTTP v2.0
Connection: Keep-Alive
Content-Type: text/html
Content-Length:  10

The lines above were immediately followed by the header of the next record. So no content, although the HTTP header promised 10 bytes. However, the size of the full record (the 5 HTTP-header lines) was exactly 111 bytes, as stated in the record header. So I guess the ARC record is technically correct, and the server just returned 0 bytes as content?

Without having checked properly, I would guess that readHttpParser reads through the 5 HTTP headers and into the following record, as there is no terminating blank line to signal the end of the HTTP headers. Guarding against that would imply a limit on the bytes read in readHttpParser, and then we're back to the whole question of stream hardening.
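For illustration, a readHttpParser-style loop with the kind of byte limit I mean could look like the sketch below (hypothetical code, not the current implementation):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BoundedHttpHeaderReader {

    /**
     * Hypothetical sketch of a header-read loop with a hard byte limit:
     * it stops at a blank line, at EOF, or when 'limit' bytes have been
     * consumed, so a missing terminator (as in the record above) cannot
     * make it read into the following ARC record.
     */
    public static List<String> readHeaders(InputStream in, long limit) throws IOException {
        List<String> headers = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        long consumed = 0;
        int b;
        while (consumed < limit && (b = in.read()) != -1) {
            consumed++;
            if (b == '\r') {
                continue;                       // tolerate CRLF and bare LF alike
            }
            if (b == '\n') {
                if (line.length() == 0) {
                    return headers;             // blank line: end of headers
                }
                headers.add(line.toString());
                line.setLength(0);
            } else {
                line.append((char) b);
            }
        }
        if (line.length() > 0) {
            headers.add(line.toString());       // headers truncated by limit or EOF
        }
        return headers;
    }

    public static void main(String[] args) throws IOException {
        String block = "HTTP/1.1 404 Not found\nServer: eHTTP v2.0\nConnection: Keep-Alive\n"
                + "Content-Type: text/html\nContent-Length:  10\n";
        // The declared ARC record length (111) caps the read, even though the
        // headers are never terminated by a blank line.
        System.out.println(readHeaders(
                new ByteArrayInputStream(block.getBytes(StandardCharsets.ISO_8859_1)), 111));
    }
}
```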

I could take a stab at this, but I am afraid that it won't be a trivial patch as there would have to be multiple changes to avoid double-wrapping.
