Error-prone HTTP-header parsing in ARCRecord #41
I assume you mean 'wrapping' in the sense of putting it into a type of InputStream that prevents reading beyond the expected length? Are there other 'bad data' cases where the information you'd use to make this judgement would be wrong?
Yes, by wrapping I mean forcibly limiting the number of bytes that can be read. As far as I can see, the unprotected stream is only used in header parsing and HTTP-response parsing. We cannot avoid doing the header-parse unprotected, as that provides us with the length.
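The kind of wrapper being discussed can be sketched as a `FilterInputStream` that reports EOF once the declared record length has been consumed. This is only an illustrative sketch, not code from the actual project; the class name and fields are made up here.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Illustrative sketch: an InputStream that reports EOF after `limit`
 * bytes have been read, so a confused parser cannot consume bytes
 * belonging to the next ARC record.
 */
class BoundedInputStream extends FilterInputStream {
    private long remaining;

    BoundedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;   // hard EOF at the record boundary
        int b = in.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        int n = in.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }
}
```

A commons-io `BoundedInputStream` or the JWAT payload wrapper mentioned below serve the same purpose; the point is that all parsing after the length is known happens behind such a bound.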
Actually, the IA WARC readers used in NAS use the JWAT HttpPayload parser/wrapper for exactly that reason.
We tested 95 ARC-files with known problems against the code in issue #40, and a single one failed at the record:
The lines above were immediately followed by the header of the next record. So no content, although the HTTP-header promised 10 bytes. However, the size of the full record (the 5 HTTP-header-lines) was exactly 111 bytes, as stated in the record header. So I guess the ARC-record is technically correct, and the server just returned 0 bytes as content? Without having checked properly, I would guess that the readHttpParser reads through the 5 HTTP-headers and into the following record, as there is no terminating blank line to signal the end of the HTTP-headers. Guarding against that would imply a limit on the bytes read in the readHttpParser, and then we're back to the whole stream hardening. I could take a stab at this, but I am afraid that it won't be a trivial patch, as there would have to be multiple changes to avoid double-wrapping.
ARCRecord encapsulates streaming of record content, hardening against parsing mistakes. Unfortunately HTTP-headers are processed on the raw ARC-stream, allowing parsing of problematic headers to stream into other records, as demonstrated in issue #40. This results in aborted iteration of ARC records.
A more conservative approach would be to wrap the record block (HTTP-headers + content) as soon as possible: right after the ARC-entry-header-line has been parsed. Extraction of HTTP-headers would then take place within the hardened stream, isolating any parsing or data problems to the current record.
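To illustrate why this isolates the damage: a header parser that reads lines until a blank line will, on a record whose headers lack that terminating blank line, simply hit EOF at the record boundary instead of running into the next record. The sketch below is hypothetical code, not from ARCRecord; the bound is simulated here by constructing the stream over only the declared record length, where in practice a limiting wrapper would enforce it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch: read HTTP header lines from a stream that has
 * already been limited to the record length from the ARC entry header.
 * EOF is treated like the terminating blank line, so malformed headers
 * cannot consume bytes from the following record.
 */
class HeaderSketch {
    static List<String> readHeaders(InputStream in) throws IOException {
        List<String> headers = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {      // -1 = record boundary
            if (b == '\n') {
                String s = line.toString().replace("\r", "");
                line.setLength(0);
                if (s.isEmpty()) break;      // blank line ends the headers
                headers.add(s);
            } else {
                line.append((char) b);
            }
        }
        return headers;
    }
}
```

With this shape, the failing record from the test above would yield its 5 header lines and an empty body, and iteration would continue at the next record instead of aborting.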