warcio check does not raise error when GZip records are truncated #138
Comments
This came up recently in IIPC Slack when trying to diagnose why warchaeology was reporting a corrupted WARC file and warcio was not. It appeared that the WARC file was truncated as a result of a browsertrix-crawler container exiting abnormally and not closing the GZIP file properly. In case it's helpful, here is a test script (which doesn't emit a warning that I can see):

from warcio.archiveiterator import ArchiveIterator

with open('test.warc.gz', 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        if record.rec_type == 'response':
            content = record.content_stream().read()

And here's a test file: test.warc.gz
Wow, I'd totally forgotten about this! It seems there's a hook in the underlying Python library to spot this case: https://docs.python.org/3/library/zlib.html#zlib.Decompress.eof
But it's not clear to me how to weave that in here: warcio/warcio/archiveiterator.py, lines 108 to 140 in aa702cb
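For reference, here is a minimal sketch of how that eof attribute behaves on its own, outside of warcio. It assumes a single gzip member held in memory; the helper name gzip_member_is_complete is hypothetical and not part of warcio or zlib.

```python
import gzip
import zlib


def gzip_member_is_complete(data: bytes) -> bool:
    """Return True if `data` starts with one complete gzip member."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        # Truncated input does not raise here; zlib just returns what it can.
        decompressor.decompress(data)
    except zlib.error:
        # Corrupt (as opposed to merely truncated) input.
        return False
    # eof only becomes True once the end-of-stream marker has been seen.
    return decompressor.eof


# Hypothetical sample data standing in for one compressed WARC record.
complete = gzip.compress(b"WARC/1.0\r\n" * 100)
truncated = complete[: len(complete) // 2]

print(gzip_member_is_complete(complete))
print(gzip_member_is_complete(truncated))
```

The point of the sketch is that truncation is silent at the zlib level: no exception is raised, so a caller has to inspect eof explicitly after the input runs out.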
@edsu which record in test.warc.gz is the truncated one? And where can I find warchaeology? Thanks.
I believe it's the last record. If you try to gunzip the file, you should see the error right at the end. I'm not really familiar with it, but here is the warchaeology repo: https://github.com/nlnwa/warchaeology
@edsu thanks for adding a simple test, and @anjackson for looking that up. With that, I think detecting this case can be done as follows:

diff --git a/warcio/archiveiterator.py b/warcio/archiveiterator.py
index 484b7f0..451f182 100644
--- a/warcio/archiveiterator.py
+++ b/warcio/archiveiterator.py
@@ -113,7 +113,13 @@ class ArchiveIterator(six.Iterator):
                 yield self.record
-            except EOFError:
+            except EOFError as e:
+                if self.reader.decompressor:
+                    if not self.reader.decompressor.eof:
+                        sys.stderr.write("warning: final record appears to be truncated")
+
                 empty_record = True
             self.read_to_end()

But what should the desired behavior be more generally?
It sort of depends on how the WARC is being used:
One of the most likely problems we see is failed transfers leading to truncated WARC.GZ files. We can spot this with gunzip -t, but it would be good if warcio check also raised this as a validation error. My tests so far indicate that warcio, cdxj-indexer, and related tools all skip over these errors silently.
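As a stopgap until warcio check reports this, a check along the lines of gunzip -t can be sketched with the standard library alone. This is only an illustration: the function name gzip_file_is_intact and the sample file names are hypothetical, and the check validates gzip framing only, not WARC record structure.

```python
import gzip
import os
import tempfile
import zlib


def gzip_file_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Decompress `path` end to end, discarding the output."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read in chunks so large WARC files are not held in memory.
            while stream.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended before the end-of-stream marker.
        # OSError (including gzip.BadGzipFile) and zlib.error: corrupt data.
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.warc.gz")
    bad = os.path.join(tmp, "truncated.warc.gz")
    # Two concatenated gzip members, mimicking per-record compression.
    payload = gzip.compress(b"record one\n") + gzip.compress(b"record two\n")
    with open(good, "wb") as f:
        f.write(payload)
    with open(bad, "wb") as f:
        f.write(payload[:-5])  # simulate a failed transfer
    good_ok = gzip_file_is_intact(good)
    bad_ok = gzip_file_is_intact(bad)

print(good_ok, bad_ok)
```

Because the gzip module reads multi-member files transparently, the truncated final member surfaces as an EOFError even when every earlier record decompresses cleanly.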