TypeError leads to an unhandled Exception #70

SebastianDeiss · 2018-03-05T09:54:09Z

peepdf crashes with a TypeError if some PDFs are analyzed in force parsing mode and PDFObjectStream.resolveReferences() is invoked.

Traceback (most recent call last):
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/main.py", line 409, in main
    ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7098, in parse
    ret = body.updateObjects()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 4288, in updateObjects
    object.resolveReferences()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 3253, in resolveReferences
    ret = PDFParser.readObject(objectsSection[offset:])
TypeError: slice indices must be integers or None or have an __index__ method
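
The slice failure can be reproduced in isolation. The sketch below assumes, as a later commit message in this thread states, that the offsets come from re.findall, which returns strings. The variable names and the sample stream content are invented for illustration and are not taken from PDFCore.py.

```python
import re

# Hypothetical decoded object stream: "id offset" pairs, then the objects.
offsets_section = "11 0 12 4"
objects_section = "123 (abc)"

# re.findall returns a list of str, so each "offset" is text, not a number.
numbers = re.findall(r"\d+", offsets_section)


def slice_at(section, offset):
    """Return the part of section starting at offset."""
    return section[offset:]


try:
    slice_at(objects_section, numbers[3])  # str used as a slice index
    raised = False
except TypeError:
    raised = True

# Converting the offset to int first makes the slice valid.
fixed = slice_at(objects_section, int(numbers[3]))
```

Only the int() conversion is the point here; how peepdf splits the offsets section from the objects section is not shown.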

If I fix that TypeError by converting offset at PDFCore.py:3243 to an int object I get another one:

Traceback (most recent call last):
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/main.py", line 409, in main
    ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7098, in parse
    ret = body.updateObjects()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 4288, in updateObjects
    object.resolveReferences()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 3253, in resolveReferences
    ret = PDFParser.readObject(objectsSection[offset:])
TypeError: unbound method readObject() must be called with PDFParser instance as first argument (got str instance instead)

A possible solution would be to supply the PDFParser object to PDFObjectStream when creating that instance and then provide the supplied PDFParser instance for readObject().
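
A minimal sketch of that idea, using stand-in classes rather than the real peepdf ones: ToyParser and ToyObjectStream are hypothetical, and only the calling convention mirrors the problem. Under Python 3 the class-level call fails with a different TypeError message than the Python 2.7 one shown above, but it still fails.

```python
class ToyParser(object):
    """Stand-in for PDFParser; only the calling convention matters here."""

    def readObject(self, content):
        # Pretend to parse: return the first whitespace-delimited token.
        return content.split()[0]


class ToyObjectStream(object):
    """Stand-in for PDFObjectStream with an optional injected parser."""

    def __init__(self, objects_section, parser=None):
        self.objects_section = objects_section
        # Fall back to a fresh instance if no parser was supplied.
        self.parser = parser if parser is not None else ToyParser()

    def resolve(self, offset):
        # Call readObject() on an instance, never on the class itself.
        return self.parser.readObject(self.objects_section[int(offset):])


try:
    ToyParser.readObject("123 (abc)")  # class-level call, as in the traceback
    class_call_failed = False
except TypeError:
    class_call_failed = True

stream = ToyObjectStream("123 (abc)", parser=ToyParser())
token = stream.resolve("4")
```

Making the parameter optional keeps existing callers working, since a stream created without a parser still gets a usable instance.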


A new optional parameter 'parser' was introduced for 'PDFObjectStream', which takes a 'PDFParser' object. This object is then used in 'resolveReferences()' to invoke 'PDFParser.readObject()'. This commit fixes jesparza#70.

SebastianDeiss · 2018-04-09T09:15:40Z

@jesparza I could submit a pull for this issue like hatching#6, which is based on your master.

Jack28 · 2020-02-13T14:42:06Z

google.com?q=filetype:pdf https://en.fh-westkueste.de/students/his/
These files created by HIS also produce an error.
Could it be related?

For unclear reasons, PDFObjectStream.update() delays decoding of the modified raw stream until all references can be resolved. However, it then always goes on to try to extract objects from the still-empty decoded stream. This produces an error from the peepdf CLI:

  $ peepdf image.php
  Error: An error has occurred while parsing an indirect object!!

The error from PDFObjectStream.update() is "Missing offsets in object stream" because self.decodedStream is still empty at that point, which leaves offsetsSection and eventually the numbers list empty and causes the abort. This is triggered by /Length being a reference, which sets updateNeeded to True.

Sample: https://www.infotek.co.jp/pdflib/demo/sample/image.php

Relevant PDF structure:

  32 0 obj
  <</Length 43 0 R/Filter/FlateDecode/Type/ObjStm/N 7/First 47>>
  stream
  [...]
  endstream
  endobj
  43 0 obj
  461
  endobj

(Length in the dictionary of object 32 references object 43, which contains 461, presumably the length of the stream. peepdf does not currently seem to use that value or check it for consistency, by the way.)

This resolves the first half of jesparza#70 in that force mode is no longer necessary to parse such files at all.

Signed-off-by: Michael Weiser <[email protected]>
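
The deferral described in this commit message can be sketched roughly as follows. This is a toy model, not the peepdf code: the class, its attributes and its method names are hypothetical, and only the error string is taken from the message above.

```python
import re


class ToyObjectStream(object):
    """Toy model of an object stream whose /Length is an unresolved reference."""

    def __init__(self, update_needed, first):
        self.decoded_stream = ""            # stays empty until decoding happens
        self.update_needed = update_needed  # True while references are pending
        self.first = first                  # length of the offsets section
        self.numbers = []
        self.errors = []

    def extract(self):
        """Read the id/offset numbers from the decoded stream."""
        offsets_section = self.decoded_stream[:self.first]
        self.numbers = [int(n) for n in re.findall(r"\d+", offsets_section)]
        if not self.numbers:
            self.errors.append("Missing offsets in object stream")
            return False
        return True

    def update(self):
        """Defer extraction while decoding is still pending."""
        if self.update_needed:
            return True
        return self.extract()

    def resolve_references(self, decoded_stream):
        """Once references are resolved, decode and only then extract."""
        self.decoded_stream = decoded_stream
        self.update_needed = False
        return self.extract()


# Extracting from the still-empty stream reproduces the reported error.
eager = ToyObjectStream(update_needed=True, first=5)
eager_ok = eager.extract()

# Deferring until the references are resolved avoids it.
deferred = ToyObjectStream(update_needed=True, first=5)
update_ok = deferred.update()
resolved_ok = deferred.resolve_references("11 0 123")
```

The decoded content passed to resolve_references() is supplied by hand here; in the real code it would come from decoding the raw stream.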

With the previous change deferring reading of objects from the decoded stream until references can be resolved, it now runs into jesparza#70. This change takes a different approach to fixing it than hatching#6 by syncing it with the other locations where the identical code is in use:

1. Force the numbers extracted by re.findall to int() as before, avoiding the TypeError exception:

   Traceback (most recent call last):
     File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/main.py", line 409, in main
       ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
     File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 7117, in parse
       ret = body.updateObjects()
     File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 4291, in updateObjects
       object.resolveReferences()
     File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 3256, in resolveReferences
       ret = PDFParser.readObject(objectsSection[offset:])
   TypeError: slice indices must be integers or None or have an __index__ method

2. Instantiate a new PDFParser object by adding the missing parentheses, avoiding another TypeError because readObject is not a class method:

   Traceback (most recent call last):
     File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/main.py", line 409, in main
       ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
     File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 7118, in parse
       ret = body.updateObjects()
     File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 4292, in updateObjects
       object.resolveReferences()
     File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 3256, in resolveReferences
       ret = PDFParser.readObject(objectsSection[offset:])
   TypeError: unbound method readObject() must be called with PDFParser instance as first argument (got str instance instead)

3. Explicitly force the id to be an int() as well and append it to the list of indices, as at the other call sites of this code. This solves no issue I have run into but seems sensible to avoid other potential TypeErrors and to keep the internal bookkeeping of the object consistent.

This should conclusively resolve jesparza#70 and supersedes hatching#6.

Signed-off-by: Michael Weiser <[email protected]>
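
Taken together, the three steps amount to something like the loop below. It is a sketch under assumptions: ToyParser stands in for PDFParser, and read_stream_objects, its arguments and its return values are hypothetical names, not the actual resolveReferences() code.

```python
import re


class ToyParser(object):
    """Stand-in for PDFParser; returns the first token as the 'object'."""

    def readObject(self, content):
        return content.split()[0]


def read_stream_objects(offsets_section, objects_section):
    """Parse "id offset" pairs and read one object per pair."""
    numbers = re.findall(r"\d+", offsets_section)  # list of str
    indexes = []
    objects = {}
    for i in range(0, len(numbers) - 1, 2):
        obj_id = int(numbers[i])      # step 3: the id is an int as well
        offset = int(numbers[i + 1])  # step 1: int offset, valid slice index
        parser = ToyParser()          # step 2: an instance, note the parentheses
        objects[obj_id] = parser.readObject(objects_section[offset:])
        indexes.append(obj_id)        # step 3: keep the index list in sync
    return indexes, objects


indexes, objects = read_stream_objects("11 0 12 4", "123 (abc)")
```

Keeping ids as ints means later lookups by object number do not depend on whether a caller passes "11" or 11.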

michaelweiser · 2020-03-06T19:11:10Z

An extended fix for the TypeErrors is now over at hatching#9.

I have also seen some of those HIS-generated PDFs (which originate from Apache FOP 2.3) and they only ran into the object stream parsing problem caused by PDFParser.readUntilSymbol() resetting the buffer cursor fixed by commit 1 of that PR but not the TypeErrors. (That separate issue actually only exists in jbremer's fork.)
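
Why a cursor reset breaks sequential reads can be shown with a toy parser. The class below is hypothetical and far simpler than PDFParser; it borrows the names charCounter and readUntilSymbol from the fork only to make the connection visible, and the reset_on_call flag exists purely to switch the faulty behaviour on and off.

```python
class ToyCursorParser(object):
    """Toy parser that tracks its position in the content with a cursor."""

    def __init__(self, reset_on_call):
        self.charCounter = 0
        self.reset_on_call = reset_on_call

    def readUntilSymbol(self, content, symbol):
        if self.reset_on_call:
            # The regression: progress from earlier calls is discarded.
            self.charCounter = 0
        start = self.charCounter
        end = content.find(symbol, start)
        if end == -1:
            end = len(content)
        # Advance the cursor past the symbol for the next call.
        self.charCounter = end + len(symbol)
        return content[start:end]


content = "a b c"

buggy = ToyCursorParser(reset_on_call=True)
buggy_reads = [buggy.readUntilSymbol(content, " ") for _ in range(2)]

fixed = ToyCursorParser(reset_on_call=False)
fixed_reads = [fixed.readUntilSymbol(content, " ") for _ in range(2)]
```

With the reset in place every call re-reads the first item, so an object stream holding several objects can never be walked past its first entry.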

* Fix object stream parsing

  Commit 8cc27b6 broke object stream parsing by resetting the content cursor PDFParser.charCounter to zero on every invocation.

  Reproducer:

    $ echo -e "create pdf\ncreate object_stream\nall\nsave /tmp/foo.pdf" | \
        peepdf -i

  Without fix:

    $ peepdf -j /tmp/foo.pdf
    Error: An error has occurred while parsing an indirect object!!

  With this change, JSON output is as expected (same for other outputs):

    $ peepdf -j /tmp/foo.pdf
    { "peepdf_analysis": { [...] "version": "0.3" } } }

  Signed-off-by: Michael Weiser <[email protected]>

* Delay reading of objects until references are resolved

  For unclear reasons, PDFObjectStream.update() delays decoding of the modified raw stream until all references can be resolved. However, it then always goes on to try to extract objects from the still-empty decoded stream. This produces an error from the peepdf CLI:

    $ peepdf image.php
    Error: An error has occurred while parsing an indirect object!!

  The error from PDFObjectStream.update() is "Missing offsets in object stream" because self.decodedStream is still empty at that point, which leaves offsetsSection and eventually the numbers list empty and causes the abort. This is triggered by /Length being a reference, which sets updateNeeded to True.

  Sample: https://www.infotek.co.jp/pdflib/demo/sample/image.php

  Relevant PDF structure:

    32 0 obj
    <</Length 43 0 R/Filter/FlateDecode/Type/ObjStm/N 7/First 47>>
    stream
    [...]
    endstream
    endobj
    43 0 obj
    461
    endobj

  (Length in the dictionary of object 32 references object 43, which contains 461, presumably the length of the stream. peepdf does not currently seem to use that value or check it for consistency, by the way.)

  This resolves the first half of jesparza#70 in that force mode is no longer necessary to parse such files at all.

  Signed-off-by: Michael Weiser <[email protected]>

* Avoid TypeError on reference resolution

  With the previous change deferring reading of objects from the decoded stream until references can be resolved, it now runs into jesparza#70. This change takes a different approach to fixing it than hatching#6 by syncing it with the other locations where the identical code is in use:

  1. Force the numbers extracted by re.findall to int() as before, avoiding the TypeError exception:

     Traceback (most recent call last):
       File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/main.py", line 409, in main
         ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
       File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 7117, in parse
         ret = body.updateObjects()
       File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 4291, in updateObjects
         object.resolveReferences()
       File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 3256, in resolveReferences
         ret = PDFParser.readObject(objectsSection[offset:])
     TypeError: slice indices must be integers or None or have an __index__ method

  2. Instantiate a new PDFParser object by adding the missing parentheses, avoiding another TypeError because readObject is not a class method:

     Traceback (most recent call last):
       File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/main.py", line 409, in main
         ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
       File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 7118, in parse
         ret = body.updateObjects()
       File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 4292, in updateObjects
         object.resolveReferences()
       File "peepdf-venv2/lib64/python2.7/site-packages/peepdf/PDFCore.py", line 3256, in resolveReferences
         ret = PDFParser.readObject(objectsSection[offset:])
     TypeError: unbound method readObject() must be called with PDFParser instance as first argument (got str instance instead)

  3. Explicitly force the id to be an int() as well and append it to the list of indices, as at the other call sites of this code. This solves no issue I have run into but seems sensible to avoid other potential TypeErrors and to keep the internal bookkeeping of the object consistent.

  This should conclusively resolve jesparza#70 and supersedes hatching#6.

  Signed-off-by: Michael Weiser <[email protected]>

* Fix PDFParser.readSymbol(), if white-space characters precede the symbol

  In PDF files, the Cross-Reference Table or a Cross-Reference Stream contains byte offsets for the start of objects within the file or the uncompressed stream. Such an offset does not always point to the first byte of the initial token (see ISO 32000-2008 section 7.2.2) of the referenced object. The object may be preceded by white-space characters and comments.

  Without this commit, PDFParser.readSymbol() fails to read a symbol if the first character to be processed is a white-space character. This commit changes PDFParser.readSymbol() to skip leading white-space characters. (PDFParser.readSymbol() already skips any number of leading comments followed by white-space characters.) This enables parsing of PDF files with sloppy cross-reference offsets.

* Add the missing method PDFArray.getJSCode()

  An object of class PDFArray can contain JS code if one or more array elements contain JS code. The getter method was simply missing.

* Handle sloppy cross references more and less generically

  A previous commit adjusted readSymbol() to skip leading whitespace in order to avoid errors with sloppy cross references. This did not fix handling of literals such as numbers and booleans in readObject() because they are not accessed using readSymbol(). Also, adjusting the very low-level readSymbol() function might generate fallout. So instead, this change moves the skipping of leading whitespace into readObject() so that it affects all types of referenced objects equally but not all symbol lookups altogether.

  Signed-off-by: Michael Weiser <[email protected]>

Signed-off-by: Michael Weiser <[email protected]>
Co-authored-by: Michael Weiser <[email protected]>
Co-authored-by: Anselm Kruis <[email protected]>
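
The whitespace handling for sloppy cross-reference offsets can be sketched as below. The helper names are hypothetical and this is not the readObject() implementation; the character set is the six white-space characters the PDF specification defines (NUL, tab, line feed, form feed, carriage return, space). Comments, which the commit message says may also precede an object, are not handled in this sketch.

```python
# The six PDF white-space characters.
PDF_WHITESPACE = "\x00\t\n\x0c\r "


def skip_leading_whitespace(content, start=0):
    """Return the index of the first non-white-space character at or after start."""
    pos = start
    while pos < len(content) and content[pos] in PDF_WHITESPACE:
        pos += 1
    return pos


def read_literal(content, offset):
    """Read one token, tolerating an offset that points at leading whitespace."""
    pos = skip_leading_whitespace(content, offset)
    end = pos
    while end < len(content) and content[end] not in PDF_WHITESPACE:
        end += 1
    return content[pos:end]


# An offset that lands on whitespace before a number or a boolean still works.
number = read_literal("  \n461 endobj", 0)
boolean = read_literal("\r\ntrue", 0)
```

Doing the skip once at the object-reading level covers numbers and booleans as well as symbols, which matches the reasoning given for moving it out of readSymbol().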

SebastianDeiss linked a pull request Mar 6, 2018 that will close this issue

Fix TypeError that leads to an unhandled Exception hatching/peepdf#6

Open

jesparza self-assigned this Apr 7, 2018

jesparza added bug Priority-Medium Type-Defect labels Apr 7, 2018
