How to find images not included by page.get_text('rawdict')? #4370
Replies: 1 comment
-
I missed this part in the
Passing this option, and manually clipping the block bounding box to the page bounding box, solved my problem.
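The manual clipping step mentioned above can be sketched in plain Python. This is only an illustration: the helper name `clip_bbox_to_page` and the sample coordinates are hypothetical, and the option referred to in the reply is not shown here because it is not named in this thread. Bounding boxes are assumed to be `(x0, y0, x1, y1)` tuples.

```python
def clip_bbox_to_page(bbox, page_rect):
    """Return the part of bbox that lies inside page_rect.

    Both arguments are (x0, y0, x1, y1) tuples. Returns None when the
    two rectangles do not overlap at all.
    """
    x0 = max(bbox[0], page_rect[0])
    y0 = max(bbox[1], page_rect[1])
    x1 = min(bbox[2], page_rect[2])
    y1 = min(bbox[3], page_rect[3])
    if x0 >= x1 or y0 >= y1:
        return None  # empty intersection
    return (x0, y0, x1, y1)


# Hypothetical example: a scanned image that overshoots the page by a few points.
page_rect = (0, 0, 595, 842)
image_bbox = (-2.5, -1.0, 597.0, 843.5)
clipped = clip_bbox_to_page(image_bbox, page_rect)
```

With PyMuPDF objects, the same intersection can likely be written as `pymupdf.Rect(block["bbox"]) & page.rect`, assuming `block` is one entry of the `"blocks"` list.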
-
I have some scanned PDFs where each page's content is contained in an image, but the image doesn't get reported as a `type=1` image block when calling `page.get_text('rawdict')`. This usually happens when the image exceeds the page bounds by a few pixels, possibly due to bad scanner / doc-management software. It looks like this is by design, based on the notes here and here.

The images missed by `page.get_text()` do get reported by `page.get_image_info()`. But the dicts from both calls differ, so a direct comparison may not be reliable (no hash, etc.).
Image details from `page.get_text('rawdict')`:

Image details from `page.get_image_info()`:

Question: Is there a reliable way to check if the images reported by `page.get_image_info()` were already included by `page.get_text()`? Or, is there a better way to get images not reported by `page.get_text('rawdict')`?
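One possible way to approach the comparison asked about here, offered as a rough sketch rather than a confirmed method, is to match entries by bounding box with a small tolerance. It assumes that both the image blocks and the `page.get_image_info()` entries carry a `"bbox"` key with `(x0, y0, x1, y1)` values, and that a block and an info entry for the same image have nearly identical boxes once both are clipped to the page. The function name, the tolerance, and the sample data are hypothetical.

```python
def _clip(bbox, rect):
    # Limit bbox to rect; both are (x0, y0, x1, y1) tuples.
    return (max(bbox[0], rect[0]), max(bbox[1], rect[1]),
            min(bbox[2], rect[2]), min(bbox[3], rect[3]))


def _same_bbox(a, b, tol):
    # True when every coordinate differs by at most tol.
    return all(abs(p - q) <= tol for p, q in zip(a, b))


def images_missing_from_blocks(image_infos, blocks, page_rect, tol=1.0):
    """Return the image_infos entries with no matching type=1 block."""
    block_boxes = [_clip(b["bbox"], page_rect)
                   for b in blocks if b.get("type") == 1]
    missing = []
    for info in image_infos:
        box = _clip(info["bbox"], page_rect)
        if not any(_same_bbox(box, bb, tol) for bb in block_boxes):
            missing.append(info)
    return missing


# Hypothetical data shaped like the two call results.
page_rect = (0, 0, 595, 842)
blocks = [
    {"type": 0, "bbox": (50, 50, 200, 70)},    # text block, ignored
    {"type": 1, "bbox": (10, 10, 110, 110)},   # image block that was reported
]
image_infos = [
    {"number": 0, "bbox": (10, 10, 110, 110)},  # same image as the block above
    {"number": 1, "bbox": (-2, -1, 597, 843)},  # exceeds the page, not in blocks
]
missing = images_missing_from_blocks(image_infos, blocks, page_rect)
```

In a real script the inputs would presumably come from `page.get_text('rawdict')["blocks"]`, `page.get_image_info()`, and `tuple(page.rect)`. Two different images placed at the same position would be indistinguishable under this check.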