generated from langchain-ai/integration-repo-template
Fix: Improve image extraction stability in PDF parser by handling reshape errors and routing image filters #130
Open: skyking363 wants to merge 8 commits into langchain-ai:main from skyking363:main
Conversation
This commit addresses a `ValueError: cannot reshape array of size` that could occur in `PyMuPDFParser` and `PyPDFParser` when processing images embedded in PDF files.

The main causes were:
- Mismatch between image data length and dimensions (height, width, channels) declared in PDF metadata.
- Attempts to reshape data that was not a clean multiple of H*W*C.

Changes include:

For PyMuPDFParser:
- Validates `pix.samples` length against `pix.height * pix.width * pix.n`.
- Uses `pix.n` directly in reshape instead of -1 if valid (1, 3, 4).
- Adds fallback to infer channels if `pix.n` is problematic but data size is a multiple of H*W.
- Skips image with detailed warning if data is inconsistent.
- Converts numpy array to PIL Image and saves as PNG for the `images_parser`, instead of `numpy.save` as .npy.
- Handles RGBA images with fully opaque alpha by converting to RGB.
- Handles cases where `pix.n` might be 0 by attempting to infer channels.

For PyPDFParser:
- For images with filters in `_PDF_FILTER_WITHOUT_LOSS`:
  - Validates raw data length against `height * width`.
  - Infers channels from `actual_size / (height * width)`.
  - Skips image if inferred channels are not 1, 3, or 4.
  - Uses inferred channels in reshape.
- For images with filters in `_PDF_FILTER_WITH_LOSS` (e.g., JPEG):
  - Improves error handling around `Image.open()`.
- Standardizes post-processing for both filter types:
  - Converts numpy array (or PIL image from `Image.open`) to PIL Image.
  - Adjusts PIL image mode based on (inferred) channels (L, RGB, RGBA).
  - Handles RGBA images with fully opaque alpha by converting to RGB.
  - Saves image as PNG for the `images_parser`.
- Improved parsing of the `/Filter` attribute.
- Adds extensive logging for skipped or problematic images.

These changes make the PDF image extraction more robust by gracefully handling malformed or unusual image data instead of raising a ValueError, and by standardizing the image format passed to the subsequent image parser.
Fix: Handle image reshape errors in PDF parsers
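The validate-then-infer logic described in this commit can be sketched roughly as follows. This is an illustration only, not the code from the PR: the helper name `safe_reshape` and the constant `VALID_CHANNELS` are hypothetical, and the arguments stand in for `pix.samples`, `pix.height`, `pix.width`, and `pix.n`.

```python
import logging
from typing import Optional

import numpy as np

logger = logging.getLogger(__name__)

# Hypothetical constant: channel counts the commit message treats as valid.
VALID_CHANNELS = (1, 3, 4)


def safe_reshape(
    samples: bytes, height: int, width: int, n: int = 0
) -> Optional[np.ndarray]:
    """Return an (H, W, C) uint8 array, or None if the data is inconsistent."""
    pixels = height * width
    if pixels <= 0:
        logger.warning("Skipping image with invalid size %sx%s", height, width)
        return None

    size = len(samples)
    if n in VALID_CHANNELS and size == pixels * n:
        # Declared channel count matches the data length: use it directly.
        channels = n
    elif size % pixels == 0 and size // pixels in VALID_CHANNELS:
        # Fallback: the declared count is unusable (e.g. 0), so infer it
        # from the actual data size.
        channels = size // pixels
    else:
        logger.warning(
            "Skipping image: %d bytes does not fit %dx%d with n=%d",
            size, height, width, n,
        )
        return None

    return np.frombuffer(samples, dtype=np.uint8).reshape(height, width, channels)
```

Returning `None` instead of raising lets the caller skip a single malformed image and continue with the rest of the page, which appears to be the behavior the commit aims for.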
This commit fixes an error in PyPDFParser where an attempt was made to access a non-existent `.name` attribute on a `pypdf.generic.NameObject` when extracting image filter names. The correct way to get the string value of a NameObject is to convert it directly to a string (e.g., `str(name_object)`).

Changes:
- Modified `PyPDFParser.extract_images_from_page` to use `str(filter_name_obj)` to get the filter name string (e.g., "/FlateDecode").
- Ensured that the leading slash is removed from the filter name.
- Added checks for empty filter arrays and non-NameObject elements within filter arrays.

This resolves the error "AttributeError: 'NameObject' object has no attribute 'name'" and improves the robustness of image filter parsing.
Fix: Correct NameObject handling in PyPDFParser for image filters
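A minimal sketch of the filter-name normalization described above. It relies on `pypdf.generic.NameObject` being a `str` subclass and filter arrays being list-like, so the example runs on plain strings and lists without importing pypdf. The helper name `normalize_filter_names` is hypothetical and not taken from the PR.

```python
from typing import Any, List


def normalize_filter_names(filter_obj: Any) -> List[str]:
    """Return filter names without the leading slash, e.g. ["FlateDecode"]."""
    if filter_obj is None:
        return []

    # A /Filter entry is either a single name or an array of names.
    items = [filter_obj] if isinstance(filter_obj, str) else list(filter_obj)

    names: List[str] = []
    for item in items:
        if not isinstance(item, str):
            # Ignore elements that are not name-like instead of failing.
            continue
        name = str(item).lstrip("/")
        if name:
            names.append(name)
    return names
```

For example, `normalize_filter_names("/FlateDecode")` yields `["FlateDecode"]`, and an empty array yields an empty list rather than an `IndexError`.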
This commit addresses an issue where CCITTFaxDecode-compressed images were incorrectly processed by the `np.frombuffer().reshape()` logic in PyPDFParser. This happened because the reshape logic assumes raw pixel data with byte-aligned pixels, which is not true for 1-bit CCITTFaxDecode data, and also because pypdf's `get_data()` might return compressed or partially processed streams for such filters.

Changes:
- Modified `PyPDFParser.extract_images_from_page`:
  - Created a `_FILTERS_HANDLED_BY_PILLOW` list that includes `DCTDecode`, `JPXDecode` (from original `_PDF_FILTER_WITH_LOSS`) and now also `CCITTFaxDecode` and `JBIG2Decode`.
  - Images with filters in this list are now processed using `Image.open(io.BytesIO(raw_data))`, leveraging Pillow's decoding capabilities.
  - Added handling for `mode='1'` (1-bit images) when determining channels after Pillow processing.
  - The `np.frombuffer().reshape()` path is now reserved for other filters in `_PDF_FILTER_WITHOUT_LOSS` (e.g., FlateDecode, LZWDecode) where the output of `get_data()` is more likely to be raw pixel data suitable for direct reshaping.
  - Added a safeguard in the reshape path to skip if CCITTFaxDecode or JBIG2Decode inadvertently reach it.

This change makes image extraction more robust for PDFs containing CCITTFaxDecode and JBIG2Decode encoded images.
Fix: Route CCITTFaxDecode and JBIG2Decode to Pillow in PyPDFParser
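The routing decision at this stage of the PR might look something like the sketch below. The set contents are limited to the filter names mentioned in the commit message, and both `RAW_RESHAPE_FILTERS` and `choose_decode_path` are hypothetical names used for illustration. Skipping unknown filters is an assumption of this sketch, not confirmed behavior.

```python
from typing import Iterable

# Filters the commit message says are handed to Pillow at this stage.
FILTERS_HANDLED_BY_PILLOW = {
    "DCTDecode", "JPXDecode", "CCITTFaxDecode", "JBIG2Decode",
}
# Illustrative subset of filters whose decoded output is treated as raw pixels.
RAW_RESHAPE_FILTERS = {"FlateDecode", "LZWDecode"}


def choose_decode_path(filter_names: Iterable[str]) -> str:
    """Return "pillow", "reshape", or "skip" for a list of filter names."""
    names = set(filter_names)
    if names & FILTERS_HANDLED_BY_PILLOW:
        # Any Pillow-handled filter wins; this also acts as the safeguard
        # that keeps CCITTFaxDecode and JBIG2Decode out of the reshape path.
        return "pillow"
    if names <= RAW_RESHAPE_FILTERS:
        # Also true for an empty set, i.e. an unfiltered (raw) stream.
        return "reshape"
    # Assumption: unknown filters are skipped rather than guessed at.
    return "skip"
```

Checking the Pillow set first means a stream with mixed filters such as `["FlateDecode", "DCTDecode"]` is never reshaped as raw pixels.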
This commit further enhances image extraction robustness in `PyPDFParser` by routing more filter types to be processed by Pillow (`Image.open()`). This addresses issues where filters like `FlateDecode` might produce a complete image file stream (e.g., a PNG) rather than raw pixel data suitable for direct `reshape`.

Changes:
- Expanded the `_FILTERS_HANDLED_BY_PILLOW` list in `PyPDFParser.extract_images_from_page` to include:
  - `FlateDecode` (and its abbreviation `Fl`)
  - `LZWDecode` (and its abbreviation `LZW`)
  - `RunLengthDecode` (and its abbreviation `RL`)
- Data streams processed with these filters are now first passed to Pillow. This is more likely to succeed if the stream is a standard image format (e.g., PNG, TIFF, GIF).
- Enhanced the Pillow processing path to better handle various image modes encountered after `Image.open()`, including converting Palette (`P`) images and other modes (like `CMYK`) to `RGB` or `RGBA` before converting to a NumPy array.
- The `np.frombuffer().reshape()` path is now reserved for a much smaller set of circumstances, primarily for filters not explicitly handled by Pillow where the data might be raw, unformatted pixel data.

This change significantly increases the chances of successfully extracting images that use common non-lossy compression filters which might wrap standard image file formats, reducing `reshape` errors and improving overall image extraction reliability.
Fix: Route more filters (FlateDecode, LZW, RLE) to Pillow in PyPDFParser
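As a rough illustration of the Pillow path and the mode handling described across these commits, the sketch below opens a byte stream, normalizes the mode, drops a fully opaque alpha channel, and re-encodes as PNG. The function name `to_png_bytes` is hypothetical, and the exact mode-conversion rules in the PR may differ from the ones assumed here.

```python
import io
from typing import Optional

from PIL import Image


def to_png_bytes(raw: bytes) -> Optional[bytes]:
    """Decode an image stream with Pillow and return PNG bytes, or None."""
    try:
        img = Image.open(io.BytesIO(raw))
        img.load()  # force decoding so truncated data fails here
    except (OSError, ValueError):
        # Not a recognizable image file: the caller can skip or fall back.
        return None

    if img.mode == "P":
        # Assumption: keep transparency for palette images that declare it.
        img = img.convert("RGBA" if "transparency" in img.info else "RGB")
    elif img.mode not in ("1", "L", "RGB", "RGBA"):
        # Other modes such as CMYK are converted to RGB.
        img = img.convert("RGB")

    # A fully opaque alpha channel carries no information, so drop it.
    if img.mode == "RGBA" and img.getchannel("A").getextrema() == (255, 255):
        img = img.convert("RGB")

    out = io.BytesIO()
    img.save(out, format="PNG")
    return out.getvalue()
```

Returning `None` for unrecognized data leaves room for the narrower `np.frombuffer().reshape()` path to be tried afterwards, in line with the ordering the commit describes.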
Summary
This pull request significantly enhances the robustness of image extraction in the `PDFParser` module. These changes improve compatibility when parsing PDFs containing diverse or non-standard embedded image formats.
Changes
- Updated `extract_images_from_page()` in `pdf.py` to catch and handle reshape errors gracefully.
- Renamed a variable (`obj` → `obj_name`) to improve clarity and avoid potential name clashes.
Why It Matters
PDF documents often embed image streams in complex or irregular encodings. Without proper handling, such cases can crash document loaders or break downstream document indexing pipelines. This PR ensures:
Testing
- `PyMuPDF` and `PyPDF2` backends.
Related