Fix: Improve image extraction stability in PDF parser by handling reshape errors and routing image filters #130

skyking363 · 2025-06-22T07:51:48Z

Summary

This pull request significantly enhances the robustness of image extraction in the PDFParser module by:

Routing more image filters (e.g., CCITTFaxDecode, JBIG2Decode, FlateDecode, LZWDecode, RunLengthDecode) to Pillow.
Adding error handling for numpy reshape failures when processing corrupted or malformed image streams.
Refactoring variable names for clarity and readability.

These changes improve compatibility when parsing PDFs containing diverse or non-standard embedded image formats.

Changes

Updated extract_images_from_page() in pdf.py to catch and handle reshape errors gracefully.
Routed additional image filters (CCITTFaxDecode, JBIG2Decode, FlateDecode, LZWDecode, RunLengthDecode) to the Pillow decoding path.
Renamed loop variables (e.g., obj → obj_name) to improve clarity and avoid potential name clashes.
Improved logging context and stability when working with raw image data from PDF XObjects.

Why It Matters

PDF documents often embed image streams in complex or irregular encodings. Without proper handling, such cases can crash document loaders or break downstream document indexing pipelines. This PR ensures:

More PDFs can be successfully parsed without manual cleaning.
Edge cases with malformed image metadata will fail gracefully rather than cause unhandled exceptions.
The system becomes more production-ready for RAG ingestion or document QA applications.

Testing

Manually validated against PDFs containing JBIG2, CCITT, and broken image encodings.
Confirmed compatibility with PyMuPDF and PyPDF2 backends.

This commit addresses a `ValueError: cannot reshape array of size` that could occur in `PyMuPDFParser` and `PyPDFParser` when processing images embedded in PDF files. The main causes were:
- Mismatch between image data length and dimensions (height, width, channels) declared in PDF metadata.
- Attempts to reshape data that was not a clean multiple of H*W*C.

Changes include:

For PyMuPDFParser:
- Validates `pix.samples` length against `pix.height * pix.width * pix.n`.
- Uses `pix.n` directly in reshape instead of -1 if valid (1, 3, 4).
- Adds fallback to infer channels if `pix.n` is problematic but data size is a multiple of H*W.
- Skips image with detailed warning if data is inconsistent.
- Converts numpy array to PIL Image and saves as PNG for the `images_parser`, instead of `numpy.save` as .npy.
- Handles RGBA images with fully opaque alpha by converting to RGB.
- Handles cases where `pix.n` might be 0 by attempting to infer channels.

For PyPDFParser:
- For images with filters in `_PDF_FILTER_WITHOUT_LOSS`:
  - Validates raw data length against `height * width`.
  - Infers channels from `actual_size / (height * width)`.
  - Skips image if inferred channels are not 1, 3, or 4.
  - Uses inferred channels in reshape.
- For images with filters in `_PDF_FILTER_WITH_LOSS` (e.g., JPEG):
  - Improves error handling around `Image.open()`.
- Standardizes post-processing for both filter types:
  - Converts numpy array (or PIL image from `Image.open`) to PIL Image.
  - Adjusts PIL image mode based on (inferred) channels (L, RGB, RGBA).
  - Handles RGBA images with fully opaque alpha by converting to RGB.
  - Saves image as PNG for the `images_parser`.
- Improved parsing of the `/Filter` attribute.
- Adds extensive logging for skipped or problematic images.

These changes make the PDF image extraction more robust by gracefully handling malformed or unusual image data instead of raising a ValueError, and by standardizing the image format passed to the subsequent image parser.

Fix: Handle image reshape errors in PDF parsers
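The length validation and channel inference described in the commit message above can be sketched as a small pure-Python helper. This is a minimal illustration under stated assumptions, not the parser's actual code: `infer_shape` and `VALID_CHANNELS` are hypothetical names, and the caller is assumed to pass `len(pix.samples)`, `pix.height`, `pix.width`, and `pix.n` (or the equivalent values read from a pypdf image XObject).

```python
from typing import Optional, Tuple

# Channel counts the commit message treats as valid (grey, RGB, RGBA).
VALID_CHANNELS = (1, 3, 4)


def infer_shape(
    data_len: int,
    height: int,
    width: int,
    declared_channels: Optional[int] = None,
) -> Optional[Tuple[int, int, int]]:
    """Return (height, width, channels) if the buffer is consistent, else None.

    Hypothetical helper: returning None means "skip this image and log a
    warning" instead of letting reshape raise ValueError.
    """
    if height <= 0 or width <= 0:
        return None
    pixels = height * width

    # Trust the declared channel count only when it matches the data exactly.
    if declared_channels in VALID_CHANNELS and data_len == pixels * declared_channels:
        return (height, width, declared_channels)

    # Fallback: infer channels when the data is a clean multiple of H*W.
    if data_len % pixels != 0:
        return None
    channels = data_len // pixels
    if channels not in VALID_CHANNELS:
        return None
    return (height, width, channels)
```

A caller would reshape only when a shape comes back, for example `np.frombuffer(samples, dtype=np.uint8).reshape(shape)`, and otherwise skip the image with a warning.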

This commit fixes an error in PyPDFParser where an attempt was made to access a non-existent `.name` attribute on a `pypdf.generic.NameObject` when extracting image filter names. The correct way to get the string value of a NameObject is to convert it directly to a string (e.g., `str(name_object)`).

Changes:
- Modified `PyPDFParser.extract_images_from_page` to use `str(filter_name_obj)` to get the filter name string (e.g., "/FlateDecode").
- Ensured that the leading slash is removed from the filter name.
- Added checks for empty filter arrays and non-NameObject elements within filter arrays.

This resolves the error "AttributeError: 'NameObject' object has no attribute 'name'" and improves the robustness of image filter parsing.

Fix: Correct NameObject handling in PyPDFParser for image filters
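The filter-name handling described in the commit message above might look roughly like the sketch below. It is an illustration only: `normalize_filters` is a hypothetical name, and plain `str` and `list` values stand in for pypdf's name and array objects so the example runs without pypdf installed. The `isinstance(item, str)` check is an assumed stand-in for the "non-NameObject" check mentioned in the commit.

```python
from typing import Any, List


def normalize_filters(filter_value: Any) -> List[str]:
    """Turn a /Filter entry into a list of names without the leading slash.

    Accepts None (no filter), a single name, or an array of names.
    Hypothetical helper; plain strings stand in for pypdf name objects.
    """
    if filter_value is None:
        return []

    # A single name and an array of names are both legal /Filter values.
    if isinstance(filter_value, (list, tuple)):
        items = list(filter_value)
    else:
        items = [filter_value]

    names: List[str] = []
    for item in items:
        # Skip elements that are not name-like instead of raising.
        if not isinstance(item, str):
            continue
        text = str(item)  # str(...) rather than a non-existent .name attribute
        if text.startswith("/"):
            text = text[1:]
        if text:
            names.append(text)
    return names
```

For example, `normalize_filters("/FlateDecode")` yields `["FlateDecode"]`, and an empty array yields an empty list, which the caller can treat as "no filter".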

This commit addresses an issue where CCITTFaxDecode-compressed images were incorrectly processed by the `np.frombuffer().reshape()` logic in PyPDFParser. This happened because the reshape logic assumes raw pixel data with byte-aligned pixels, which is not true for 1-bit CCITTFaxDecode data, and also because pypdf's `get_data()` might return compressed or partially processed streams for such filters.

Changes:
- Modified `PyPDFParser.extract_images_from_page`:
  - Created a `_FILTERS_HANDLED_BY_PILLOW` list that includes `DCTDecode`, `JPXDecode` (from original `_PDF_FILTER_WITH_LOSS`) and now also `CCITTFaxDecode` and `JBIG2Decode`.
  - Images with filters in this list are now processed using `Image.open(io.BytesIO(raw_data))`, leveraging Pillow's decoding capabilities.
  - Added handling for `mode='1'` (1-bit images) when determining channels after Pillow processing.
  - The `np.frombuffer().reshape()` path is now reserved for other filters in `_PDF_FILTER_WITHOUT_LOSS` (e.g., FlateDecode, LZWDecode) where the output of `get_data()` is more likely to be raw pixel data suitable for direct reshaping.
  - Added a safeguard in the reshape path to skip if CCITTFaxDecode or JBIG2Decode inadvertently reach it.

This change makes image extraction more robust for PDFs containing CCITTFaxDecode and JBIG2Decode encoded images.

Fix: Route CCITTFaxDecode and JBIG2Decode to Pillow in PyPDFParser
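To make the byte-alignment point in the commit message above concrete, the sketch below computes how many bytes fully decoded PDF image samples occupy when each row is padded to a whole byte, and compares that with the `height * width` size a one-byte-per-pixel reshape expects. The function and parameter names are hypothetical and chosen for illustration; only the arithmetic is the point.

```python
def row_stride_bytes(width: int, bits_per_component: int, components: int) -> int:
    """Bytes per image row when each row is padded up to a byte boundary."""
    bits_per_row = width * components * bits_per_component
    return (bits_per_row + 7) // 8  # ceiling division by 8


def expected_decoded_size(
    width: int,
    height: int,
    bits_per_component: int = 8,
    components: int = 1,
) -> int:
    """Total bytes of decoded sample data for an image of the given geometry."""
    return height * row_stride_bytes(width, bits_per_component, components)


# A 10x4 bilevel image (1 bit per pixel, as produced by CCITTFaxDecode):
packed = expected_decoded_size(10, 4, bits_per_component=1)  # 2 bytes/row * 4 rows
naive = 10 * 4  # what a one-byte-per-pixel reshape would require
assert packed == 8 and naive == 40
```

Because the packed size is far smaller than `height * width`, a direct reshape of such data cannot succeed, which is consistent with routing these filters to Pillow instead.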

This commit further enhances image extraction robustness in `PyPDFParser` by routing more filter types to be processed by Pillow (`Image.open()`). This addresses issues where filters like `FlateDecode` might produce a complete image file stream (e.g., a PNG) rather than raw pixel data suitable for direct `reshape`.

Changes:
- Expanded the `_FILTERS_HANDLED_BY_PILLOW` list in `PyPDFParser.extract_images_from_page` to include:
  - `FlateDecode` (and its abbreviation `Fl`)
  - `LZWDecode` (and its abbreviation `LZW`)
  - `RunLengthDecode` (and its abbreviation `RL`)
- This means that data streams processed with these filters will now first be attempted to be opened by Pillow. This is more likely to succeed if the stream is a standard image format (e.g., PNG, TIFF, GIF).
- Enhanced the Pillow processing path to better handle various image modes encountered after `Image.open()`, including converting Palette (`P`) images and other modes (like `CMYK`) to `RGB` or `RGBA` before converting to a NumPy array.
- The `np.frombuffer().reshape()` path is now reserved for a much smaller set of circumstances, primarily for filters not explicitly handled by Pillow where the data might be raw, unformatted pixel data.

This change significantly increases the chances of successfully extracting images that use common non-lossy compression filters which might wrap standard image file formats, reducing `reshape` errors and improving overall image extraction reliability.

Fix: Route more filters (FlateDecode, LZW, RLE) to Pillow in PyPDFParser
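The routing decision described in the commit message above can be sketched as a set-membership check. This is a minimal illustration, not the code from the PR: the set contents are reconstructed from the commit messages, and `choose_decoder` and its return labels are hypothetical. Filter names are assumed to have already had their leading slash removed.

```python
from typing import Iterable

# Membership reconstructed from the commit messages; the real list lives in
# PyPDFParser.extract_images_from_page and may differ in detail.
_FILTERS_HANDLED_BY_PILLOW = frozenset({
    "DCTDecode", "JPXDecode",        # lossy filters from the original list
    "CCITTFaxDecode", "JBIG2Decode",  # bilevel filters added earlier
    "FlateDecode", "Fl",
    "LZWDecode", "LZW",
    "RunLengthDecode", "RL",
})


def choose_decoder(filter_names: Iterable[str]) -> str:
    """Return "pillow" or "reshape" for an image's filter chain.

    Hypothetical helper: "pillow" means try Image.open() on the stream,
    "reshape" means treat the stream as raw pixel data.
    """
    if any(name in _FILTERS_HANDLED_BY_PILLOW for name in filter_names):
        return "pillow"
    return "reshape"
```

With this routing, an image with no filter or with an unlisted filter such as `ASCIIHexDecode` still falls through to the reshape path, while a chain like `["ASCII85Decode", "Fl"]` goes to Pillow.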

google-labs-jules bot and others added 8 commits June 21, 2025 12:41

Merge pull request #1 from skyking363/fix/pdf-image-reshape-error

93faa1b

Fix: Handle image reshape errors in PDF parsers

Merge pull request #2 from skyking363/fix/pdf-image-reshape-error

6e1d642

Fix: Correct NameObject handling in PyPDFParser for image filters

Merge pull request #3 from skyking363/fix/pdf-image-reshape-error

f2a622d

Fix: Route CCITTFaxDecode and JBIG2Decode to Pillow in PyPDFParser

Merge pull request #4 from skyking363/fix/pdf-image-reshape-error

c26e100

Fix: Route more filters (FlateDecode, LZW, RLE) to Pillow in PyPDFParser
