Fix: Improve image extraction stability in PDF parser by handling reshape errors and routing image filters #130

Open: wants to merge 8 commits into base: main


skyking363

Summary

This pull request significantly enhances the robustness of image extraction in the PDFParser module by:

  • Routing more image filters (e.g., CCITTFaxDecode, JBIG2Decode, FlateDecode, LZW, RLE) to Pillow.
  • Adding error handling for numpy reshape failures when processing corrupted or malformed image streams.
  • Refactoring variable names for clarity and readability.

These changes improve compatibility when parsing PDFs containing diverse or non-standard embedded image formats.


Changes

  • Updated extract_images_from_page() in pdf.py to catch and handle reshape errors gracefully.
  • Routed more image filters (CCITTFaxDecode, JBIG2Decode, FlateDecode, LZW, RLE) to the Pillow rendering pipeline.
  • Renamed loop variables (e.g., obj to obj_name) to improve clarity and avoid potential name clashes.
  • Improved logging context and stability when working with raw image data from PDF XObjects.

Why It Matters

PDF documents often embed image streams in complex or irregular encodings. Without proper handling, such cases can crash document loaders or break downstream document indexing pipelines. This PR ensures:

  • More PDFs can be successfully parsed without manual cleaning.
  • Edge cases with malformed image metadata will fail gracefully rather than cause unhandled exceptions.
  • The system becomes more production-ready for RAG ingestion or document QA applications.

Testing

  • Manually validated against PDFs containing JBIG2, CCITT, and broken image encodings.
  • Confirmed compatibility with PyMuPDF and PyPDF2 backends.

Related

  • Refers to internal LangChain doc ingestion issues when handling scanned PDFs
  • Supersedes ad-hoc error workarounds by offering centralized, reusable fix logic

google-labs-jules bot and others added 8 commits June 21, 2025 12:41
This commit addresses a `ValueError: cannot reshape array of size` that could occur in `PyMuPDFParser` and `PyPDFParser` when processing images embedded in PDF files.

The main causes were:
- Mismatch between image data length and dimensions (height, width, channels) declared in PDF metadata.
- Attempts to reshape data that was not a clean multiple of H*W*C.

Changes include:

For PyMuPDFParser:
- Validates `pix.samples` length against `pix.height * pix.width * pix.n`.
- Uses `pix.n` directly in reshape instead of -1 if valid (1,3,4).
- Adds fallback to infer channels if `pix.n` is problematic but data size is a multiple of H*W.
- Skips image with detailed warning if data is inconsistent.
- Converts numpy array to PIL Image and saves as PNG for the `images_parser`, instead of `numpy.save` as .npy.
- Handles RGBA images with fully opaque alpha by converting to RGB.
- Handles cases where pix.n might be 0 by attempting to infer channels.
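
The size check and channel fallback described above could look roughly like the sketch below. It is not the code from this PR: `resolve_channels` and `VALID_CHANNELS` are hypothetical names, and the function works on plain integers (buffer length, height, width, declared channel count) so that it runs without PyMuPDF or NumPy.

```python
from typing import Optional

# Channel counts treated as valid: grayscale, RGB, RGBA.
VALID_CHANNELS = (1, 3, 4)


def resolve_channels(
    sample_len: int, height: int, width: int, declared_n: int
) -> Optional[int]:
    """Return a channel count that is safe to reshape with, or None to skip."""
    pixels = height * width
    if pixels <= 0:
        return None  # degenerate dimensions, nothing to reshape

    # Trust the declared count only if it is plausible and the buffer
    # length matches H * W * C exactly.
    if declared_n in VALID_CHANNELS and sample_len == pixels * declared_n:
        return declared_n

    # Fallback: infer channels when the buffer is a clean multiple of H * W
    # (this also covers a declared count of 0).
    if sample_len % pixels == 0:
        inferred = sample_len // pixels
        if inferred in VALID_CHANNELS:
            return inferred

    return None  # inconsistent data: the caller should log a warning and skip
```

A caller would pass `len(pix.samples)`, `pix.height`, `pix.width`, and `pix.n`, and reshape to `(height, width, channels)` only when the result is not `None`.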

For PyPDFParser:
- For images with filters in `_PDF_FILTER_WITHOUT_LOSS`:
    - Validates raw data length against `height * width`.
    - Infers channels from `actual_size / (height * width)`.
    - Skips image if inferred channels are not 1, 3, or 4.
    - Uses inferred channels in reshape.
- For images with filters in `_PDF_FILTER_WITH_LOSS` (e.g., JPEG):
    - Improves error handling around `Image.open()`.
- Standardizes post-processing for both filter types:
    - Converts numpy array (or PIL image from `Image.open`) to PIL Image.
    - Adjusts PIL image mode based on (inferred) channels (L, RGB, RGBA).
    - Handles RGBA images with fully opaque alpha by converting to RGB.
    - Saves image as PNG for the `images_parser`.
- Improved parsing of the `/Filter` attribute.
- Adds extensive logging for skipped or problematic images.
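
As an illustration of the "fully opaque alpha" handling listed above, here is a minimal sketch on a raw interleaved RGBA byte buffer. The helper name `drop_opaque_alpha` is hypothetical, and the PR itself works with numpy arrays and PIL images rather than plain bytes; the byte-level version is used here only so the example runs without dependencies.

```python
from typing import Tuple


def drop_opaque_alpha(rgba: bytes) -> Tuple[str, bytes]:
    """Return ("RGB", data) if every alpha byte is 255, else ("RGBA", data)."""
    if len(rgba) % 4 != 0:
        raise ValueError("RGBA buffer length must be a multiple of 4")

    # Every fourth byte, starting at offset 3, is an alpha value.
    if any(alpha != 255 for alpha in rgba[3::4]):
        return "RGBA", rgba  # real transparency present, keep the alpha channel

    # Alpha carries no information: copy only the R, G, B bytes of each pixel.
    rgb = bytearray()
    for start in range(0, len(rgba), 4):
        rgb += rgba[start:start + 3]
    return "RGB", bytes(rgb)
```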

These changes make the PDF image extraction more robust by gracefully handling malformed or unusual image data instead of raising a ValueError, and by standardizing the image format passed to the subsequent image parser.
Fix: Handle image reshape errors in PDF parsers
This commit fixes an error in PyPDFParser where an attempt was made to access a non-existent `.name` attribute on a `pypdf.generic.NameObject` when extracting image filter names. The correct way to get the string value of a NameObject is to convert it directly to a string (e.g., `str(name_object)`).

Changes:
- Modified `PyPDFParser.extract_images_from_page` to use `str(filter_name_obj)` to get the filter name string (e.g., "/FlateDecode").
- Ensured that the leading slash is removed from the filter name.
- Added checks for empty filter arrays and non-NameObject elements within filter arrays.

This resolves the error "AttributeError: 'NameObject' object has no attribute 'name'" and improves the robustness of image filter parsing.
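
The filter-name handling described in this commit might be sketched as follows. `NameStandIn` is a hypothetical stand-in for `pypdf.generic.NameObject`, assumed here to behave like a `str` subclass, and `filter_names` is an illustrative helper rather than a function from the PR.

```python
from typing import Any, List


class NameStandIn(str):
    """Hypothetical stand-in for pypdf.generic.NameObject (assumed str-like)."""


def filter_names(filter_entry: Any) -> List[str]:
    """Normalize a /Filter entry (missing, single name, or array) to plain names."""
    if filter_entry is None:
        return []

    # A single name is treated as a one-element array.
    if isinstance(filter_entry, (list, tuple)):
        entries = filter_entry
    else:
        entries = [filter_entry]

    names: List[str] = []
    for item in entries:
        if not isinstance(item, str):
            continue  # skip elements that are not name-like
        name = str(item).lstrip("/")  # "/FlateDecode" becomes "FlateDecode"
        if name:
            names.append(name)
    return names
```

An empty result means no usable filter was declared, which the caller can treat as its own case instead of indexing into an empty array.
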
Fix: Correct NameObject handling in PyPDFParser for image filters
This commit addresses an issue where CCITTFaxDecode-compressed images
were incorrectly processed by the np.frombuffer().reshape() logic in
PyPDFParser. This happened because the reshape logic assumes raw pixel
data with byte-aligned pixels, which is not true for 1-bit CCITTFaxDecode
data, and also because pypdf's get_data() might return compressed or
partially processed streams for such filters.

Changes:
- Modified `PyPDFParser.extract_images_from_page`:
    - Created a `_FILTERS_HANDLED_BY_PILLOW` list that includes
      `DCTDecode`, `JPXDecode` (from original _PDF_FILTER_WITH_LOSS)
      and now also `CCITTFaxDecode` and `JBIG2Decode`.
    - Images with filters in this list are now processed using
      `Image.open(io.BytesIO(raw_data))`, leveraging Pillow's
      decoding capabilities.
    - Added handling for `mode='1'` (1-bit images) when determining
      channels after Pillow processing.
- The `np.frombuffer().reshape()` path is now reserved for other filters
  in `_PDF_FILTER_WITHOUT_LOSS` (e.g., FlateDecode, LZWDecode) where
  the output of `get_data()` is more likely to be raw pixel data suitable
  for direct reshaping.
- Added a safeguard in the reshape path to skip if CCITTFaxDecode or
  JBIG2Decode inadvertently reach it.

This change makes image extraction more robust for PDFs containing
CCITTFaxDecode and JBIG2Decode encoded images.
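
The routing described in this commit can be pictured as a small dispatch on the filter names. The sketch below reuses the `_FILTERS_HANDLED_BY_PILLOW` name from the commit message, but `choose_decode_path` and its return values are hypothetical, and the actual decoding steps are left out.

```python
from typing import Iterable

# Filter set as listed in this commit message.
_FILTERS_HANDLED_BY_PILLOW = frozenset(
    {"DCTDecode", "JPXDecode", "CCITTFaxDecode", "JBIG2Decode"}
)


def choose_decode_path(filters: Iterable[str]) -> str:
    """Return "pillow" if any filter needs Pillow, otherwise "reshape"."""
    # Filter names are expected without the leading slash.
    if any(name in _FILTERS_HANDLED_BY_PILLOW for name in filters):
        return "pillow"  # would go through Image.open(io.BytesIO(raw_data))
    return "reshape"  # raw pixel data, candidate for np.frombuffer().reshape()
```

Because the check covers every filter in the chain, an image declared as `["FlateDecode", "CCITTFaxDecode"]` still goes to Pillow, which matches the safeguard of keeping 1-bit data away from the reshape path.
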
Fix: Route CCITTFaxDecode and JBIG2Decode to Pillow in PyPDFParser
This commit further enhances image extraction robustness in `PyPDFParser`
by routing more filter types to be processed by Pillow (`Image.open()`).
This addresses issues where filters like `FlateDecode` might produce
a complete image file stream (e.g., a PNG) rather than raw pixel data
suitable for direct `reshape`.

Changes:
- Expanded the `_FILTERS_HANDLED_BY_PILLOW` list in
  `PyPDFParser.extract_images_from_page` to include:
    - `FlateDecode` (and its abbreviation `Fl`)
    - `LZWDecode` (and its abbreviation `LZW`)
    - `RunLengthDecode` (and its abbreviation `RL`)
- Pillow now gets the first attempt at opening data streams that use
  these filters. This is more likely to succeed if the stream is a
  standard image format (e.g., PNG, TIFF, GIF).
- Enhanced the Pillow processing path to better handle various image modes
  encountered after `Image.open()`, including converting Palette (`P`)
  images and other modes (like `CMYK`) to `RGB` or `RGBA` before
  converting to a NumPy array.
- The `np.frombuffer().reshape()` path is now reserved for a much smaller
  set of circumstances, primarily for filters not explicitly handled by
  Pillow where the data might be raw, unformatted pixel data.

This change significantly increases the chances of successfully extracting
images that use common non-lossy compression filters which might wrap
standard image file formats, reducing `reshape` errors and improving
overall image extraction reliability.
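
One way to picture the mode handling after `Image.open()` is a small mapping from the Pillow mode string to the mode the image is converted to before it becomes a NumPy array. This is a sketch under assumptions: `target_mode` is a hypothetical helper, treating `1` as `L` and using a transparency flag for palette images are choices made for illustration, and the PR may decide these cases differently.

```python
def target_mode(mode: str, has_transparency: bool = False) -> str:
    """Pick the mode to convert to: "L", "RGB", or "RGBA"."""
    if mode in ("L", "RGB", "RGBA"):
        return mode  # already one of the supported layouts
    if mode == "1":
        return "L"  # assumption: treat 1-bit images as single-channel grayscale
    if mode == "P":
        # Palette images can carry transparency; keep it if present.
        return "RGBA" if has_transparency else "RGB"
    if mode in ("LA", "PA"):
        return "RGBA"  # modes that carry an alpha channel
    return "RGB"  # CMYK, YCbCr, and anything else
```

With Pillow, the result would be used as `img.convert(target_mode(img.mode, "transparency" in img.info))`.
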
Fix: Route more filters (FlateDecode, LZW, RLE) to Pillow in PyPDFParser