do_cell_matching, do_table_structure, and do_ocr Pipeline Options Deeper Explanation #727

JoeHelbing · 2025-01-10T18:54:17Z


I've searched the discussions and dug fairly deep into the documentation and source code, but I can't quite wrap my head around what these attributes do. Perhaps it's assumed knowledge in PDF extraction or NLP that I'm not familiar with? Or I'm blind and missing something obvious.

Can anyone give me a more technical, in-depth explanation of what the flags do_cell_matching, do_table_structure, and do_ocr do when setting up PdfPipelineOptions?

Edit: I think I have a vague idea, more so with do_ocr: it forces actual OCR on image data rather than pulling encoded text from the PDF itself. I'm assuming do_cell_matching and do_table_structure do something similar in principle...

Does leaving do_ocr as False mean it ONLY pulls encoded text and doesn't look for image-based text? Or does it do encoded-text extraction and OCR simultaneously?
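For reference, here is a minimal sketch of where the three flags are set. It assumes the layout used in docling's published examples: do_ocr and do_table_structure are fields directly on PdfPipelineOptions, while do_cell_matching sits one level down under table_structure_options. The helper name make_pdf_options is hypothetical and only for illustration. The sketch shows how the flags are toggled, not what each one does internally.

```python
import importlib.util


def make_pdf_options(do_ocr: bool, do_table_structure: bool, do_cell_matching: bool):
    """Hypothetical helper: build PdfPipelineOptions with the three flags set."""
    # Imported lazily so this file still loads when docling is not installed.
    from docling.datamodel.pipeline_options import PdfPipelineOptions

    opts = PdfPipelineOptions()
    # Two of the flags are top-level fields on the pipeline options object.
    opts.do_ocr = do_ocr
    opts.do_table_structure = do_table_structure
    # Assumption: the third flag is nested under table_structure_options.
    opts.table_structure_options.do_cell_matching = do_cell_matching
    return opts


# Only build the options when docling is actually available.
if importlib.util.find_spec("docling") is not None:
    options = make_pdf_options(
        do_ocr=False, do_table_structure=True, do_cell_matching=True
    )
    print(options.do_ocr, options.do_table_structure)
    print(options.table_structure_options.do_cell_matching)
```

The resulting object would then be handed to the converter through its PDF format options, as in docling's custom-conversion example.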


Replies: 0 comments
