do_cell_matching, do_table_structure, and do_ocr Pipeline Options Deeper Explanation #727
Unanswered
JoeHelbing asked this question in Q&A
Replies: 0 comments
Prefacing this by saying I have searched the discussions and looked quite deeply at the documentation and source code, but I still can't quite wrap my head around what these attributes do. Perhaps it's assumed knowledge in PDF extraction or NLP that I'm not familiar with? Or I'm blind and don't see what's obvious.
Can anyone give me a more technical/deeper explanation of what the flags do_cell_matching, do_table_structure, and do_ocr do when setting up PdfPipelineOptions?
Edit: I think I have a vague idea, more so with do_ocr: that it forces actual OCR from image data rather than pulling encoded text from the PDF itself, and I'm assuming do_cell_matching and do_table_structure do something similar in theory to this...
Does leaving do_ocr as False then mean it ONLY pulls encoded text and doesn't look for image-based text? Or does it do encoded text and OCR simultaneously?
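To make the do_ocr part of the question concrete, here is a toy sketch of the two behaviours I can imagine. Everything in it (reading_a, reading_b, the (text, source) tuples) is hypothetical and made up purely for illustration; none of it is taken from the library's code, and the real behaviour may match neither reading.

```python
from typing import List, Tuple

# Toy model only: hypothetical functions, NOT the library's implementation.
# A span is (text, source), where source is "pdf" (encoded text) or "ocr".
Span = Tuple[str, str]


def reading_a(embedded: List[str], image_only: List[str], do_ocr: bool) -> List[Span]:
    """Reading A: do_ocr switches the source entirely.

    True  -> the whole page is OCR'd and the encoded text is ignored.
    False -> only the encoded text is used.
    """
    if not do_ocr:
        return [(text, "pdf") for text in embedded]
    return [(text, "ocr") for text in embedded + image_only]


def reading_b(embedded: List[str], image_only: List[str], do_ocr: bool) -> List[Span]:
    """Reading B: encoded text is always used; do_ocr adds OCR on top.

    True  -> encoded text plus OCR for text that exists only as pixels.
    False -> only the encoded text is used.
    """
    spans = [(text, "pdf") for text in embedded]
    if do_ocr:
        spans += [(text, "ocr") for text in image_only]
    return spans


if __name__ == "__main__":
    embedded = ["Title", "Body paragraph"]  # text encoded in the PDF
    image_only = ["Scanned caption"]        # text present only in an image
    for flag in (False, True):
        print("do_ocr =", flag)
        print("  reading A:", reading_a(embedded, image_only, flag))
        print("  reading B:", reading_b(embedded, image_only, flag))
```

In this toy model both readings agree when do_ocr is False (image-based text is simply missed) and differ only in where the text comes from when it is True. Which of these, if either, is what actually happens is exactly what I'm asking.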