feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning #290
…t forces a full page OCR scanning and uses the recognized OCR cells instead of any existing PDF cells. Update unit tests. Signed-off-by: Nikos Livathinos <[email protected]>
Please add the example in the mkdocs index mkdocs.yml.
docs/examples/full_page_ocr.py
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseDocumentBackend,
I would remove the hard-coded backend, since it is an example about OCR.
LGTM!
@nikos-livathinos Maybe let's add a CLI parameter to enforce-ocr.
…ull_page_ocr.py example in mkdocs Signed-off-by: Nikos Livathinos <[email protected]>
7234dc3
On certain occasions, the user may want to force full-page OCR and ignore the text contained in a programmatic PDF (see issue #185).
This PR introduces the parameter OcrOptions.force_full_page_ocr, which implements this feature.
Please check this example, which demonstrates how to force OCR: https://github.com/DS4SD/docling/blob/force_ocr/docs/examples/full_page_ocr.py
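As a rough illustration of the behavior described above (recognized OCR cells are used instead of any existing PDF cells when the flag is set), here is a minimal, library-free sketch. The names `Cell`, `OcrOptionsSketch`, and `select_cells` are hypothetical and are not part of docling; only the flag name `force_full_page_ocr` comes from this PR. The non-forced branch is a simplifying assumption and does not reproduce the library's actual merging logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    """A piece of page text plus where it came from (hypothetical type)."""
    text: str
    source: str  # "pdf" for embedded programmatic text, "ocr" for recognized text


@dataclass
class OcrOptionsSketch:
    """Stand-in for an options object; only the flag name mirrors the PR."""
    force_full_page_ocr: bool = False


def select_cells(
    pdf_cells: List[Cell],
    ocr_cells: List[Cell],
    options: OcrOptionsSketch,
) -> List[Cell]:
    """Pick the cells a downstream pipeline stage would see.

    With force_full_page_ocr=True, the embedded PDF cells are discarded
    and only the OCR cells are kept. Otherwise (assumption for this
    sketch) both sets are kept, PDF cells first.
    """
    if options.force_full_page_ocr:
        return list(ocr_cells)
    return list(pdf_cells) + list(ocr_cells)


# Usage: a page whose embedded text layer is garbled but whose OCR is clean.
pdf_cells = [Cell("g@rbl3d t3xt", "pdf")]
ocr_cells = [Cell("garbled text", "ocr")]

forced = select_cells(pdf_cells, ocr_cells, OcrOptionsSketch(force_full_page_ocr=True))
default = select_cells(pdf_cells, ocr_cells, OcrOptionsSketch())
```

With the flag set, `forced` contains only the OCR cell, which is the case issue #185 asks for: the text layer of the PDF is ignored entirely.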
Issue resolved by this Pull Request:
Resolves #185
Checklist:
conventional commits.