feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning #290

Merged
nikos-livathinos merged 5 commits into main from force_ocr on Nov 12, 2024

Conversation

nikos-livathinos
Collaborator

On certain occasions the user may want to force full-page OCR and ignore the text embedded in a programmatic PDF (see issue #185).

This PR introduces the parameter OcrOptions.force_full_page_ocr that implements this feature.

Please check this example that demonstrates how to force OCR: https://github.com/DS4SD/docling/blob/force_ocr/docs/examples/full_page_ocr.py
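
For readers who do not want to open the linked file, here is a minimal sketch of how the new parameter might be used. It assumes the docling v2 module layout (`docling.datamodel.pipeline_options`, `docling.document_converter`) and the EasyOCR engine; the function name `build_force_ocr_converter` is made up for illustration and is not part of this PR. The linked example remains the authoritative version.

```python
def build_force_ocr_converter():
    """Return a DocumentConverter that OCRs every PDF page in full.

    Hypothetical helper, not part of docling. Imports are kept inside the
    function so the module can be loaded even where docling is not installed.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # The parameter introduced by this PR: OCR the whole page and ignore
    # the text embedded in the programmatic PDF.
    pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

    # No backend is hard-coded here; the default PDF backend is used.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

A caller would then run something like `build_force_ocr_converter().convert("scanned.pdf").document.export_to_markdown()`, where `scanned.pdf` is a placeholder path.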

Issue resolved by this Pull Request:
Resolves #185

Checklist:

  • Commit Message Formatting: Commit titles and messages follow the
    Conventional Commits guidelines.
  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

…t forces a full page OCR

scanning and uses the recognized OCR cells instead of any existing PDF cells. Update unit tests.

Signed-off-by: Nikos Livathinos <[email protected]>
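
The behavior described in the commit message above (use the recognized OCR cells instead of any existing PDF cells) can be illustrated with a small stand-alone sketch. The `Cell` type and `select_cells` function are hypothetical and are not docling's implementation; in particular, the non-forced branch is simplified to "keep the PDF cells and append the OCR cells", which glosses over how the real pipeline handles overlaps.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    """Hypothetical text cell; `source` is either "pdf" or "ocr"."""
    text: str
    source: str


def select_cells(
    pdf_cells: List[Cell], ocr_cells: List[Cell], force_full_page_ocr: bool
) -> List[Cell]:
    """Choose which text cells a page keeps (illustrative only)."""
    if force_full_page_ocr:
        # Forced mode: discard the programmatic PDF text entirely.
        return list(ocr_cells)
    # Simplified default: keep the PDF text and add whatever OCR found.
    return list(pdf_cells) + list(ocr_cells)


pdf_cells = [Cell("g@rbl3d", "pdf")]
ocr_cells = [Cell("readable", "ocr")]
forced = select_cells(pdf_cells, ocr_cells, force_full_page_ocr=True)
default = select_cells(pdf_cells, ocr_cells, force_full_page_ocr=False)
```

With the flag set, only the OCR cell survives, which is the case issue #185 asks for: a PDF whose embedded text is unreadable but whose rendered page is legible.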
Contributor

@dolfim-ibm dolfim-ibm left a comment

Please add the example to the mkdocs index in mkdocs.yml.

    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseDocumentBackend,
Contributor

I would remove the hard-coded backend, since this example is about OCR.

PeterStaar-IBM
PeterStaar-IBM previously approved these changes Nov 11, 2024
Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

LGTM!

maxmnemonic
maxmnemonic previously approved these changes Nov 11, 2024
@PeterStaar-IBM
Contributor

@nikos-livathinos Maybe let's also add a CLI parameter to enforce OCR.

…ull_page_ocr.py example in mkdocs

Signed-off-by: Nikos Livathinos <[email protected]>
@nikos-livathinos nikos-livathinos merged commit c6b3763 into main Nov 12, 2024
7 checks passed
@nikos-livathinos nikos-livathinos deleted the force_ocr branch November 12, 2024 08:46
Development

Successfully merging this pull request may close these issues.

Docling Produces Unreadable Text Output for PDF with non-standard Font Encoding, OCR Appears Not to be Applied
5 participants