feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning #290

nikos-livathinos · 2024-11-10T15:21:05Z

On certain occasions the user may want to force full-page OCR and ignore the text embedded in a programmatic PDF (see issue #185).

This PR introduces the parameter OcrOptions.force_full_page_ocr that implements this feature.

Please check this example that demonstrates how to force OCR: https://github.com/DS4SD/docling/blob/force_ocr/docs/examples/full_page_ocr.py
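
For readers who only want the idea, here is a minimal, self-contained sketch of the behaviour described above. It does not import docling and is not the code from this PR: `Cell`, `select_cells`, and the `OcrOptions` dataclass below are hypothetical stand-ins, and the default branch (keep PDF text, append OCR cells) is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """A piece of page text (hypothetical stand-in, not docling's type)."""
    text: str
    source: str  # "pdf" for programmatic text, "ocr" for recognized text


@dataclass
class OcrOptions:
    """Stand-in options object; only the new flag is modelled here."""
    force_full_page_ocr: bool = False


def select_cells(
    pdf_cells: List[Cell], ocr_cells: List[Cell], options: OcrOptions
) -> List[Cell]:
    """Return the cells that later pipeline stages would see."""
    if options.force_full_page_ocr:
        # Forced mode: ignore the embedded PDF text and use OCR cells only.
        return list(ocr_cells)
    # Assumed default for this sketch: keep PDF text and append OCR cells.
    return list(pdf_cells) + list(ocr_cells)


# A PDF whose embedded text is unreadable (as in issue #185) next to OCR output.
pdf_cells = [Cell(text="\ufffd\ufffd\ufffd", source="pdf")]
ocr_cells = [Cell(text="Hello", source="ocr")]

forced = select_cells(pdf_cells, ocr_cells, OcrOptions(force_full_page_ocr=True))
default = select_cells(pdf_cells, ocr_cells, OcrOptions())
```

With the flag set, `forced` contains only the OCR cell, so the unreadable embedded text never reaches the output. For the real API usage, see the linked `full_page_ocr.py` example.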

Issue resolved by this Pull Request:
Resolves #185

Checklist:

Commit Message Formatting: Commit titles and messages follow the conventional commits guidelines.
Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

dolfim-ibm

Please add the example to the mkdocs index in mkdocs.yml.

dolfim-ibm · 2024-11-10T20:47:58Z

docs/examples/full_page_ocr.py

+        format_options={
+            InputFormat.PDF: PdfFormatOption(
+                pipeline_options=pipeline_options,
+                backend=DoclingParseDocumentBackend,


I would remove the hard-coded backend, since it is an example about OCR.

PeterStaar-IBM

LGTM!

PeterStaar-IBM · 2024-11-11T08:38:36Z

@nikos-livathinos Maybe we should also add a CLI parameter to enforce OCR.
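
As a rough illustration of this suggestion, the sketch below maps a command-line switch onto the option. It uses the standard library's `argparse`; docling's actual CLI may be built differently. The program name `convert-sketch` and the helper `ocr_settings` are hypothetical, and the flag name `--force-ocr` follows the later commit title in this PR.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with one positional source and one OCR switch."""
    parser = argparse.ArgumentParser(prog="convert-sketch")
    parser.add_argument("source", help="path to the input document")
    parser.add_argument(
        "--force-ocr",
        action=argparse.BooleanOptionalAction,  # also provides --no-force-ocr
        default=False,
        help="replace any existing PDF text with full-page OCR output",
    )
    return parser


def ocr_settings(argv: List[str]) -> Dict[str, bool]:
    """Translate CLI arguments into keyword settings for the OCR options."""
    args = build_parser().parse_args(argv)
    return {"force_full_page_ocr": args.force_ocr}


enabled = ocr_settings(["report.pdf", "--force-ocr"])
disabled = ocr_settings(["report.pdf"])
```

Keeping the default at `False` means existing invocations behave as before, and forcing OCR is an explicit opt-in.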

nikos-livathinos added 2 commits November 10, 2024 15:20

feat(OCR): Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning and uses the recognized OCR cells instead of any existing PDF cells. Update unit tests.

dea1d91

Signed-off-by: Nikos Livathinos

chore(examples): Add example how to force OCR

1963e71

Signed-off-by: Nikos Livathinos

nikos-livathinos self-assigned this Nov 10, 2024

nikos-livathinos requested review from PeterStaar-IBM, cau-git and dolfim-ibm November 10, 2024 15:21

nikos-livathinos mentioned this pull request Nov 10, 2024

Docling Produces Unreadable Text Output for PDF with non-standard Font Encoding, OCR Appears Not to be Applied #185

Closed

dolfim-ibm reviewed Nov 10, 2024

PeterStaar-IBM previously approved these changes Nov 11, 2024

maxmnemonic mentioned this pull request Nov 11, 2024

Handle vector-image-converted text in PDFs #261

Open

maxmnemonic previously approved these changes Nov 11, 2024

feat: Introduce the force-ocr cmd parameter in docling cli. Add the full_page_ocr.py example in mkdocs

7234dc3

Signed-off-by: Nikos Livathinos

nikos-livathinos dismissed stale reviews from maxmnemonic and PeterStaar-IBM via 7234dc3 November 11, 2024 13:14

nikos-livathinos requested review from PeterStaar-IBM and dolfim-ibm November 11, 2024 13:16

cau-git reviewed Nov 11, 2024

docling/models/easyocr_model.py (outdated, resolved)

nikos-livathinos added 2 commits November 11, 2024 17:39

fix: Move common OCR code in the BaseOcrModel class

7a0f160

Signed-off-by: Nikos Livathinos

Merge branch 'main' into force_ocr

088ce5f

cau-git approved these changes Nov 11, 2024

nikos-livathinos requested a review from cau-git November 11, 2024 21:20

nikos-livathinos merged commit c6b3763 into main Nov 12, 2024
7 checks passed

nikos-livathinos deleted the force_ocr branch November 12, 2024 08:46