Why don't we use PyMuPDF or pdfium for data extraction? OCR takes a lot of time #296

suryadev777 · 2024-11-11T10:01:58Z

For extracting data from a PDF, why don't we use PyMuPDF or pdfium? Are there any drawbacks to them?

cau-git · 2024-11-11T12:34:48Z
Maintainer

@suryadev777 By default, we apply OCR only when we detect a bitmap resource embedded in the PDF. This is to accommodate scanned PDFs.

For the bulk of programmatic PDFs, we do not use OCR. We implemented our own PDF parser (https://github.com/DS4SD/docling-parse) for text element extraction, which is an open alternative to pymupdf and pypdfium. We also offer a pypdfium-based PDF parser backend as an alternative choice in docling, see here.
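
To make the rule above concrete, here is a minimal sketch of the "OCR only when a bitmap is present" decision. It is not docling's actual code: `Rect`, `Page`, `needs_ocr`, and `plan_extraction` are hypothetical names, and the pages are toy data standing in for what a PDF backend would report.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Page:
    """Toy stand-in for what a PDF backend reports about one page."""
    number: int
    text_cells: list[str] = field(default_factory=list)
    bitmap_rects: list[Rect] = field(default_factory=list)


def needs_ocr(page: Page) -> bool:
    # OCR is considered only when the page embeds at least one bitmap.
    return len(page.bitmap_rects) > 0


def plan_extraction(pages: list[Page]) -> dict[int, str]:
    # Map each page number to the route it would take in this sketch.
    return {p.number: "with_ocr" if needs_ocr(p) else "parser_only" for p in pages}


pages = [
    Page(1, text_cells=["Introduction", "Some programmatic text."]),
    Page(2, bitmap_rects=[Rect(0, 0, 612, 792)]),  # e.g. a scanned page
]
plan = plan_extraction(pages)
print(plan)
```

In this sketch a purely programmatic page never reaches the OCR step, which is where the time savings for most PDFs would come from.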

3 replies

suryadev777 Nov 12, 2024
Author

Thank you for the response, but I have a few questions:

What is the reason for picking pypdfium as the default PDF parser?
After extracting a PDF object, how do we classify it as a paragraph or a heading? Is it simply based on a size threshold?
How do we detect whether it is a bitmap or regular text?

cau-git Nov 12, 2024
Maintainer

@suryadev777 I would advise you to read our technical report: https://arxiv.org/abs/2408.09869

To answer your points briefly:

pypdfium is not our default parser; docling-parse is. We offer pypdfium as an alternative choice.
After extraction, we run a full pipeline of AI models for layout detection, table structure recognition, and other features.
The PDF backends (docling-parse or pypdfium) have APIs that return the bitmap rectangles on a given PDF page.
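
As a rough illustration of the first and third points, the sketch below shows two interchangeable backends behind one interface, with docling-parse as the default choice, and a helper that uses the returned bitmap rectangles. Everything here is an assumption for illustration: `PageBackend`, `ToyParseBackend`, `ToyPdfiumBackend`, `make_backend`, and `bitmap_coverage` are hypothetical names, not docling's real classes or methods.

```python
from typing import Protocol

BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical layout


class PageBackend(Protocol):
    """Minimal interface both toy backends satisfy."""

    def bitmap_rects(self) -> list[BBox]: ...


class ToyParseBackend:
    """Stand-in for a docling-parse-style backend; not the real class."""

    def __init__(self, rects: list[BBox]) -> None:
        self._rects = list(rects)

    def bitmap_rects(self) -> list[BBox]:
        return list(self._rects)


class ToyPdfiumBackend(ToyParseBackend):
    """Stand-in for a pypdfium-style backend with the same interface."""


BACKENDS = {"docling-parse": ToyParseBackend, "pypdfium": ToyPdfiumBackend}


def make_backend(rects: list[BBox], name: str = "docling-parse") -> PageBackend:
    # The default mirrors the thread: docling-parse unless asked otherwise.
    return BACKENDS[name](rects)


def bitmap_coverage(backend: PageBackend, page_w: float, page_h: float) -> float:
    # Fraction of the page area covered by bitmaps.
    # Overlapping rectangles are counted twice, so the result is capped at 1.
    area = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in backend.bitmap_rects())
    return min(area / (page_w * page_h), 1.0)


backend = make_backend([(0.0, 0.0, 306.0, 792.0)])
coverage = bitmap_coverage(backend, page_w=612.0, page_h=792.0)
print(type(backend).__name__, coverage)
```

Because both toy backends expose the same method, the rest of a pipeline would not need to know which parser produced the rectangles.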

suryadev777 Nov 12, 2024
Author

Thank you
