Why don't we use PyMuPDF or pdfium for data extraction? OCR takes a lot of time #296
For extracting data from a PDF, why don't we use PyMuPDF or pdfium? Is there any drawback to them?
Replies: 1 comment 3 replies
@suryadev777 By default, we apply OCR only when we detect a bitmap resource embedded in the PDF. This is to accommodate scanned PDFs.
For the bulk of programmatic PDFs, we do not use OCR. We implemented our own PDF parser (https://github.com/DS4SD/docling-parse) for text element extraction, which is an open alternative to pymupdf and pypdfium. We also offer a pypdfium-based PDF parser backend as an alternative choice in docling, see here.
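
To make the default behaviour described above concrete, here is a minimal sketch of the decision in plain Python. It is not docling's actual code: the `Page` class, `needs_ocr`, `extract_text`, and the `ocr` callable are all hypothetical names used only for illustration. It assumes the parser has already produced the text cells and a flag saying whether a bitmap resource was found on the page.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for a parsed PDF page (not a docling type)."""
    text_cells: List[str] = field(default_factory=list)  # text found by the PDF parser
    has_bitmap: bool = False  # True if an embedded bitmap resource was detected


def needs_ocr(page: Page) -> bool:
    # Assumed rule, taken from the reply: OCR runs only when a bitmap is present.
    return page.has_bitmap


def extract_text(page: Page, ocr: Callable[[Page], List[str]]) -> List[str]:
    # Programmatic text always comes from the parser; OCR output is added
    # only for pages that contain a bitmap (for example, scanned pages).
    texts = list(page.text_cells)
    if needs_ocr(page):
        texts.extend(ocr(page))
    return texts
```

With this rule, a purely programmatic page never pays the OCR cost, while a scanned page (bitmap only, no text cells) gets all of its text from the `ocr` callable.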