Why don't we use PyMuPDF or pdfium for data extraction? OCR takes a lot of time #296

Closed Answered by cau-git
suryadev777 asked this question in Q&A
@suryadev777 By default, we apply OCR only when we detect a bitmap resource embedded in the PDF. This is to accommodate scanned PDFs.

For the bulk of programmatic PDFs, we do not use OCR. We implemented our own PDF parser (https://github.com/DS4SD/docling-parse) for text element extraction, which is an open alternative to pymupdf and pypdfium. We also offer a pypdfium-based PDF parser backend as an alternative choice in docling, see here.
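
To make the rule above concrete, here is a minimal sketch of the decision as described: OCR runs only for pages where an embedded bitmap was detected, and everything else goes through the programmatic parser. This is not docling's actual code. The `Page` class, the `extract_text` function, and the `fake_ocr` stand-in are all hypothetical names made up for illustration, and the real pipeline may combine OCR and parsed text in ways this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Page:
    """Hypothetical page model (an assumption, not docling's data structure)."""
    text_cells: List[str] = field(default_factory=list)  # text found by the PDF parser
    has_bitmap: bool = False  # True if an embedded bitmap resource was detected


def extract_text(page: Page, ocr: Callable[[Page], str]) -> Tuple[str, str]:
    """Return (method, text): OCR only when a bitmap is present, parser otherwise."""
    if page.has_bitmap:
        # Scanned content: the text only exists as pixels, so OCR is needed.
        return "ocr", ocr(page)
    # Programmatic PDF: text is already encoded in the file, no OCR required.
    return "parser", " ".join(page.text_cells)


def fake_ocr(page: Page) -> str:
    """Stand-in for a real OCR engine (hypothetical)."""
    return "<text recognized from bitmap>"


programmatic = Page(text_cells=["Hello", "world"])
scanned = Page(has_bitmap=True)

print(extract_text(programmatic, fake_ocr))
print(extract_text(scanned, fake_ocr))
```

If OCR is slow on your documents, it is worth checking whether they contain embedded bitmaps, since under this rule that is what triggers the OCR path.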

Replies: 1 comment 3 replies

Answer selected by cau-git