Long PDFToTextOCRConverter conversion times #4232

cwfparsonson · 2023-02-20T10:30:11Z

Describe the bug
PDFToTextOCRConverter.convert() takes a long time even on small PDFs with only a few pages (see example below).

Is there any way to speed this up? For instance, could each page be converted in parallel?

Additional context
Here is an example scanned PDF. It is only a few pages long, entirely black and white, and contains nothing but scanned text, so I would not have expected it to be this slow to process: https://drive.google.com/file/d/1RvW0cPS1gIG9ZuafgocOfAc05kmoQtYu/view?usp=sharing

To Reproduce

from haystack.nodes import PDFToTextOCRConverter
import time

# Get path to scanned PDF
path_to_file = 'US3864478A_Original_document_20230220002104.pdf'

# Init PDF to text converter
converter = PDFToTextOCRConverter(remove_numeric_tables=False, valid_languages=['eng'])

# Convert PDF to text
start_t = time.time()
docs = converter.convert(file_path=path_to_file, meta=None)
print(f'Time to convert PDF to text: {time.time() - start_t:.3f} s')

Output:

tesseract 4.1.1
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 1.0.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
WARNING:haystack.nodes.file_converter.image:The language for image is not one of ['eng']. The file may not have been decoded in the correct text format.
WARNING:haystack.nodes.file_converter.image:The language for image is not one of ['eng']. The file may not have been decoded in the correct text format.
WARNING:haystack.nodes.file_converter.image:The language for image is not one of ['eng']. The file may not have been decoded in the correct text format.
WARNING:haystack.nodes.file_converter.image:The language for image is not one of ['eng']. The file may not have been decoded in the correct text format.
Time to convert PDF to text: 43.943 s

FAQ Check

[Yes] Have you had a look at our new FAQ page?

System:

OS: CentOS Stream 8
GPU/CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Haystack version (commit or version number): 1.13.2

cwfparsonson (Author) · 2023-02-23T08:11:35Z

Is it possible to run the OCR process in parallel for each page? That is, could I first split the PDF into pages and then call converter.convert() on each page in a separate process?

1 reply

bilgeyucel Feb 23, 2023
Maintainer

Hey @cwfparsonson, you're right. PDFToTextOCRConverter takes some time to convert PDFs because it uses the pytesseract library behind the scenes to perform character recognition on scanned files like yours. I am aware that it's not always an option, but I suggest using PDFToTextConverter wherever possible to convert PDFs.
For parallelization, I opened up an issue and we'll check what we can do about it. Issue: #4257
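
Until something along these lines is available in Haystack itself, a rough workaround is to rasterize the PDF into page images and OCR the pages concurrently outside the converter. The sketch below is not Haystack's implementation. It assumes the third-party pdf2image and pytesseract packages (plus the poppler and tesseract binaries) are installed, and the helper names ocr_page, ocr_pages_parallel, and pdf_to_text_parallel are made up for illustration. Threads are used because pytesseract launches the tesseract executable as a separate process for each call, so the pages can be recognized concurrently without a process pool.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(image, lang="eng"):
    """OCR one page image. Hypothetical helper, not part of Haystack."""
    # Imported lazily so the module loads even without pytesseract installed.
    import pytesseract  # third-party wrapper around the tesseract CLI

    return pytesseract.image_to_string(image, lang=lang)


def ocr_pages_parallel(pages, ocr_fn=ocr_page, max_workers=4):
    """Apply ocr_fn to every page concurrently, preserving page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input pages.
        return list(pool.map(ocr_fn, pages))


def pdf_to_text_parallel(pdf_path, max_workers=4):
    """Rasterize a PDF into page images, then OCR the pages concurrently."""
    from pdf2image import convert_from_path  # third-party, requires poppler

    pages = convert_from_path(pdf_path)  # list of PIL images, one per page
    return ocr_pages_parallel(pages, max_workers=max_workers)
```

Calling pdf_to_text_parallel('US3864478A_Original_document_20230220002104.pdf') would return one string per page, which you could then wrap in Haystack Document objects yourself. How much this helps depends on the number of CPU cores and on how many threads each tesseract process already uses internally, so it is worth timing both approaches on your own files.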
