lib.run_ocr_on_images(params) - FAILURE due to lib.add_files(params, get_images=True) extracting and writing .emf images to lib image folder path (and associated db 'content_type' record) #1108

wissamharoun · 2024-11-26T21:40:49Z

After a successful library.add_files(params, get_images=True) ingestion run, the library in the db is populated. In some cases, .emf image files are extracted from documents and saved in the library image file path, and their respective records in the table have the 'content_type' key set to 'image'.

Subsequently, invoking lib.run_ocr_on_images() to process those extracted images calls Parser.ocr_images_in_library().
Parser.ocr_images_in_library() relies on the 'content_type' key with value 'image' to build the workload that is passed to
output = ImageParser(params).process_ocr(image_path, img_name, preserve_spacing=False)

As a result, any available .emf file is passed to tesseract, which does not support .emf files, and execution crashes.
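One possible mitigation is to filter the workload by file extension before anything reaches tesseract. The snippet below is only a sketch of that idea and is not llmware code: the helper name `split_ocr_candidates` and the `OCR_READABLE_EXTS` set are hypothetical, and the list of readable extensions is an assumption that depends on how Leptonica was built on the machine.

```python
import os

# Assumption: extensions a typical Tesseract/Leptonica build can read.
# The real set depends on the local Leptonica build, so adjust as needed.
OCR_READABLE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif", ".webp"}


def split_ocr_candidates(image_names, readable_exts=OCR_READABLE_EXTS):
    """Split file names into (readable, skipped) using only the extension."""
    readable, skipped = [], []
    for name in image_names:
        ext = os.path.splitext(name)[1].lower()
        if ext in readable_exts:
            readable.append(name)
        else:
            skipped.append(name)
    return readable, skipped


# Hypothetical file names; image23_1.emf is the one from the log in this issue.
readable, skipped = split_ocr_candidates(["image1_0.png", "image23_1.emf", "scan.JPG"])
print("to OCR:", readable)
print("skipped:", skipped)
```

With a check like this, .emf files would be left out of the OCR workload (or could be converted to a raster format first) rather than handed to tesseract.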

environment
macos 15.x
llmware v 0.3.8
db in use: sqlite

Processing image 82: image23_1.emf

DEBUG: OCR Error occurred: (1, 'Error in fopenReadStream: failed to open locally with tail \x01 for filename \x01 Leptonica Error in pixRead: image file not found: \x01 Image file \x01 cannot be read! Error during processing.')
DEBUG: Error type: <class 'pytesseract.pytesseract.TesseractError'>
DEBUG: Full traceback:
Traceback (most recent call last):
  File "/Users/user_xyz/project_directory/debugging_lib_run_ocr.py", line 28, in <module>
    lib.run_ocr_on_images(min_size=10, chunk_size=400, realtime_progress=True, add_to_library=False)
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/library.py", line 1246, in run_ocr_on_images
    output = Parser(library=self).ocr_images_in_library(add_to_library=add_to_library,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/parsers.py", line 4620, in ocr_images_in_library
    output = ImageParser(text_chunk_size=chunk_size).process_ocr(image_path, img_name, preserve_spacing=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/parsers.py", line 4709, in process_ocr
    text_out = pytesseract.image_to_string(os.path.join(dir_fp,fn))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 486, in image_to_string
    return {
           ^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 489, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
                           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 352, in run_and_get_output
    run_tesseract(**kwargs)
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 284, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error in fopenReadStream: failed to open locally with tail \x01 for filename \x01 Leptonica Error in pixRead: image file not found: \x01 Image file \x01 cannot be read! Error during processing.')
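Since the traceback shows a single unreadable file aborting the whole run, a per-image try/except is another option. This is a rough sketch, not the library's implementation: `ocr_all` and `fake_ocr` are hypothetical names, and `fake_ocr` merely stands in for a real OCR call such as pytesseract.image_to_string so the example runs without tesseract installed.

```python
def ocr_all(image_paths, ocr_fn):
    """Run ocr_fn on each path, collecting per-file errors instead of raising."""
    results, failures = {}, {}
    for path in image_paths:
        try:
            results[path] = ocr_fn(path)
        except Exception as exc:  # e.g. pytesseract's TesseractError
            failures[path] = str(exc)
    return results, failures


def fake_ocr(path):
    """Stand-in for a real OCR call; fails on .emf to mimic this issue."""
    if path.lower().endswith(".emf"):
        raise RuntimeError("image file cannot be read")
    return "text from " + path


results, failures = ocr_all(["image1_0.png", "image23_1.emf"], fake_ocr)
print("ok:", results)
print("failed:", failures)
```

The failures dictionary could then be logged or reported back to the caller while the remaining images are still processed.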


doberst · 2024-12-02T18:05:55Z

@wissamharoun - this is a great point - let me dig into it.
