lib.run_ocr_on_images(params) - FAILURE due to lib.add_files(params, get_images=True) extracting and writing .emf images to lib image folder path (and associated db 'content_type' record) #1108

Open
wissamharoun opened this issue Nov 26, 2024 · 1 comment

Comments

@wissamharoun

After a successful library.add_files(params, get_images=True) ingestion run, the library in the db is populated. In some cases, .emf image files are extracted from documents and saved in the library image folder path, and their respective records in the table set the 'content_type' key to 'image'.

Subsequently, invoking lib.run_ocr_on_images() to process those extracted images calls Parser.ocr_images_in_library().
Parser.ocr_images_in_library() relies on the 'content_type' key with value 'image' to build the workload that is passed to
output = ImageParser(params).process_ocr(image_path, img_name, preserve_spacing=False)

As a result, an extracted .emf file is passed to tesseract, which does not support .emf files, and execution crashes.
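
One possible mitigation, sketched below, is to split the OCR workload by file extension before anything reaches tesseract. This is not llmware code: `split_ocr_workload` and `OCR_READABLE_EXTENSIONS` are hypothetical names, and the extension list is an assumption based on the raster formats tesseract usually reads through Leptonica (the exact set depends on how Leptonica was built).

```python
from pathlib import Path

# Assumption: raster formats tesseract can typically read via Leptonica.
# .emf (a Windows vector metafile) is not among them.
OCR_READABLE_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif", ".webp", ".pnm",
}


def split_ocr_workload(image_names):
    """Split file names into (readable, skipped) lists by extension.

    Hypothetical helper: the comparison is case-insensitive, so
    'scan.JPG' counts as readable.
    """
    readable, skipped = [], []
    for name in image_names:
        if Path(name).suffix.lower() in OCR_READABLE_EXTENSIONS:
            readable.append(name)
        else:
            skipped.append(name)
    return readable, skipped


if __name__ == "__main__":
    readable, skipped = split_ocr_workload(["image23_1.emf", "image1_0.png"])
    print("to OCR:", readable)
    print("skipped:", skipped)
```

Only the `readable` list would then be handed to process_ocr, while the `skipped` list could be logged so the user can see which extracted images were not OCR'd.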

environment
macOS 15.x
llmware v 0.3.8
db in use: sqlite


Processing image 82: image23_1.emf

DEBUG: OCR Error occurred: (1, 'Error in fopenReadStream: failed to open locally with tail \x01 for filename \x01 Leptonica Error in pixRead: image file not found: \x01 Image file \x01 cannot be read! Error during processing.')
DEBUG: Error type: <class 'pytesseract.pytesseract.TesseractError'>
DEBUG: Full traceback:
Traceback (most recent call last):
  File "/Users/user_xyz/project_directory/debugging_lib_run_ocr.py", line 28, in <module>
    lib.run_ocr_on_images(min_size=10, chunk_size=400, realtime_progress=True, add_to_library=False)
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/library.py", line 1246, in run_ocr_on_images
    output = Parser(library=self).ocr_images_in_library(add_to_library=add_to_library,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/parsers.py", line 4620, in ocr_images_in_library
    output = ImageParser(text_chunk_size=chunk_size).process_ocr(image_path, img_name, preserve_spacing=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/parsers.py", line 4709, in process_ocr
    text_out = pytesseract.image_to_string(os.path.join(dir_fp,fn))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 486, in image_to_string
    return {
           ^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 489, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
                           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 352, in run_and_get_output
    run_tesseract(**kwargs)
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 284, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error in fopenReadStream: failed to open locally with tail \x01 for filename \x01 Leptonica Error in pixRead: image file not found: \x01 Image file \x01 cannot be read! Error during processing.')
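
Independently of any extension filtering, the traceback above suggests that a single unreadable file aborts the whole run. A minimal sketch of a per-image guard follows, assuming the OCR call is passed in as a function. `ocr_images_safely`, `ocr_fn` and `error_types` are hypothetical names for illustration, not part of llmware or pytesseract.

```python
def ocr_images_safely(image_paths, ocr_fn, error_types=(Exception,)):
    """Run ocr_fn on each path and collect failures instead of raising.

    Hypothetical helper: returns (results, failures), where results maps
    path -> extracted text and failures maps path -> error message.
    """
    results, failures = {}, {}
    for path in image_paths:
        try:
            results[path] = ocr_fn(path)
        except error_types as exc:
            # Record the problem and continue with the next image.
            failures[path] = str(exc)
    return results, failures


if __name__ == "__main__":
    def fake_ocr(path):
        # Stand-in for a real OCR call, used only to show the control flow.
        if path.endswith(".emf"):
            raise RuntimeError("image file cannot be read")
        return "text from " + path

    results, failures = ocr_images_safely(["a.png", "image23_1.emf"], fake_ocr)
    print(results)
    print(failures)
```

With pytesseract installed, the same helper could be called with `ocr_fn=pytesseract.image_to_string` and `error_types=(pytesseract.TesseractError,)`, so that image23_1.emf would end up in `failures` rather than stopping the run.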

@doberst
Contributor

doberst commented Dec 2, 2024

@wissamharoun - this is a great point - let me dig into it.
