lib.run_ocr_on_images(params) - FAILURE due to lib.add_files(params, get_images=True) extracting and writing .emf images to lib image folder path (and associated db 'content_type' record)
#1108 · Open · wissamharoun opened this issue Nov 26, 2024 · 1 comment
After a successful library.add_files(params, get_images=True) ingestion run, the library in the db is populated. In some cases .emf image files are extracted from documents and saved in the library image file path, and their respective records in the table have the 'content_type' key set to 'image'.
Subsequently, invoking lib.run_ocr_on_images() to process those extracted images calls Parser.ocr_images_in_library().
Parser.ocr_images_in_library() relies on the 'content_type' key with value 'image' to build the workload that is passed to output = ImageParser(params).process_ocr(image_path, img_name, preserve_spacing=False).
As a result, any .emf file present is passed to tesseract, which does not support .emf files, and execution crashes.
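One possible mitigation, sketched here and not taken from llmware itself, is to filter the OCR workload by file extension before anything reaches tesseract. The helper name `split_ocr_workload` and the `OCR_READABLE_EXTENSIONS` set are hypothetical, and the extension list is an assumption about what a typical tesseract/Leptonica build can read; it should be checked against the local install.

```python
import os

# Assumption: raster formats a typical Tesseract/Leptonica build can open.
# .emf (a Windows vector metafile format) is deliberately absent.
OCR_READABLE_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif", ".webp", ".pnm",
}


def split_ocr_workload(image_names):
    """Split file names into (readable, skipped) lists based on extension.

    Hypothetical helper: names whose extension is not in
    OCR_READABLE_EXTENSIONS (for example 'image23_1.emf') go to `skipped`
    so they are never handed to the OCR engine.
    """
    readable, skipped = [], []
    for name in image_names:
        ext = os.path.splitext(name)[1].lower()
        if ext in OCR_READABLE_EXTENSIONS:
            readable.append(name)
        else:
            skipped.append(name)
    return readable, skipped
```

Something like this could be applied to the list of image names built from the 'content_type' == 'image' records, with the skipped names logged so the .emf files remain visible to the user.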
environment
macos 15.x
llmware v 0.3.8
db in use: sqlite
Processing image 82: image23_1.emf
DEBUG: OCR Error occurred: (1, 'Error in fopenReadStream: failed to open locally with tail \x01 for filename \x01 Leptonica Error in pixRead: image file not found: \x01 Image file \x01 cannot be read! Error during processing.')
DEBUG: Error type: <class 'pytesseract.pytesseract.TesseractError'>
DEBUG: Full traceback:
Traceback (most recent call last):
File "/Users/user_xyz/project_directory/debugging_lib_run_ocr.py", line 28, in <module>
lib.run_ocr_on_images(min_size=10, chunk_size=400, realtime_progress=True, add_to_library=False)
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/library.py", line 1246, in run_ocr_on_images
output = Parser(library=self).ocr_images_in_library(add_to_library=add_to_library,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/parsers.py", line 4620, in ocr_images_in_library
output = ImageParser(text_chunk_size=chunk_size).process_ocr(image_path, img_name, preserve_spacing=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/parsers.py", line 4709, in process_ocr
text_out = pytesseract.image_to_string(os.path.join(dir_fp,fn))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 486, in image_to_string
return {
^
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 489, in <lambda>
Output.STRING: lambda: run_and_get_output(*args),
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 352, in run_and_get_output
run_tesseract(**kwargs)
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 284, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error in fopenReadStream: failed to open locally with tail \x01 for filename \x01 Leptonica Error in pixRead: image file not found: \x01 Image file \x01 cannot be read! Error during processing.')
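The traceback shows a single unreadable image aborting the whole run. As a rough sketch of an alternative (not how llmware currently behaves), each image could be processed inside its own try/except so that a failure is recorded and the loop continues. The function name `ocr_images_tolerantly` and the `ocr_fn` parameter are hypothetical; in real use `ocr_fn` would presumably be `pytesseract.image_to_string`, and the broad `except Exception` would be narrowed to the `pytesseract.pytesseract.TesseractError` seen above.

```python
import os


def ocr_images_tolerantly(image_dir, image_names, ocr_fn):
    """Run ocr_fn on each image, collecting failures instead of raising.

    Hypothetical helper. `ocr_fn` is any callable that takes a file path
    and returns text. Returns (results, failures), both dicts keyed by
    image name.
    """
    results = {}
    failures = {}
    for name in image_names:
        path = os.path.join(image_dir, name)
        try:
            results[name] = ocr_fn(path)
        except Exception as exc:  # assumption: narrow this in real code
            # Record the error and move on to the next image.
            failures[name] = repr(exc)
    return results, failures
```

With this shape, an .emf file would end up in `failures` alongside its error message, and the remaining images would still be processed.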