Library.run_ocr_on_images(add_to_library=True) populates 'text_search' only, leaving 'text_block' empty. whereas Query.query (semantic type) retrieves data from 'text_block' #1123

wissamharoun · 2024-12-02T23:15:51Z

environment
llmware v0.3.8
macos 15
active db: sqlite
vector db: chromadb
To illustrate the issue I am using the example file slicing_and_dicing_office_docs.py and the Microsoft Investor Relations data. However, the issue was first discovered on our private data, which is very OCR-heavy.

issue:
run lib.add_files()

to ingest documents. The C parser extracts their images to the image directory, pending downstream OCR.
next, perform the OCR on the extracted images with llmware's "convenience" method,
lib.run_ocr_on_images(add_to_library=True, other_params)
The result is a new collection of entries written to the db, one entry per image, each referencing its originating doc by 'doc_ID' (and so forth), with block_IDs starting at 100,000 and incrementing. The text chunks extracted by tesseract OCR populate only 'text_search'.
then create a new embedding with llmware's
lib.install_new_embedding(params)
The chunks/sentences for embedding are retrieved from 'text_search' and collated into batches.
so far so good
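
To make the resulting state concrete, here is a minimal sketch using only Python's built-in sqlite3 module. The table name `blocks`, the reduced column set and the sample rows are assumptions made for illustration; this is not llmware's real schema or data. It only models the situation described above: parsed blocks carry text in both fields, while OCR blocks (block_ID 100,000 and up) carry text in 'text_search' only.

```python
import sqlite3

# Hypothetical, simplified stand-in for a library block table (not llmware's schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blocks ("
    "doc_ID INTEGER, block_ID INTEGER, text_block TEXT, text_search TEXT)"
)

rows = [
    # A block produced by the parser: both fields populated.
    (1, 0, "Revenue grew year over year.", "Revenue grew year over year."),
    # Blocks produced by OCR: only text_search populated.
    (1, 100000, "", "Text recognized from an embedded chart image."),
    (1, 100001, "", "Text recognized from a scanned table."),
]
conn.executemany("INSERT INTO blocks VALUES (?, ?, ?, ?)", rows)

# Find blocks that have searchable text but nothing in text_block.
affected = conn.execute(
    "SELECT block_ID FROM blocks "
    "WHERE COALESCE(text_block, '') = '' AND COALESCE(text_search, '') <> '' "
    "ORDER BY block_ID"
).fetchall()
affected_ids = [row[0] for row in affected]
print(affected_ids)
```

A check along these lines against the real database would show how many OCR entries are affected before any query is run.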

at Query time -
Query.query(query="a query highly pertaining to the corpus", query_type="semantic", other_params)

would return results where 'text' is empty! A little digging reveals that the query text is indeed being compared against the embedded chunks, which are bona fide, but the 'text' of each returned result is retrieved from 'text_block', which remains empty after OCR.
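
As a possible stopgap on the caller's side, the returned results could be patched after the query. The sketch below is plain Python and does not call llmware. The helper name `fill_empty_text` is hypothetical, and it assumes each result is a dict that also carries a 'text_search' entry; if the real results do not include that key, the text would have to be looked up by doc_ID and block_ID instead.

```python
def fill_empty_text(results):
    """Return copies of the result dicts, filling an empty 'text' from 'text_search'."""
    patched = []
    for result in results:
        result = dict(result)  # copy, so the caller's dicts are left untouched
        if not (result.get("text") or "").strip():
            # Assumption: the OCR text is available under 'text_search'.
            result["text"] = result.get("text_search") or ""
        patched.append(result)
    return patched


# Hypothetical sample results: one parsed block, one OCR block.
sample = [
    {"block_ID": 0, "text": "Revenue grew.", "text_search": "Revenue grew."},
    {"block_ID": 100000, "text": "", "text_search": "Text recognized from an image."},
]
patched = fill_empty_text(sample)
print([r["text"] for r in patched])
```

This only hides the symptom; the OCR entries in the db still have an empty 'text_block'.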

the following images show this clearly...


wissamharoun · 2024-12-12T16:36:15Z

Well, that is odd.
It seems that even though the llmware schema (in configs.py) defines the key for the parsed chunks as 'text_block', and in sqlite (and possibly postgres too) the column is indeed named 'text_block', to address/manipulate/update that key programmatically one must use the key name 'text'.
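
One way such a mismatch can arise is an alias layer that maps the key callers use onto a differently named column. The sketch below illustrates that idea with the built-in sqlite3 module; the `KEY_TO_COLUMN` table, the `update_block` helper and the `blocks` table are all hypothetical, and this is a guess at the general pattern, not a description of llmware's internals.

```python
import sqlite3

# Hypothetical alias table: logical key used by callers -> physical column name.
KEY_TO_COLUMN = {"text": "text_block", "text_search": "text_search"}


def update_block(conn, block_id, key, value):
    """Update one field of a block, addressing it by its logical key."""
    if key not in KEY_TO_COLUMN:
        raise KeyError(f"unknown block key: {key!r}")
    column = KEY_TO_COLUMN[key]  # taken from the whitelist, so safe to interpolate
    cursor = conn.execute(
        f"UPDATE blocks SET {column} = ? WHERE block_ID = ?", (value, block_id)
    )
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blocks (block_ID INTEGER PRIMARY KEY, text_block TEXT, text_search TEXT)"
)
conn.execute("INSERT INTO blocks VALUES (100000, '', 'ocr text from image')")

# The caller says 'text'; the column that changes is 'text_block'.
updated = update_block(conn, 100000, "text", "ocr text from image")
stored = conn.execute(
    "SELECT text_block FROM blocks WHERE block_ID = 100000"
).fetchone()[0]
print(updated, stored)
```

In this sketch, passing 'text_block' as the key raises a KeyError, which mirrors the observation that only 'text' works programmatically.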
