environment
llmware v0.3.8
macOS 15
active db: sqlite
vector db: chromadb
For illustration, the issue is reproduced here with the example file slicing_and_dicing_office_docs.py and the Microsoft Investor Relations data. It was first discovered on our private data, which is very OCR heavy.
issue: run lib.add_files()
to ingest documents from which the C parser will extract images, pending downstream OCR with lib.run_ocr_on_images(add_to_library=True)
next, perform the OCR with llmware's "convenience" method on the images extracted to the image directory: lib.run_ocr_on_images(add_to_library=True, other_params)
The result is a new collection written to the db, one entry per image, each referencing its originating doc by 'doc_ID' (and so forth), with block_IDs starting at 100,000 and incrementing. The text chunks extracted by tesseract OCR populate only 'text_search'.
next, create a new embedding with llmware's lib.install_new_embedding(params)
The chunks/sentences for embedding are retrieved from 'text_search' and collated into batches.
so far so good
at query time, Query.query(query="a query highly pertaining to the corpus", query_type="semantic", other_params)
returns results where 'text' is empty! A little digging reveals that while the query text is indeed compared against bona fide embedded chunks, the 'text' in the returned results is retrieved from 'text_block', which remains empty after OCR.
the following images show this clearly...
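To make the mismatch concrete, here is a minimal sketch that simulates it with plain sqlite3, outside of llmware. The table name `blocks`, the reduced column set, the sample rows, and the helper `fetch_text` are all hypothetical; only the column names 'text_block' and 'text_search' and the 100,000 starting block_ID come from the description above. It is an illustration of the symptom and of one possible read-side fallback, not llmware's actual retrieval code.

```python
import sqlite3

# Hypothetical, simplified stand-in for a library block table. Only the two
# text columns discussed in this issue are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blocks (block_ID INTEGER, doc_ID INTEGER, "
    "text_block TEXT, text_search TEXT)"
)
rows = [
    # A block produced by the regular parser: both columns populated.
    (0, 1, "parsed text from the C parser", "parsed text from the C parser"),
    # A block produced by OCR: only text_search populated.
    (100000, 1, "", "text recovered by tesseract"),
]
conn.executemany("INSERT INTO blocks VALUES (?, ?, ?, ?)", rows)


def fetch_text(conn, block_id):
    """Return text_block, falling back to text_search when it is empty or NULL."""
    row = conn.execute(
        "SELECT COALESCE(NULLIF(text_block, ''), text_search) "
        "FROM blocks WHERE block_ID = ?",
        (block_id,),
    ).fetchone()
    return row[0] if row else None


# Reading text_block directly reproduces the empty result for the OCR block.
naive = conn.execute(
    "SELECT text_block FROM blocks WHERE block_ID = 100000"
).fetchone()[0]
```

Here `naive` is the empty string for the OCR block, while `fetch_text(conn, 100000)` returns the OCR text, which is the behaviour one would expect from the query.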
well, that is odd.
It seems that although the llmware schema (in configs.py) defines the key for the parsed chunks as 'text_block', and in sqlite (and possibly postgres also) that key is indeed named 'text_block', one must use the key name 'text' to address, manipulate, or update it programmatically.
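Until the OCR path writes the text under the right key, a client-side stopgap could look like the sketch below. It works on plain dicts and makes one loud assumption: that each returned result carries a 'text_search' entry alongside 'text'. That has not been verified against llmware here, and `backfill_text` is a hypothetical helper, not part of the library.

```python
def backfill_text(results):
    """Return copies of the results with empty 'text' filled from 'text_search'.

    Assumes each result is a dict that may contain both keys; results that
    already have text, or that have no 'text_search', are left unchanged.
    """
    patched = []
    for result in results:
        result = dict(result)  # shallow copy so the caller's dicts are not mutated
        if not result.get("text") and result.get("text_search"):
            result["text"] = result["text_search"]
        patched.append(result)
    return patched


# Illustrative data only; real results would come from the semantic query.
sample = [
    {"block_ID": 0, "text": "parsed text", "text_search": "parsed text"},
    {"block_ID": 100000, "text": "", "text_search": "ocr text"},
]
fixed = backfill_text(sample)
```

This only patches what the caller sees; the stored 'text_block' value stays empty, so the proper fix still belongs in the OCR write path.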