Library.run_ocr_on_images(add_to_library=True) populates 'text_search' only, leaving 'text_block' empty, whereas Query.query (semantic type) retrieves data from 'text_block' #1123

Open
wissamharoun opened this issue Dec 2, 2024 · 1 comment

Comments

@wissamharoun

environment
llmware v0.3.8
macos 15
active db: sqlite
vector db: chromadb
To illustrate the issue, I use the example file slicing_and_dicing_office_docs.py and the Microsoft Investor Relations data. However, the issue was initially discovered on our private data, which is very OCR-heavy.

issue:
Run lib.add_files()

to ingest documents from which the C parser extracts images, pending downstream OCR.
Next, run OCR on the images extracted to the image directory with llmware's "convenience" method:
lib.run_ocr_on_images(add_to_library=True, other_params)
The result is a new collection of entries written to the db, one entry per image, each referencing the originating document by 'doc_ID' (and so forth), with block_IDs starting at 100,000 and incrementing. The text chunks extracted by Tesseract OCR populate only 'text_search'.
Then create a new embedding with llmware's
lib.install_new_embedding(params)
The chunks/sentences for embedding are retrieved and collated into batches from 'text_search'.
So far so good.

At query time:
Query.query(query="a query highly pertaining to the corpus", query_type="semantic", other_params)

returns results where 'text' is empty! A little digging reveals that while the query text is indeed compared against bona fide embedded chunks, the 'text' of the returned results is retrieved from 'text_block', which remains empty after OCR.

the following images show this clearly...

Screenshot 2024-12-01 at 19 02 43

Screenshot 2024-12-02 at 17 25 10

Screenshot 2024-12-01 at 19 01 45
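
For readers without the screenshots, here is a minimal sketch of the mismatch using only Python's sqlite3 module. It is a toy model, not llmware's actual schema or code: the table name `blocks`, the sample rows, and the helpers `find_ocr_gaps` and `text_with_fallback` are all assumptions made for illustration. Only the column names 'text_block' and 'text_search' and the 100,000 starting block_ID come from the report above.

```python
import sqlite3

# Toy stand-in for a library table. Column names follow the issue;
# the table name and the rows are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blocks (block_ID INTEGER, doc_ID INTEGER, "
    "text_block TEXT, text_search TEXT)"
)
rows = [
    # Parsed block: both columns filled.
    (0, 1, "Revenue grew this quarter.", "Revenue grew this quarter."),
    # OCR block: only text_search filled, as described in the issue.
    (100000, 1, "", "Text recovered from an embedded image."),
]
conn.executemany("INSERT INTO blocks VALUES (?, ?, ?, ?)", rows)


def find_ocr_gaps(connection):
    """Return block_IDs whose text_block is empty while text_search is not."""
    cursor = connection.execute(
        "SELECT block_ID FROM blocks "
        "WHERE COALESCE(text_block, '') = '' "
        "AND COALESCE(text_search, '') <> '' "
        "ORDER BY block_ID"
    )
    return [row[0] for row in cursor]


def text_with_fallback(connection, block_id):
    """Read text_block, falling back to text_search when it is empty."""
    row = connection.execute(
        "SELECT text_block, text_search FROM blocks WHERE block_ID = ?",
        (block_id,),
    ).fetchone()
    return row[0] or row[1]


gaps = find_ocr_gaps(conn)
print(gaps)
print(text_with_fallback(conn, 100000))
```

A check like `find_ocr_gaps` could be run against a real sqlite library file to confirm which rows are affected, provided the table and column names are adjusted to match what is actually stored there.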

@wissamharoun
Author

Well, that is odd.
It seems that although the llmware schema (in configs.py) defines the key for the parsed chunks as 'text_block', and in sqlite (and possibly postgres too) that key is indeed named 'text_block', one must use the key name 'text' to address, manipulate, or update it programmatically.
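
To make the naming mismatch concrete, here is a small sketch of how such a translation layer might behave. This is not llmware's implementation: `KEY_TO_COLUMN`, `update_block`, and the `blocks` table are hypothetical names, and rejecting the raw column name is a choice of this toy model only. The one thing taken from the observation above is that the programmatic key 'text' corresponds to the stored 'text_block'.

```python
import sqlite3

# Hypothetical mapping from the key used in code to the stored column name.
KEY_TO_COLUMN = {"text": "text_block"}


def update_block(connection, block_id, key, value):
    """Update one field, translating the programmatic key to its column."""
    column = KEY_TO_COLUMN.get(key)
    if column is None:
        raise KeyError(f"unsupported key: {key!r}")
    # The column name comes from the fixed mapping above, never from user input.
    connection.execute(
        f"UPDATE blocks SET {column} = ? WHERE block_ID = ?",
        (value, block_id),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blocks (block_ID INTEGER, text_block TEXT, text_search TEXT)"
)
conn.execute("INSERT INTO blocks VALUES (100000, '', 'OCR text')")

# Addressing the field as 'text' writes to the 'text_block' column.
update_block(conn, 100000, "text", "OCR text")
stored = conn.execute(
    "SELECT text_block FROM blocks WHERE block_ID = 100000"
).fetchone()[0]
print(stored)
```

In this toy version, passing 'text_block' as the key raises a KeyError, which mirrors why the stored column name and the programmatic key are easy to confuse.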
