Library.add_files(params, max_chunk_size=n) often creates records in the db whose chunk size vastly exceeds n - often representing an entire document page of text
#1107
Open
wissamharoun opened this issue
Nov 26, 2024
· 1 comment
Simply as described.
This appears to be mostly associated with parsing PDF documents in which entire pages consist of a scanned image.
Are these types of records included in embedding? If so, that would be problematic, right?
macos 15.x
llmware v 0.3.8
active_db: sqlite
@wissamharoun - thanks for this detailed feedback, and yes, I confirm that there are scenarios in which the parser may create a text chunk larger than the requested max text chunk. I would encourage you to look at this example (if you have not already) - pdf_parser_configs ... The most common situations are with an embedded scanned image or a table where it is difficult to apply a hard cut-off at a specific character limit. Depending upon your use case, you may have to build some custom limit handling or safeguards in the downstream processing. Based on what problems you may be experiencing, we can make enhancements to llmware - and happy to work with you on it - just let me know what specific challenges it is creating in your use case.
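As one illustration of the "custom limit handling" mentioned above, here is a minimal sketch of a downstream safeguard that re-splits any oversized chunk before it is embedded. It is plain Python and does not call any llmware API; the function name `split_oversized_chunk` and the idea of applying it to text pulled from the library records are assumptions made for illustration, not part of llmware.

```python
def split_oversized_chunk(text: str, max_chunk_size: int) -> list[str]:
    """Split text into pieces of at most max_chunk_size characters.

    Hypothetical helper (not part of llmware). It prefers to cut at the
    last whitespace inside the limit and falls back to a hard cut when
    a run of text has no whitespace at all.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")

    pieces = []
    remaining = text.strip()
    while len(remaining) > max_chunk_size:
        # Last space, newline, or tab at or before the limit (-1 if none).
        cut = max(remaining.rfind(ch, 0, max_chunk_size + 1) for ch in " \n\t")
        if cut <= 0:
            cut = max_chunk_size  # no whitespace found: hard cut
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        pieces.append(remaining)
    return pieces


# Usage: a chunk that came back larger than the requested limit.
oversized = "word " * 50
safe_chunks = split_oversized_chunk(oversized, max_chunk_size=40)
```

Chunks already within the limit come back as a single-element list, so the helper can be applied to every record without first checking its length.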