Library.add_files(params, max_chunk_size=n) often creates records in the db whose chunk size vastly exceeds n, often representing an entire document page of text #1107

Open
wissamharoun opened this issue Nov 26, 2024 · 1 comment


wissamharoun commented Nov 26, 2024

Library.add_files(params, max_chunk_size=n) often creates records in the db whose chunk size vastly exceeds n, often representing an entire document page of text.

The behavior is exactly as described in the title.
It appears to be mostly associated with parsing PDF documents that have entire pages consisting of a scanned image.
Are these oversized records included in embedding? If so, that seems problematic, right?

macos 15.x
llmware v 0.3.8
active_db: sqlite

Contributor

doberst commented Dec 2, 2024

@wissamharoun - thanks for this detailed feedback. Yes, I can confirm that there are scenarios in which the parser creates a text chunk larger than the requested max_chunk_size. I would encourage you to look at this example (if you have not already) - pdf_parser_configs ... The most common situations involve an embedded scanned image or a table, where it is difficult to apply a hard cut-off at a specific character limit. Depending on your use case, you may have to build some custom limit handling or safeguards in the downstream processing. If this is causing problems for you, we can make enhancements to llmware, and I am happy to work with you on it. Just let me know what specific challenges it is creating in your use case.
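For anyone reading later, here is a minimal sketch of the kind of downstream safeguard mentioned above. It is plain Python and does not call any llmware API. The helper names `split_oversized` and `enforce_max_chunk_size` are hypothetical, and the `"text"` key is an assumption about how the chunk text is stored in each record, so adjust it to whatever your records actually use. It treats the limit as a character count and re-splits any chunk that exceeds it, preferring whitespace boundaries and falling back to a hard cut.

```python
from typing import Dict, List


def split_oversized(text: str, max_chars: int) -> List[str]:
    """Split text into pieces of at most max_chars characters.

    Prefers to break at the last space inside the window; falls back to a
    hard cut when no space is available (e.g. unbroken OCR output).
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    pieces = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        # Last space at an index <= max_chars, so the piece fits the limit.
        cut = remaining.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no usable space: hard cut
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        pieces.append(remaining)
    return pieces


def enforce_max_chunk_size(
    records: List[Dict], max_chars: int, text_key: str = "text"
) -> List[Dict]:
    """Return new records in which no chunk exceeds max_chars.

    text_key is an assumed field name, not a confirmed llmware schema.
    Other fields are copied unchanged onto every piece.
    """
    out = []
    for rec in records:
        # Keep records with empty text as a single unchanged entry.
        pieces = split_oversized(rec.get(text_key, ""), max_chars) or [""]
        for piece in pieces:
            new_rec = dict(rec)
            new_rec[text_key] = piece
            out.append(new_rec)
    return out
```

Running the chunks through something like `enforce_max_chunk_size(records, n)` before embedding keeps every input at or under the limit, at the cost of occasionally cutting a table or scanned page mid-content.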
