Custom ingestion: generated HTML tables are too wordy and inefficient #2306
Comments
We did look into this a bit; here's some relevant research on HTML vs. plaintext vs. Markdown: The reason we're currently picking HTML for tables is that HTML can convey more complex table structure (like row spans and cell spans), and the research seems to support using HTML for richer information (a small illustration of this follows the comment). I think we could do a better job with the chunking, however. We currently have logic that will start a new chunk if it knows it had to break one, which means we do ultimately store the full table, but the broken chunks are awkward. We'd have similar brokenness with Markdown tables too, by the way. I can think of a few approaches:
It'd be good to know what other chunking frameworks are doing, like LangChain.
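As a minimal, made-up illustration of the "richer structure" point above (not code from the repo): a grouped header can be expressed directly with rowSpan/colSpan in HTML, whereas a flat CSV or simple Markdown rendering has to blank out or duplicate cells and the grouping is lost.

```python
# Made-up example of why HTML was chosen for tables: "Region" spans two rows
# and "Sales" spans two columns in the HTML version; the CSV version can only
# approximate this by duplicating or leaving blank cells.
html_with_spans = (
    "<table>"
    "<tr><th rowSpan=2>Region</th><th colSpan=2>Sales</th></tr>"
    "<tr><th>Q1</th><th>Q2</th></tr>"
    "<tr><td>North</td><td>120</td><td>135</td></tr>"
    "</table>"
)
csv_flattened = "Region,Sales,Sales\n,Q1,Q2\nNorth,120,135\n"  # span info is lost
```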
Thank you for your comments. I guess it ultimately depends on the nature of your data: if you have many medium-to-large, grid/Excel-like tables, you may consider changing to a less wordy structure.
Yeah, makes sense. This is the method that would need changing: DocumentAnalysisParser.table_to_html() in pdfparser.py. You could put a table_to_csv() in there and try that instead. If your tables are still getting split, then you can modify the splitting code. We could add it as an option to the main repo if you make it configurable via an environment variable.
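A rough sketch of what such a table_to_csv() could look like, assuming the Document Intelligence table object exposes row_count, column_count, and cells with row_index/column_index/content (the same fields table_to_html() relies on); the method itself is hypothetical and not part of the repo:

```python
import csv
import io

@classmethod
def table_to_csv(cls, table) -> str:
    # Hypothetical sibling of table_to_html(): render the table as CSV.
    # Build a dense row/column grid from the cells, then write it out.
    # Note that any rowSpan/colSpan structure is flattened away.
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    buffer = io.StringIO()
    csv.writer(buffer).writerows(grid)
    return buffer.getvalue()
```

If that works well for your data, the HTML/CSV choice could then be gated behind an environment variable (name to be decided), which is what would make it a candidate for the main repo.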
> DocumentAnalysisParser.table_to_html() in pdfparser.py
How does it do with complex PDF tables?
I have only tested with simple, grid-like tables so far (long, but simple).
I'm playing around with https://github.com/DS4SD/docling. I'll let you know my results.
It seems that after calling Document Intelligence, the Python code converts the JSON tables that are returned into HTML tables. However, this seems inefficient when the content is sent to the LLM: HTML is wordy, which not only costs more in tokens sent, but also causes many medium-sized tables to arrive "broken" at the LLM due to chunking.
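As a rough, made-up illustration of the overhead (not measured on real data), the same small table rendered both ways:

```python
# The same made-up 2x3 table as HTML (current behavior) vs. CSV; the HTML
# version spends most of its characters on per-cell tags rather than data.
html_table = (
    "<table>"
    "<tr><th>Product</th><th>Price</th><th>Qty</th></tr>"
    "<tr><td>Widget</td><td>9.99</td><td>3</td></tr>"
    "</table>"
)
csv_table = "Product,Price,Qty\nWidget,9.99,3\n"

print(len(html_table), len(csv_table))  # roughly a 3-4x difference in size
```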
Should the app be modified to use something more efficient, like CSV or Markdown, instead of HTML? The viewing experience will of course be affected.
Your thoughts? Thank you.