
Custom ingestion: generated HTML tables are too wordy and inefficient #2306

Open
evan2k opened this issue Jan 27, 2025 · 6 comments

evan2k commented Jan 27, 2025

It seems that after calling Document Intelligence, the Python code converts the JSON tables returned into HTML tables. However, this seems inefficient when sent to the LLM: HTML is verbose, which not only costs more in tokens sent, but also means many medium-sized tables reach the LLM "broken" due to chunking.
Should the app be modified to use something more efficient, like CSV or Markdown, instead of HTML? The viewing experience would of course be affected.
Your thoughts? Thank you.

pamelafox (Collaborator) commented

We did look into this a bit; here's some relevant research on HTML vs. plaintext vs. Markdown:
https://arxiv.org/abs/2411.02959
https://arxiv.org/abs/2406.08100

The reason we're currently picking HTML for tables is that HTML can convey more complex table structure (like row spans and column spans), and the research seems to support using HTML for richer information.
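
For example, a merged header cell is easy to express in HTML but has no standard CSV equivalent (the table below is an illustrative sample, not from the repo):

```html
<!-- "Material" spans two header rows and "Amount" spans two columns;
     CSV would have to duplicate those values or drop the relationship. -->
<table>
  <tr><th rowspan="2">Material</th><th colspan="2">Amount</th></tr>
  <tr><th>Q1</th><th>Q2</th></tr>
  <tr><td>Steel</td><td>12</td><td>15</td></tr>
</table>
```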

I think we could do a better job with the chunking, however. We currently have logic that starts a new chunk when it knows it had to break one, which means we do ultimately store the full table, but the broken chunks are awkward. We'd have similar brokenness with Markdown tables too, by the way.

I can think of a few approaches:

  • Allow chunks with tables to exceed the token limit. This increases costs and risks exceeding the context window, depending on how far it exceeds the limit.
  • Implement smarter table breaking that only cuts after a closing </tr>, with a note that says "table has been truncated" (see the sketch after this list).
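
A minimal sketch of the second option; the function name, the character budget, and the truncation marker are all illustrative, not existing repo code:

```python
MAX_TABLE_CHARS = 1000  # stand-in for the real chunk size limit

def split_table_at_row_boundary(table_html: str) -> tuple[str, str]:
    """Return (first_part, remainder), cutting an HTML table only after a </tr>."""
    if len(table_html) <= MAX_TABLE_CHARS:
        return table_html, ""
    cut = table_html.rfind("</tr>", 0, MAX_TABLE_CHARS)
    if cut == -1:
        # Not even one complete row fits in the budget; keep the table whole.
        return table_html, ""
    cut += len("</tr>")
    first = table_html[:cut] + "</table><!-- table has been truncated -->"
    remainder = "<table>" + table_html[cut:]
    return first, remainder
```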

It'd be good to know what other chunking frameworks are doing, like LangChain.

evan2k commented Jan 28, 2025

Thank you for your comments. I guess ultimately it depends on the nature of your data: if you have a lot of medium-to-large, grid/Excel-like tabular data, you may want to consider changing to a less wordy structure.
I was experimenting with a table of 20-30 rows that listed the types of materials used in a product. When asked to list all materials used for the product, the LLM listed only the materials contained in one chunk and ignored the rest, though it did state that there might be other materials not listed.

pamelafox (Collaborator) commented

Yeah, makes sense. This is the method that would need changing:

`DocumentAnalysisParser.table_to_html()` in `pdfparser.py`

You could put a `table_to_csv()` in there and try that instead.
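
A rough sketch of what that could look like, assuming the same DocumentTable shape that table_to_html already reads (row_count, column_count, and cells with row_index, column_index, and content); note that rowspan/colspan structure gets flattened, which is the trade-off discussed above:

```python
import csv
import io

def table_to_csv(table) -> str:
    """Sketch: render a Document Intelligence DocumentTable as CSV."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        # Spanning cells land once at their top-left position; the covered
        # positions stay empty, so merged-cell structure is lost.
        grid[cell.row_index][cell.column_index] = (cell.content or "").replace("\n", " ")
    buffer = io.StringIO()
    csv.writer(buffer).writerows(grid)
    return buffer.getvalue()
```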

If your tables are still getting split, then you can modify the splitting code; look for `last_figure_start = section_text.rfind("<figure")` in `textsplitter.py`.

We could add it as an option to the main repo if you make it configurable via an environment variable.
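
The toggle itself could be as simple as the following; USE_CSV_TABLES is a hypothetical variable name, not an existing setting in the repo:

```python
import os

# Hypothetical setting; defaults to the current HTML behavior.
USE_CSV_TABLES = os.getenv("USE_CSV_TABLES", "false").lower() == "true"

def render_table(table) -> str:
    # table_to_csv is the sketch above; table_to_html is the existing method.
    return table_to_csv(table) if USE_CSV_TABLES else table_to_html(table)
```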

ms-johnalex (Contributor) commented

> You could put a `table_to_csv()` in there and try that instead.

How does it do with complex PDF tables?

evan2k commented Jan 28, 2025

I've only tested with long but otherwise simple, grid-like tables so far.

ms-johnalex (Contributor) commented

I'm playing around with https://github.com/DS4SD/docling. I'll let you know my results.
