
Custom ingestion: generated HTML tables are too wordy and inefficient #2306

Open
evan2k opened this issue Jan 27, 2025 · 6 comments

evan2k commented Jan 27, 2025

It seems that after calling Document Intelligence, the Python code converts the JSON tables returned into HTML tables. However, this seems inefficient when sent to the LLM: HTML is verbose, which not only costs more in tokens sent, but also means many medium-sized tables reach the LLM "broken" due to chunking.
Should the app be modified to use something more efficient, like CSV or Markdown, instead of HTML? The viewing experience would of course be affected.
Your thoughts? Thank you.

pamelafox (Collaborator) commented

We did look into this a bit; here's some relevant research on HTML vs. plaintext vs. Markdown:
https://arxiv.org/abs/2411.02959
https://arxiv.org/abs/2406.08100

The reason we're currently picking HTML for tables is that HTML can convey more complex table structure (like row spans and column spans), and the research seems to support using HTML for richer information.
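
For example, a merged header cell is easy to express in HTML but has no standard CSV equivalent (the table below is an illustrative sample, not from the repo):

```html
<!-- "Material" spans two header rows and "Amount" spans two columns;
     CSV would have to duplicate those values or drop the relationship. -->
<table>
  <tr><th rowspan="2">Material</th><th colspan="2">Amount</th></tr>
  <tr><th>Q1</th><th>Q2</th></tr>
  <tr><td>Steel</td><td>12</td><td>15</td></tr>
</table>
```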

I think we could do a better job with the chunking, however. We currently have logic that starts a new chunk when it knows it had to break one, which means we do ultimately store the full table, but the broken chunks are awkward. We'd have similar brokenness with Markdown tables too, by the way.

I can think of a few approaches:

  • Allow chunks with tables to exceed the token limit. This increases costs and risks exceeding the context window, depending on how far it exceeds the limit.
  • Implement smarter table breaking that only cuts after a closing </tr>, with a note that says "table has been truncated" (see the sketch after this list).
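
A minimal sketch of the second option; the function name, the character budget, and the truncation marker are all illustrative, not existing repo code:

```python
MAX_TABLE_CHARS = 1000  # stand-in for the real chunk size limit

def split_table_at_row_boundary(table_html: str) -> tuple[str, str]:
    """Return (first_part, remainder), cutting an HTML table only after a </tr>."""
    if len(table_html) <= MAX_TABLE_CHARS:
        return table_html, ""
    cut = table_html.rfind("</tr>", 0, MAX_TABLE_CHARS)
    if cut == -1:
        # Not even one complete row fits in the budget; keep the table whole.
        return table_html, ""
    cut += len("</tr>")
    first = table_html[:cut] + "</table><!-- table has been truncated -->"
    remainder = "<table>" + table_html[cut:]
    return first, remainder
```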

It'd be good to know what other chunking frameworks are doing, like LangChain.

evan2k commented Jan 28, 2025

Thank you for your comments. I guess ultimately it depends on the nature of your data: if you have a lot of medium-to-large, grid/Excel-like tabular data, you may want to consider changing to a less wordy structure.
I was experimenting with a table of 20-30 rows that listed the types of materials used in a product. When asked to list all materials used for the product, the LLM listed only the materials contained in one chunk and ignored the rest, though it did state that there might be other materials not listed.

pamelafox (Collaborator) commented

Yeah, makes sense. This is the method that would need changing:

`DocumentAnalysisParser.table_to_html()` in `pdfparser.py`

You could put a `table_to_csv()` in there and try that instead.
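
A rough sketch of what that could look like, assuming the same DocumentTable shape that table_to_html already reads (row_count, column_count, and cells with row_index, column_index, and content); note that rowspan/colspan structure gets flattened, which is the trade-off discussed above:

```python
import csv
import io

def table_to_csv(table) -> str:
    """Sketch: render a Document Intelligence DocumentTable as CSV."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        # Spanning cells land once at their top-left position; the covered
        # positions stay empty, so merged-cell structure is lost.
        grid[cell.row_index][cell.column_index] = (cell.content or "").replace("\n", " ")
    buffer = io.StringIO()
    csv.writer(buffer).writerows(grid)
    return buffer.getvalue()
```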

If your tables are still getting split, then you can modify the splitting code; look for `last_figure_start = section_text.rfind("<figure")` in `textsplitter.py`.

We could add it as an option to the main repo if you make it configurable via an environment variable.
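
The toggle itself could be as simple as the following; USE_CSV_TABLES is a hypothetical variable name, not an existing setting in the repo:

```python
import os

# Hypothetical setting; defaults to the current HTML behavior.
USE_CSV_TABLES = os.getenv("USE_CSV_TABLES", "false").lower() == "true"

def render_table(table) -> str:
    # table_to_csv is the sketch above; table_to_html is the existing method.
    return table_to_csv(table) if USE_CSV_TABLES else table_to_html(table)
```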

ms-johnalex (Contributor) commented

> You could put a `table_to_csv()` in there and try that instead.

How does it do with complex PDF tables?

evan2k commented Jan 28, 2025

I've only tested with long but otherwise simple, grid-like tables so far.

ms-johnalex (Contributor) commented

I'm playing around with https://github.com/DS4SD/docling. I'll let you know my results.
