Text missing or displaced in parsed table #621

pbonito · 2024-12-06T19:01:20Z

Bug

Text in table not present or displaced in parsed document

Steps to reproduce

Parse this pdf
parser_test.pdf

Docling version

Docling version: 2.8.1
Docling Core version: 2.5.1
Docling IBM Models version: 2.0.6
Docling Parse version: 2.1.2

Python version

Python 3.11.5

maxmnemonic (Collaborator) · 2024-12-09T15:06:51Z

(accidentally deleted previous comment 😮‍💨)
@pbonito, please try this conversion with the accurate setting for TableFormer. While not perfect, it produces a better result on your document.

In Docling CLI:
docling parser_test.pdf --to html --table-mode accurate

In Python:

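from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
...

pipeline_options = PdfPipelineOptions(do_table_structure=True)
# The mode is a field of the nested table_structure_options object,
# so it is set there rather than passed as table_structure_options itself.
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
doc_converter = (
    DocumentConverter(  # all of the below is optional, has internal defaults.
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                pipeline_cls=StandardPdfPipeline,
                backend=DoclingParseDocumentBackend,
            )
        },
    )
)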

This is the result (better than with the fast model):

However, such tables are outside the training distribution (hence the mistakes). We want to address this with additional training data in the future.

pbonito (Author) · 2024-12-10T13:39:22Z

@maxmnemonic I think the correct setting is
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
As you pointed out, the results improve, but one sentence is still missing. We can run more tests.
Can you share more about the strategy for improving results in the future? Are there any plans?
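
For further tests of this kind, a small check for missing sentences can help compare the fast and accurate modes. The following is only a sketch: the helper names `normalize` and `missing_sentences` and the sample strings are made up for illustration, and the exported text would in practice come from the converter (for example via `export_to_markdown()` on the converted document).

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse all whitespace so line breaks do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_sentences(expected: list[str], exported: str) -> list[str]:
    """Return the expected sentences that do not appear in the exported text."""
    haystack = normalize(exported)
    return [s for s in expected if normalize(s) not in haystack]


# In a real test, `exported` would come from Docling, e.g.
#   exported = doc_converter.convert("parser_test.pdf").document.export_to_markdown()
# Here a made-up string stands in for it (assumption, not real output).
exported = "| Clause | Text |\n| 1 | The supplier shall deliver\nthe goods. |"
expected = [
    "The supplier shall deliver the goods.",
    "Payment is due in 30 days.",
]

print(missing_sentences(expected, exported))
```

Running the same check on the output of each table mode gives a quick count of how many sentences each one drops.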

maxmnemonic (Collaborator) · 2024-12-12T08:52:57Z

@pbonito, sure, let me shed some light on how it works. TableFormer is the model we use in Docling for table structure recognition. It is a single-encoder, dual-decoder model trained to predict structural tags together with bounding boxes of the cell content. Docling then extracts the text inside each predicted bounding box and places it in the appropriate position in the structure. As we see here, those bounding boxes most likely fall short for tables like this one.
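
To make the failure mode concrete, here is a minimal, simplified sketch of filling cells from predicted bounding boxes. It is not Docling's actual matching code: the `Box`, `Word`, and `fill_cells` names are hypothetical, the coordinates are invented, and strict containment is assumed as the matching rule purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box: (x0, y0) is top-left, (x1, y1) is bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)


@dataclass(frozen=True)
class Word:
    text: str
    box: Box


def fill_cells(cell_boxes: dict[str, Box],
               words: list[Word]) -> tuple[dict[str, str], list[str]]:
    """Assign each word to the first cell box that fully contains it.

    Returns the text per cell and the words that no cell box captured.
    """
    cells: dict[str, list[str]] = {name: [] for name in cell_boxes}
    unassigned: list[str] = []
    for word in words:
        for name, box in cell_boxes.items():
            if box.contains(word.box):
                cells[name].append(word.text)
                break
        else:
            unassigned.append(word.text)
    return {name: " ".join(parts) for name, parts in cells.items()}, unassigned


# One cell with three lines of text (one word per line for brevity).
words = [
    Word("first", Box(10, 10, 40, 20)),
    Word("second", Box(10, 25, 50, 35)),
    Word("third", Box(10, 40, 40, 50)),
]

# Predicted box covers the whole cell content: all text is recovered.
full_text, full_missing = fill_cells({"r0c0": Box(5, 5, 100, 55)}, words)

# Predicted box is too short vertically: the last line falls outside it.
short_text, short_missing = fill_cells({"r0c0": Box(5, 5, 100, 38)}, words)

print(full_text, full_missing)
print(short_text, short_missing)
```

With the shorter box, the last line of the cell ends up in the unassigned list, which mirrors the kind of missing or displaced text reported in this issue.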

The thing is, we trained TableFormer on public datasets such as FinTabNet and PubTab1M (and some others), which come from scientific papers, public financial reports, etc. While the tables in these datasets vary enough for the model to generalize fairly well, they miss some table types, like the one we are looking at here, where each cell contains a lot of text. The model is not used to seeing so much "volume" in each cell, so the accuracy of the predicted "content bounding box" drops.

Our current strategy is to fine-tune the model on a dataset that has such large-text tables in abundance. In fact, our team is working on such a dataset as we speak. Once we have the dataset and have done the fine-tuning, we will push new model weights, which should hopefully improve the situation we see here.

Hope this helps!

itsainii · 2024-12-17T06:08:45Z

@maxmnemonic Is there an estimated timeline for when the new model weights will be pushed? I'm just curious about the expected update.

maxmnemonic (Collaborator) · 2024-12-18T09:24:52Z

@itsainii I don't want to give timelines at the moment, before we have tested and proven the improvements. But I hope it will be very early next year.
