Expose and add structured output for forms and key-value regions #301

cau-git · 2024-11-11T12:22:05Z

cau-git
Nov 11, 2024
Maintainer

Docling creates well-structured output for many common item types with a plain-text representation (paragraphs, headings, captions, lists, formulas) and for tables, including their cell structure. We want to extend Docling to also create well-structured output for pictures, forms, and key-value regions. Supporting these content types requires closing a few gaps.

Current situation

The layout model in docling-ibm-models is trained to recognize forms and key-value regions, but both classes are currently ignored in downstream processing. As a consequence, the items contained in a form or a key-value region are treated as individual items without that context. In key-value regions, the key and value texts are therefore represented as plain text items, with no connection between key and value and without correct ordering. In forms, elements such as checkboxes and groups of choices are likewise represented as plain text items, without grouping or a useful order.

For content detected as a picture item, the text inside is ignored by default (even if OCR detected text there). We have examples which outline how to build picture-enrichment models, but they are not used by default and so far do not exploit the known text content inside picture items (see here).

Planned extensions

This topic will require work in several steps to prepare Docling for the additional content types.

Collect form and key-value material for proper testing
Evaluate the accuracy of form and key-value region detection in the layout model (in docling-ibm-models), understand where it causes confusion (especially with table detections)
Improve the post-processing of the layout model to:
- Accept overlapping cluster proposals in the case of forms, key-value regions and pictures
- Tune the confidence thresholds for the new classes
Create meaningful data structures for forms, key-value regions and picture content
- Extend DoclingDocument with new data models for respective item types
- Adapt code in the page assembler to create the new data structures (see here)
Update the reading-order and other post-processing models
Update exporter methods (e.g. to markdown, doctags, HTML)
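
To make the layout post-processing step above more concrete, here is a minimal sketch of per-class confidence thresholds combined with tolerated overlaps for container-like classes. It is not the existing docling-ibm-models code: the `Cluster` class, the label strings, the threshold values and the `filter_clusters` helper are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical layout proposal: label, score, and (left, top, right, bottom) box."""
    label: str
    confidence: float
    bbox: tuple[float, float, float, float]


# Assumed label names and threshold values, for illustration only.
CONTAINER_LABELS = {"form", "key_value_region", "picture"}
THRESHOLDS = {"form": 0.45, "key_value_region": 0.45, "picture": 0.5}
DEFAULT_THRESHOLD = 0.5


def covered_fraction(inner, outer) -> float:
    """Fraction of the area of `inner` that lies inside `outer`."""
    left, top = max(inner[0], outer[0]), max(inner[1], outer[1])
    right, bottom = min(inner[2], outer[2]), min(inner[3], outer[3])
    intersection = max(0.0, right - left) * max(0.0, bottom - top)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return intersection / area if area > 0 else 0.0


def filter_clusters(clusters, overlap_limit=0.8):
    """Keep confident clusters; drop overlapping ones unless a container is involved."""
    kept = []
    for cand in sorted(clusters, key=lambda c: c.confidence, reverse=True):
        # Class-specific threshold, falling back to a default for other labels.
        if cand.confidence < THRESHOLDS.get(cand.label, DEFAULT_THRESHOLD):
            continue
        clash = False
        for other in kept:
            is_container_pair = cand.label != other.label and (
                cand.label in CONTAINER_LABELS or other.label in CONTAINER_LABELS
            )
            if is_container_pair:
                continue  # e.g. a text item inside a form may overlap the form
            if covered_fraction(cand.bbox, other.bbox) > overlap_limit:
                clash = True
                break
        if not clash:
            kept.append(cand)
    return kept
```

With this rule, a text cluster lying inside a form box survives together with the form, while a lower-confidence duplicate of the same text box is still suppressed. Whether a form overlapping a table should be kept or resolved is exactly the kind of confusion the evaluation step would need to inform.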

Additionally, to ensure high-quality results for difficult samples, we will need to invest in the development (or third-party integration) of specialized models for form and key-value understanding.

Questions to be answered

What is a useful data model for form objects, for key-value regions and for pictures that stays compatible with the current item types?
- e.g. is a GroupItem sufficient?
- How to link values to keys?
How should picture content be represented before applying enrichment pipelines?
- As plain children with text items? Other?
Which reference models exist in the open literature / open-source for form- and key-value extraction, and could any be candidates to integrate?
More questions...
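
As one possible starting point for the data-model questions above, here is a minimal sketch in which a key-value region is a group of ordinary text cells plus explicit key-to-value links stored by reference. None of this is existing DoclingDocument API: `TextCell`, `KeyValueRegion`, the `role` field and the method names are hypothetical and use plain dataclasses only to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class TextCell:
    """Hypothetical text item; `role` marks it as key, value, checkbox, or plain text."""
    cell_id: int
    text: str
    role: str = "text"


@dataclass
class KeyValueRegion:
    """Hypothetical group of cells with (key_id, value_id) links between them."""
    cells: list[TextCell] = field(default_factory=list)
    links: list[tuple[int, int]] = field(default_factory=list)

    def link(self, key_id: int, value_id: int) -> None:
        """Record that the cell `value_id` is a value of the cell `key_id`."""
        known = {cell.cell_id for cell in self.cells}
        if key_id not in known or value_id not in known:
            raise KeyError("both cells must belong to this region")
        self.links.append((key_id, value_id))

    def as_pairs(self) -> dict[str, list[str]]:
        """Resolve the links into key text -> list of value texts."""
        by_id = {cell.cell_id: cell for cell in self.cells}
        pairs: dict[str, list[str]] = {}
        for key_id, value_id in self.links:
            pairs.setdefault(by_id[key_id].text, []).append(by_id[value_id].text)
        return pairs


region = KeyValueRegion(
    cells=[
        TextCell(0, "Name:", role="key"),
        TextCell(1, "Jane Doe", role="value"),
        TextCell(2, "Date:", role="key"),
        TextCell(3, "2024-11-11", role="value"),
    ]
)
region.link(0, 1)
region.link(2, 3)
print(region.as_pairs())
```

Storing links as references rather than nesting values under keys keeps each cell usable as a plain text item, which may help with compatibility, and it allows one key to point to several values. Whether a plain group is sufficient or a dedicated item type is needed remains the open question.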

Everyone is invited to contribute to this discussion and provide feedback or examples from other solutions.
