Docling creates well-structured output for many common item types with a plain-text representation (paragraphs, headings, captions, lists, formulas) and for tables, including their cell structure. We want to extend Docling to also create well-structured output for pictures, forms, and key-value regions. Supporting these content types requires closing a few gaps.
Current situation
The layout model in docling-ibm-models is trained to recognize forms and key-value regions, but both classes are currently ignored in downstream processing. As a consequence, the items contained in a form or a key-value region are treated as individual items without that context. In key-value regions, the key and value text is therefore represented as plain text items, with no connection between key and value, and not ordered correctly. In forms, extra elements such as checkboxes, groups of choices, and other elements are likewise represented as plain text items without grouping or useful order.
For content detected as a Picture item, the text content inside is ignored by default (even if OCR detected text inside). We have examples which outline how to build picture-enrichment models, but they are not used by default and so far don't exploit the known text content inside picture items (see here).
Planned extensions
This topic requires work in several steps to prepare Docling for the additional content types.
Collect form and key-value material for proper testing
Evaluate the accuracy of form and key-value region detection in the layout model (in docling-ibm-models) and understand where it causes confusion (especially with table detections)
Improve the post-processing of the layout model to:
accept overlapping cluster proposals in the case of forms, key-value regions and pictures
tune the confidence thresholds for the new classes
Create meaningful data structures for forms, key-value regions and picture content
Extend DoclingDocument with new data models for respective item types
Adapt code in the page assembler to create the new data structures (see here)
Update the reading-order and other post-processing models
Update exporter methods (e.g. to markdown, doctags, HTML)
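To make the post-processing step above more concrete, here is a minimal sketch of per-class confidence thresholds combined with an overlap rule that lets container classes (forms, key-value regions, pictures) coexist with the items inside them. It is not the existing Docling post-processing code: the `Cluster` type, the label strings, the threshold values, and the `overlap_limit` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass(frozen=True)
class Cluster:
    """Hypothetical layout proposal; not a Docling class."""
    label: str
    confidence: float
    bbox: BBox


# Assumed label names and threshold values, to be tuned on real test material.
CONTAINER_LABELS = {"form", "key_value_region", "picture"}
THRESHOLDS = {"form": 0.45, "key_value_region": 0.45, "picture": 0.5}
DEFAULT_THRESHOLD = 0.5


def area(box: BBox) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap_ratio(a: BBox, b: BBox) -> float:
    """Intersection area divided by the area of the smaller box."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    smaller = min(area(a), area(b))
    return inter / smaller if smaller > 0 else 0.0


def filter_clusters(clusters: List[Cluster], overlap_limit: float = 0.8) -> List[Cluster]:
    # Step 1: drop proposals below their class-specific confidence threshold.
    confident = [
        c for c in clusters
        if c.confidence >= THRESHOLDS.get(c.label, DEFAULT_THRESHOLD)
    ]
    # Step 2: resolve overlaps, highest confidence first. Container clusters
    # are always kept, and they never suppress the items they contain.
    kept: List[Cluster] = []
    for cand in sorted(confident, key=lambda c: c.confidence, reverse=True):
        if cand.label in CONTAINER_LABELS:
            kept.append(cand)
            continue
        clashes = any(
            k.label not in CONTAINER_LABELS
            and overlap_ratio(cand.bbox, k.bbox) > overlap_limit
            for k in kept
        )
        if not clashes:
            kept.append(cand)
    return kept


proposals = [
    Cluster("form", 0.60, (0, 0, 100, 100)),
    Cluster("text", 0.90, (10, 10, 90, 20)),
    Cluster("text", 0.70, (10, 10, 90, 21)),             # near-duplicate of the text above
    Cluster("key_value_region", 0.30, (0, 50, 100, 90)),  # below its threshold
]
result = filter_clusters(proposals)
```

In this sketch the form survives even though it fully contains the text cluster, while the near-duplicate text proposal and the low-confidence key-value region are dropped. Whether containers should also be de-duplicated against each other (for example a form overlapping a table) is left open here.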
Additionally, to ensure high-quality results for difficult samples, we will need to invest in the development (or third-party integration) of specialized models for form and key-value understanding.
Questions to be answered
What is a useful data model for form objects, for key-value regions, and for pictures that stays compatible with the current item types?
e.g. is a GroupItem sufficient?
How to link values to keys?
How should picture content be represented before applying enrichment pipelines?
As plain children with text items? Other?
Which reference models exist in the open literature / open-source for form- and key-value extraction, and could any be candidates to integrate?
More questions...
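As one possible starting point for the question of how to link values to keys, the sketch below stores the cells of a key-value region flat and records explicit (key, value) links between them, so that one key can carry several values. All names here (`Cell`, `KeyValueRegion`, `links`, `pairs`) are hypothetical and are not part of DoclingDocument; this is only meant to give the discussion something concrete to react to.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Cell:
    """Hypothetical text cell inside a key-value region."""
    cell_id: int
    text: str
    role: str  # "key" or "value"


@dataclass
class KeyValueRegion:
    """Hypothetical container: flat cells plus explicit key-to-value links."""
    cells: List[Cell] = field(default_factory=list)
    links: List[Tuple[int, int]] = field(default_factory=list)  # (key_id, value_id)

    def pairs(self) -> Dict[str, List[str]]:
        """Resolve the links into a mapping of key text to value texts."""
        by_id = {cell.cell_id: cell for cell in self.cells}
        resolved: Dict[str, List[str]] = {}
        for key_id, value_id in self.links:
            resolved.setdefault(by_id[key_id].text, []).append(by_id[value_id].text)
        return resolved


region = KeyValueRegion(
    cells=[
        Cell(0, "Name", "key"),
        Cell(1, "Jane Doe", "value"),
        Cell(2, "Phone", "key"),
        Cell(3, "555-0100", "value"),
        Cell(4, "555-0101", "value"),
    ],
    links=[(0, 1), (2, 3), (2, 4)],
)
resolved_pairs = region.pairs()
```

Keeping links separate from the cells means the cells could remain ordinary text items, which may help with compatibility. A plain group of children, by contrast, would capture membership in the region but not which value belongs to which key.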
Everyone is invited to contribute to this discussion and provide feedback or examples from other solutions.