feat: Updated Layout processing with forms and key-value areas (#530)
* Upgraded Layout Postprocessing, sending old code back to ERZ

Signed-off-by: Christoph Auer <[email protected]>

* Implement hierarchical cluster layout processing

Signed-off-by: Christoph Auer <[email protected]>

* Pass nested cluster processing through full pipeline

Signed-off-by: Christoph Auer <[email protected]>

* Pass nested clusters through GLM as payload

Signed-off-by: Christoph Auer <[email protected]>

* Move to_docling_document from ds-glm to this repo

Signed-off-by: Christoph Auer <[email protected]>

* Clean up imports again

Signed-off-by: Christoph Auer <[email protected]>

* feat(Accelerator): Introduce options to control num_threads and the device from the API, environment variables, and the CLI.
- Introduce AcceleratorOptions and AcceleratorDevice, and use them to set the device on which the models run.
- Introduce accelerator_utils with a function that decides the device and resolves the AUTO setting.
- Refactor how the docling-ibm-models are called to match the models' new init signature.
- Translate the accelerator options into the specific inputs for third-party models.
- Extend the docling CLI with parameters to set num_threads and the device.
- Add new unit tests.
- Add a new example showing how to use the accelerator options.

* fix: Improve the pydantic objects in the pipeline_options and imports.

Signed-off-by: Nikos Livathinos <[email protected]>

* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model

Signed-off-by: Nikos Livathinos <[email protected]>

* Updated test ground-truth

Signed-off-by: Christoph Auer <[email protected]>

* Updated test ground-truth (again), bugfix for empty layout

Signed-off-by: Christoph Auer <[email protected]>

* fix: Do proper check to set the device in EasyOCR, RapidOCR.

Signed-off-by: Nikos Livathinos <[email protected]>

* fix: Correct the way to set GPU for EasyOCR, RapidOCR

Signed-off-by: Nikos Livathinos <[email protected]>

* fix: OCR AcceleratorDevice

Signed-off-by: Nikos Livathinos <[email protected]>

* Merge pull request #556 from DS4SD/cau/layout-processing-improvement

feat: layout processing improvements and bugfixes

* Update lockfile

Signed-off-by: Christoph Auer <[email protected]>

* Update tests

Signed-off-by: Christoph Auer <[email protected]>

* Update HF model ref, reset test generate

Signed-off-by: Christoph Auer <[email protected]>

* Repin to release package versions

Signed-off-by: Christoph Auer <[email protected]>

* Many layout processing improvements, add document index type

Signed-off-by: Christoph Auer <[email protected]>

* Update pinnings to docling-core

Signed-off-by: Christoph Auer <[email protected]>

* Update test GT

Signed-off-by: Christoph Auer <[email protected]>

* Fix table box snapping

Signed-off-by: Christoph Auer <[email protected]>

* Fixes for cluster pre-ordering

Signed-off-by: Christoph Auer <[email protected]>

* Introduce OCR confidence, propagate to orphan in post-processing

Signed-off-by: Christoph Auer <[email protected]>

* Fix form and key value area groups

Signed-off-by: Christoph Auer <[email protected]>

* Adjust confidence in EasyOcr

Signed-off-by: Christoph Auer <[email protected]>

* Roll back CLI changes from main

Signed-off-by: Christoph Auer <[email protected]>

* Update test GT

Signed-off-by: Christoph Auer <[email protected]>

* Update docling-core pinning

Signed-off-by: Christoph Auer <[email protected]>

* Annoying fixes for historical python versions

Signed-off-by: Christoph Auer <[email protected]>

* Updated test GT for legacy

Signed-off-by: Christoph Auer <[email protected]>

* Comment cleanup

Signed-off-by: Christoph Auer <[email protected]>

---------

Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Nikos Livathinos <[email protected]>
Co-authored-by: Nikos Livathinos <[email protected]>
cau-git and nikos-livathinos authored Dec 17, 2024
1 parent 00dec7a commit 60dc852
Showing 56 changed files with 1,651 additions and 1,710 deletions.
9 changes: 8 additions & 1 deletion docling/datamodel/base_models.py
@@ -129,6 +129,7 @@ class Cluster(BaseModel):
     bbox: BoundingBox
     confidence: float = 1.0
     cells: List[Cell] = []
+    children: List["Cluster"] = []  # Add child cluster support


 class BasePageElement(BaseModel):
@@ -143,6 +144,12 @@ class LayoutPrediction(BaseModel):
     clusters: List[Cluster] = []


+class ContainerElement(
+    BasePageElement
+):  # Used for Form and Key-Value-Regions, only for typing.
+    pass
+
+
 class Table(BasePageElement):
     otsl_seq: List[str]
     num_rows: int = 0
@@ -182,7 +189,7 @@ class PagePredictions(BaseModel):
     equations_prediction: Optional[EquationPrediction] = None


-PageElement = Union[TextElement, Table, FigureElement]
+PageElement = Union[TextElement, Table, FigureElement, ContainerElement]


 class AssembledUnit(BaseModel):
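The new `children` field makes `Cluster` a recursive structure. As a rough illustration of what that enables, the sketch below walks a nested cluster tree in pre-order. It uses a plain dataclass as a stand-in for the pydantic model, and the `id` and `label` fields, the `walk` helper, and the sample labels are assumptions made for this example, not code from the commit.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Cluster:
    """Hypothetical stand-in for the pydantic Cluster model in the diff."""

    id: int  # assumed field, for illustration only
    label: str  # assumed field, for illustration only
    confidence: float = 1.0
    children: List["Cluster"] = field(default_factory=list)


def walk(cluster: Cluster, depth: int = 0) -> Iterator[Tuple[int, Cluster]]:
    """Yield (depth, cluster) pairs in pre-order: parent first, then children."""
    yield depth, cluster
    for child in cluster.children:
        yield from walk(child, depth + 1)


# A form region that contains two child clusters.
form = Cluster(
    id=0,
    label="form",
    children=[
        Cluster(id=1, label="text"),
        Cluster(id=2, label="checkbox_selected"),
    ],
)

flattened = [(depth, c.label) for depth, c in walk(form)]
print(flattened)
```

Note that the dataclass needs `default_factory` for the list default, whereas pydantic copies a mutable default such as `[]` per instance, which is presumably why the diff can write `= []` directly.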
4 changes: 3 additions & 1 deletion docling/datamodel/document.py
@@ -73,7 +73,7 @@

 layout_label_to_ds_type = {
     DocItemLabel.TITLE: "title",
-    DocItemLabel.DOCUMENT_INDEX: "table-of-contents",
+    DocItemLabel.DOCUMENT_INDEX: "table",
     DocItemLabel.SECTION_HEADER: "subtitle-level-1",
     DocItemLabel.CHECKBOX_SELECTED: "checkbox-selected",
     DocItemLabel.CHECKBOX_UNSELECTED: "checkbox-unselected",
@@ -88,6 +88,8 @@
     DocItemLabel.PICTURE: "figure",
     DocItemLabel.TEXT: "paragraph",
     DocItemLabel.PARAGRAPH: "paragraph",
+    DocItemLabel.FORM: DocItemLabel.FORM.value,
+    DocItemLabel.KEY_VALUE_REGION: DocItemLabel.KEY_VALUE_REGION.value,
 }

 _EMPTY_DOCLING_DOC = DoclingDocument(name="dummy")
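The two new entries reuse the enum's own `.value` as the legacy type string, and callers elsewhere in this commit look labels up with `.get()`. A minimal sketch of that pattern follows; the reduced `DocItemLabel` enum, its string values, and the `ds_type_for` helper are assumptions for illustration and may not match docling-core exactly.

```python
from enum import Enum
from typing import Optional


class DocItemLabel(str, Enum):
    """Hypothetical subset of the label enum; the string values are assumed."""

    TITLE = "title"
    FORM = "form"
    KEY_VALUE_REGION = "key_value_region"
    FOOTNOTE = "footnote"  # deliberately left out of the mapping below


layout_label_to_ds_type = {
    DocItemLabel.TITLE: "title",
    # Reusing .value keeps the legacy type string in sync with the enum.
    DocItemLabel.FORM: DocItemLabel.FORM.value,
    DocItemLabel.KEY_VALUE_REGION: DocItemLabel.KEY_VALUE_REGION.value,
}


def ds_type_for(label: DocItemLabel) -> Optional[str]:
    """Return the legacy type string, or None when the label is unmapped."""
    return layout_label_to_ds_type.get(label)


print(ds_type_for(DocItemLabel.FORM))
print(ds_type_for(DocItemLabel.FOOTNOTE))
```

Because `.get()` returns `None` for an unmapped label, any label missing from the dictionary would silently produce an empty type, so new labels need a matching entry.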
2 changes: 2 additions & 0 deletions docling/datamodel/pipeline_options.py
@@ -139,6 +139,8 @@ class EasyOcrOptions(OcrOptions):

     use_gpu: Optional[bool] = None

+    confidence_threshold: float = 0.65
+
     model_storage_directory: Optional[str] = None
     recog_network: Optional[str] = "standard"
     download_enabled: bool = True
1 change: 1 addition & 0 deletions docling/datamodel/settings.py
@@ -31,6 +31,7 @@ class DebugSettings(BaseModel):
     visualize_cells: bool = False
     visualize_ocr: bool = False
     visualize_layout: bool = False
+    visualize_raw_layout: bool = False
     visualize_tables: bool = False

     profile_pipeline_timings: bool = False
38 changes: 34 additions & 4 deletions docling/models/ds_glm_model.py
@@ -22,9 +22,15 @@
 from docling_core.types.legacy_doc.document import CCSFileInfoObject as DsFileInfoObject
 from docling_core.types.legacy_doc.document import ExportedCCSDocument as DsDocument
 from PIL import ImageDraw
-from pydantic import BaseModel, ConfigDict
-
-from docling.datamodel.base_models import Cluster, FigureElement, Table, TextElement
+from pydantic import BaseModel, ConfigDict, TypeAdapter
+
+from docling.datamodel.base_models import (
+    Cluster,
+    ContainerElement,
+    FigureElement,
+    Table,
+    TextElement,
+)
 from docling.datamodel.document import ConversionResult, layout_label_to_ds_type
 from docling.datamodel.settings import settings
 from docling.utils.glm_utils import to_docling_document
@@ -204,7 +210,31 @@ def make_spans(cell):
                 )
             ],
             obj_type=layout_label_to_ds_type.get(element.label),
-            # data=[[]],
+            payload={
+                "children": TypeAdapter(List[Cluster]).dump_python(
+                    element.cluster.children
+                )
+            },  # hack to channel child clusters through GLM
         )
     )
+elif isinstance(element, ContainerElement):
+    main_text.append(
+        BaseText(
+            text="",
+            payload={
+                "children": TypeAdapter(List[Cluster]).dump_python(
+                    element.cluster.children
+                )
+            },  # hack to channel child clusters through GLM
+            obj_type=layout_label_to_ds_type.get(element.label),
+            name=element.label,
+            prov=[
+                Prov(
+                    bbox=target_bbox,
+                    page=element.page_no + 1,
+                    span=[0, 0],
+                )
+            ],
+        )
+    )

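The `payload` entries above serialize child clusters into plain Python data with pydantic's `TypeAdapter` so they can pass through GLM. The sketch below shows one way such a round trip can look, assuming pydantic v2. The reduced `Cluster` model (no `bbox` or `cells`, with assumed `id` and `label` fields) and the use of `validate_python` on the receiving side are illustrative choices, not necessarily how the rest of the pipeline restores the clusters.

```python
from typing import Any, Dict, List

from pydantic import BaseModel, TypeAdapter


class Cluster(BaseModel):
    """Reduced, hypothetical version of the Cluster model (no bbox or cells)."""

    id: int  # assumed field, for illustration only
    label: str  # assumed field, for illustration only
    confidence: float = 1.0
    children: List["Cluster"] = []


# One adapter handles both directions for a list of clusters.
cluster_list_adapter = TypeAdapter(List[Cluster])

children = [
    Cluster(id=1, label="text"),
    Cluster(id=2, label="text", confidence=0.8),
]

# Dump to plain dicts so the data can ride along in a generic payload field.
payload: Dict[str, Any] = {"children": cluster_list_adapter.dump_python(children)}

# On the receiving side, validate the plain dicts back into Cluster objects.
restored = cluster_list_adapter.validate_python(payload["children"])

print(payload["children"][1]["confidence"])
print([c.label for c in restored])
```

Dumping to plain dictionaries keeps the payload independent of the `Cluster` class, at the cost of re-validating the data wherever it is read back.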
1 change: 1 addition & 0 deletions docling/models/easyocr_model.py
@@ -118,6 +118,7 @@ def __call__(
         ),
     )
     for ix, line in enumerate(result)
+    if line[2] >= self.options.confidence_threshold
 ]
 all_ocr_cells.extend(cells)

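The added `if` clause drops OCR lines whose confidence falls below `confidence_threshold` (default 0.65 in `EasyOcrOptions`). A minimal stand-alone sketch of the same filter is shown here. It assumes EasyOCR-style result tuples of `(bounding box, text, confidence)`; the `filter_by_confidence` helper and the sample data are made up for illustration.

```python
from typing import List, Tuple

# Each entry mimics an EasyOCR-style result: (bounding box, text, confidence).
OcrLine = Tuple[List[List[int]], str, float]


def filter_by_confidence(
    result: List[OcrLine], threshold: float = 0.65
) -> List[OcrLine]:
    """Keep only lines whose confidence (index 2) meets the threshold."""
    return [line for line in result if line[2] >= threshold]


# Sample data, invented for this example.
result: List[OcrLine] = [
    ([[0, 0], [50, 0], [50, 10], [0, 10]], "Invoice", 0.98),
    ([[0, 20], [50, 20], [50, 30], [0, 30]], "l|l1", 0.31),
    ([[0, 40], [50, 40], [50, 50], [0, 50]], "Total", 0.65),
]

kept = [text for _, text, _ in filter_by_confidence(result)]
print(kept)
```

The comparison is inclusive, so a line scoring exactly at the threshold is kept.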