First off, thank you for docling! <3
A standard representation, maintaining context and hierarchy, for content across multiple formats, with an MIT licence is just super! Fan of features like the hybrid text chunker.
Bug
Lots of long technical documents use multilevel lists in Word to produce numbered sections.
These documents sometimes also include numbered paragraphs.
At the moment, in the Word backend, docling checks whether an item is a list item and handles that case separately, before checking whether it is a heading.
see: docling/docling/backend/msword_backend.py, lines 244 to 297 in 3bb3bf5
So paragraphs that are both a list item and a heading just get treated as a list item. It would probably be more useful to treat them as headings and convert the list index into plaintext.
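As a rough illustration of the ordering I mean, here is a minimal self-contained sketch. It is not docling's actual code: `Para`, `heading_level` and `classify` are made-up names, and I am assuming headings can be recognised from a style name like "Heading 2" and that the rendered list number is available as a string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Para:
    """Hypothetical stand-in for a parsed Word paragraph (not a docling type)."""
    text: str
    style_name: str                   # e.g. "Heading 1", "List Paragraph", "Normal"
    list_index: Optional[str] = None  # rendered number such as "1.2", if numbered


def heading_level(style_name: str) -> Optional[int]:
    """Return the level for styles named like 'Heading 2', else None."""
    prefix = "Heading "
    suffix = style_name[len(prefix):]
    if style_name.startswith(prefix) and suffix.isdigit():
        return int(suffix)
    return None


def classify(para: Para) -> tuple[str, str]:
    """Check for a heading first, so numbered headings are not treated as list items."""
    level = heading_level(para.style_name)
    if level is not None:
        # Assumption: the list index is folded into the heading text as plaintext.
        text = f"{para.list_index} {para.text}" if para.list_index else para.text
        return (f"heading-{level}", text)
    if para.list_index is not None:
        return ("list-item", para.text)
    return ("paragraph", para.text)
```

With this ordering, a paragraph styled "Heading 1" that is also numbered "1" comes out as a heading whose text starts with "1", while ordinary numbered paragraphs are still handled as list items.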
I have had a go at adding a failing unit test, by adding a modified copy of unit_test_headers.docx and the expected ground truths for this case in a fork here: a544360

Have also attached the same example to this issue: unit_test_headers_numbered.docx
Current output:
Expected output:
Steps to reproduce
Parse a Word document with numbered headings, like: unit_test_headers_numbered.docx
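Something like the following should reproduce it. This is a sketch that assumes the attached file has been saved in the working directory; `convert_to_markdown` is just a helper name I made up, and the import sits inside the function so the snippet loads even where docling is not installed.

```python
def convert_to_markdown(path: str) -> str:
    """Convert a document with docling and return its Markdown export."""
    # Imported lazily so that defining this helper does not require docling.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(path)
    return result.document.export_to_markdown()


# Example call (needs docling installed and the attached file on disk):
# print(convert_to_markdown("unit_test_headers_numbered.docx"))
```

In the Markdown export, the numbered headings show up as list items rather than as headings.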
Docling version
Python version
Python 3.12.3