Numbered headings in Word documents appear as list items #612

Open
mattmalcher opened this issue Dec 16, 2024 · 3 comments
mattmalcher commented Dec 16, 2024

First off, thank you for docling! <3

A standard representation, maintaining context and hierarchy, for content across multiple formats, with an MIT licence, is just super! I'm a fan of features like the hybrid text chunker.

Bug

Lots of long technical documents use multilevel lists in Word to number their sections.

These documents sometimes also include numbered paragraphs.

At the moment, in the Word backend, docling checks whether an item is a list item and handles that case separately, before checking whether it is a heading.

See:

if numid is not None and ilevel is not None:
    self.add_listitem(
        element,
        docx_obj,
        doc,
        p_style_id,
        p_level,
        numid,
        ilevel,
        text,
        is_numbered,
    )
    self.update_history(p_style_id, p_level, numid, ilevel)
    return
elif numid is None and self.prev_numid() is not None:  # Close list
    for key, val in self.parents.items():
        if key >= self.level_at_new_list:
            self.parents[key] = None
    self.level = self.level_at_new_list - 1
    self.level_at_new_list = None
if p_style_id in ["Title"]:
    for key, val in self.parents.items():
        self.parents[key] = None
    self.parents[0] = doc.add_text(
        parent=None, label=DocItemLabel.TITLE, text=text
    )
elif "Heading" in p_style_id:
    self.add_header(element, docx_obj, doc, p_style_id, p_level, text)
elif p_style_id in [
    "Paragraph",
    "Normal",
    "Subtitle",
    "Author",
    "DefaultText",
    "ListParagraph",
    "ListBullet",
    "Quote",
]:
    level = self.get_level()
    doc.add_text(
        label=DocItemLabel.PARAGRAPH, parent=self.parents[level - 1], text=text
    )
else:
    # Text style names can, and will have, not only default values but user values too
    # hence we treat all other labels as pure text
    level = self.get_level()
    doc.add_text(
        label=DocItemLabel.PARAGRAPH, parent=self.parents[level - 1], text=text
    )
self.update_history(p_style_id, p_level, numid, ilevel)
return

So paragraphs that are both a list item and a heading just get treated as a list item. It would probably be more useful to treat them as headings, and convert the list index into plain text.
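
To make the suggestion concrete, here is a rough sketch of the ordering I have in mind. It is not docling code: `Para`, `HeadingNumberer` and `classify` are hypothetical names made up for illustration, the "1." / "1.1." number format is an assumption, and it assumes heading levels appear in order (level 0 before level 1). The only point is that the heading check runs before the list-item check, and the list index is rendered into the heading text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Para:
    """Hypothetical stand-in for the fields the backend reads from a paragraph."""
    style_id: str
    text: str
    numid: Optional[int] = None
    ilevel: Optional[int] = None


class HeadingNumberer:
    """Tracks multilevel counters so a numbered heading can carry its index as text."""

    def __init__(self) -> None:
        self.counters: list[int] = []

    def next_index(self, ilevel: int) -> str:
        # Grow the counter list if this level has not been seen yet.
        while len(self.counters) <= ilevel:
            self.counters.append(0)
        self.counters[ilevel] += 1
        # Deeper levels restart whenever a shallower level advances.
        del self.counters[ilevel + 1:]
        # Assumed format: "1.", "1.1.", "1.1.1.", ...
        return ".".join(str(n) for n in self.counters) + "."


def classify(para: Para, numberer: HeadingNumberer) -> tuple[str, str]:
    """Return (kind, text), checking the heading style before list numbering."""
    is_numbered = para.numid is not None and para.ilevel is not None
    if "Heading" in para.style_id:
        if is_numbered:
            # Numbered heading: keep it a heading, prefix the index as plain text.
            return "heading", f"{numberer.next_index(para.ilevel)} {para.text}"
        return "heading", para.text
    if is_numbered:
        return "list_item", para.text
    return "paragraph", para.text
```

With this ordering, a paragraph styled `Heading1` that also carries numbering comes out as the heading "1. Section 1" rather than as a list item, while a numbered `ListParagraph` is still a list item.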

I have had a go at adding a failing unit test, by adding a modified copy of unit_test_headers.docx and the expected ground truths for this case in a fork here: a544360

Have also attached the same example to this issue: unit_test_headers_numbered.docx

Current output:

# Test Document

- Section 1

Paragraph 1.1

Paragraph 1.2

Expected output:

# Test Document
## 1. Section 1

Paragraph 1.1

Paragraph 1.2

Steps to reproduce

Parse a word document with numbered headings like: unit_test_headers_numbered.docx
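
For reference, a minimal way to run the conversion, assuming docling is installed and the attached file is in the working directory. `docx_to_markdown` is just a helper name I picked for this snippet.

```python
from pathlib import Path


def docx_to_markdown(path: str) -> str:
    """Convert a .docx file with docling and return its Markdown export."""
    # Imported lazily so the helper can be defined without docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(Path(path))
    return result.document.export_to_markdown()
```

Calling `print(docx_to_markdown("unit_test_headers_numbered.docx"))` is how I would expect to see the "Current output" shown above.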

Docling version

Docling version: 2.12.0
Docling Core version: 2.9.0
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.12.3

mattmalcher added the bug label Dec 16, 2024

mattmalcher (Author) commented

I think there is also a related issue where sometimes the first item of a list that is within a numbered heading section will go missing.

If useful I can create a failing test for that too?

cau-git (Contributor) commented Dec 18, 2024

@mattmalcher If you can provide us with failing tests that would be very helpful for checking, thanks.

mattmalcher (Author) commented Dec 19, 2024

I have added two failing tests, with ground truths in a branch in a fork here: https://github.com/mattmalcher/docling/tree/issue_612_docx_numbered_headings

For the issue with text going missing where numbered headings are involved:

Original Document: [image]

Expected (Markdown): [image]

Actual (Markdown): [image]
Note that heading 1.2 here has gone altogether!
