Maintaining Header Hierarchy in Markdown Output for Enhanced Section Splitting #386

naufalso · 2024-11-20T00:50:43Z

Hello,

First, I want to express my gratitude to the team for creating such an impressive tool! Docling has been incredibly useful in converting documents to Markdown format with ease and precision. Your efforts in building this robust solution are greatly appreciated.

While using Docling, I noticed that the Markdown output consistently uses second-level headings (##) for splitting sections. This approach works well for many scenarios, but I wonder if it is intended behavior.

For my use case, preserving the original header structure of a PDF (e.g., chapters, sections, and subsections) in the Markdown output is essential. Maintaining this hierarchy would allow for more nuanced data splitting by headers while keeping the context intact. This feature would be particularly useful when leveraging tools like LangChain's MarkdownHeaderTextSplitter.
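To illustrate why the hierarchy matters for splitting, here is a minimal standard-library sketch of header-based chunking. It is not LangChain's `MarkdownHeaderTextSplitter`; the function name `split_by_headers` and the chunk layout are assumptions made for this example only.

```python
import re

# Matches ATX headings such as "## Title" (levels 1 to 6).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(markdown: str):
    """Split Markdown into chunks, each tagged with its chain of parent headings.

    Hypothetical helper for illustration; fenced code blocks are not handled.
    """
    chunks = []
    path = []  # (level, title) pairs for the headings currently "open"
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headers": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes every open heading at the same or deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


flat = "## Chapter 1\nintro\n## Section 1.1\ndetails"
nested = "# Chapter 1\nintro\n## Section 1.1\ndetails"
print(split_by_headers(flat)[1]["headers"])
print(split_by_headers(nested)[1]["headers"])
```

With every heading at `##`, the second chunk only knows about "Section 1.1". With the hierarchy preserved, it also carries "Chapter 1" as context.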

Is there a way to configure Docling to maintain the original document's header hierarchy in the Markdown output? If this feature isn’t currently available, are there plans to support it in future updates?

Thank you again for your excellent work and for considering this feature request!

PeterStaar-IBM · 2024-11-20T03:23:54Z

Maintainer

@naufalso This is high on our list as well. We already have the hierarchy for docx and html, and are now working on adding it for pdf. The problem with the latter is that section headers are detected via object detection, so we have no a priori information about their level.

We are first trying to use the table of contents in the pdf, but hopefully we will soon have a more dedicated model for this.
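As a rough illustration of the table-of-contents idea (not Docling's implementation), the sketch below assumes the TOC has already been extracted as `(level, title)` pairs and that the detected section headers are plain strings. The helper names `normalize` and `levels_from_toc`, and the fallback level of 2, are assumptions for this example.

```python
import re


def normalize(title: str) -> str:
    """Lowercase, drop a leading section number, and collapse punctuation."""
    title = re.sub(r"^\s*\d+(\.\d+)*\.?\s+", "", title.lower())
    return re.sub(r"[^a-z0-9]+", " ", title).strip()


def levels_from_toc(toc, detected_headings, default_level=2):
    """Assign a level to each detected heading by matching it against the TOC.

    toc: list of (level, title) pairs, assumed to be extracted beforehand.
    detected_headings: heading strings found by layout detection.
    Headings without a TOC match fall back to default_level.
    """
    lookup = {}
    for level, title in toc:
        # Keep the first level seen if a title occurs more than once.
        lookup.setdefault(normalize(title), level)
    return [(lookup.get(normalize(h), default_level), h) for h in detected_headings]


toc = [(1, "Introduction"), (2, "Related Work")]
headings = ["1 Introduction", "1.1 Related Work", "Appendix"]
for level, text in levels_from_toc(toc, headings):
    print("#" * level + " " + text)
```

Exact matching on normalized titles is the simplest possible choice; real documents would likely need fuzzy matching and page-number checks.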

5 replies

puppetm4st3r Dec 16, 2024

It is a really hard challenge to infer the TOC of a PDF with only the layout detection. My best results came from statistical analysis of styles to infer the TOC and multiple heading levels. It works well with non-scanned PDFs. I haven't had much time to develop the idea beyond a fairly functional MVP, but if you are interested we can join forces: if you want to move forward by inferring the TOC (multi-level headings) with a statistical model plus certain heuristics, I can contribute my progress to the project =)
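A minimal sketch of what a style-statistics heuristic could look like is shown below. It is not the MVP described above, whose details are not given here. It assumes text spans are available as `(text, font_size)` pairs, treats the size covering the most characters as body text, and ranks larger sizes as heading levels. The function name `infer_heading_levels` and the `max_levels` cap are assumptions.

```python
from collections import Counter


def infer_heading_levels(spans, max_levels=3):
    """Guess heading levels from font sizes.

    spans: list of (text, font_size) pairs in reading order (assumed input).
    Returns a list of (level, text); level is None for body text.
    """
    if not spans:
        return []

    # Body size: the font size that covers the most characters.
    chars_per_size = Counter()
    for text, size in spans:
        chars_per_size[round(size, 1)] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]

    # Every size larger than body text becomes a heading level, largest first.
    heading_sizes = sorted((s for s in chars_per_size if s > body_size), reverse=True)
    level_of = {s: min(i + 1, max_levels) for i, s in enumerate(heading_sizes)}

    return [(level_of.get(round(size, 1)), text) for text, size in spans]


spans = [
    ("Chapter 1", 18.0),
    ("Some long body text that dominates the page.", 10.0),
    ("Section 1.1", 14.0),
    ("More body text here to keep size 10 dominant.", 10.0),
]
for level, text in infer_heading_levels(spans):
    print(level, text)
```

Font size alone ignores bold, numbering, and spacing cues, and as noted above it only applies to non-scanned PDFs where style information is available.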

PeterStaar-IBM Dec 17, 2024
Maintainer

That would be great!

naufalso Jan 8, 2025
Author

Hello,

I wanted to check if there are any updates regarding this issue. I’ve noticed that other PDF parser libraries, like Marker or MinerU, may have addressed this functionality.

Thank you.

PeterStaar-IBM Jan 8, 2025
Maintainer

Yes, there is a lot of progress behind the scenes. We are working hard to first build good eval sets and methods. Expect updates in the next few weeks.

naufalso Jan 8, 2025
Author

Thank you for the update! It's great to hear about the progress and the focus on creating quality evaluation sets and methods. I look forward to hearing more in the coming weeks.
