-
@naufalso This is high on our list as well. We already have the hierarchy for docx and html, and are now working on adding it for pdf. The problem with the latter is that section headers are detected via object detection, so a priori we have no information about what level each header is. We are first trying to use the table of contents in the PDF, but hopefully we will soon have a more dedicated model for this.
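To make the table-of-contents idea concrete, here is a minimal sketch. It is not Docling's implementation: `assign_levels`, its inputs, and the fallback level are hypothetical, and it assumes the detected headers and the TOC entries are already available as plain strings.

```python
def assign_levels(detected_headers, toc, default_level=2):
    """Attach a heading level to each detected header by looking it up in the TOC.

    detected_headers: header strings in reading order (hypothetical input).
    toc: list of (level, title) pairs taken from the PDF outline (hypothetical input).
    Headers missing from the TOC fall back to default_level.
    """
    def norm(text):
        # Case-insensitive match that ignores differences in whitespace.
        return " ".join(text.lower().split())

    lookup = {}
    for level, title in toc:
        # Keep the first level seen if a title appears more than once.
        lookup.setdefault(norm(title), level)

    return [(lookup.get(norm(h), default_level), h) for h in detected_headers]


toc = [(1, "Introduction"), (2, "Related  Work")]
headers = ["Introduction", "related work", "Appendix"]
print(assign_levels(headers, toc))
```

Exact string matching is the weak point of this sketch: real PDFs often have TOC titles that differ slightly from the header text on the page, which is one reason a dedicated model may be needed.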
-
Hello,
First, I want to express my gratitude to the team for creating such an impressive tool! Docling has been incredibly useful in converting documents to Markdown format with ease and precision. Your efforts in building this robust solution are greatly appreciated.
While using Docling, I noticed that the Markdown output consistently uses second-level headings (`##`) for splitting sections. This approach works well for many scenarios, but I wonder if it is intended behavior.
For my use case, preserving the original header structure of a PDF (e.g., chapters, sections, and subsections) in the Markdown output is essential. Maintaining this hierarchy would allow for more nuanced data splitting by headers while keeping the context intact. This feature would be particularly useful when leveraging tools like LangChain's `MarkdownHeaderTextSplitter`.
Is there a way to configure Docling to maintain the original document's header hierarchy in the Markdown output? If this feature isn’t currently available, are there plans to support it in future updates?
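To illustrate why the hierarchy matters for header-based splitting, here is a minimal standard-library sketch. It is not LangChain's `MarkdownHeaderTextSplitter`; `split_by_headers` is a hypothetical helper that only mimics the general idea, and it does not handle `#` lines inside fenced code blocks.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(markdown):
    """Split Markdown into chunks, each tagged with the headers it sits under."""
    chunks = []
    path = {}   # heading level -> title currently in scope
    body = []   # lines collected since the last header

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headers": [path[k] for k in sorted(path)], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            for k in [k for k in path if k >= level]:
                del path[k]
            path[level] = match.group(2)
        else:
            body.append(line)
    flush()
    return chunks


nested = "# Chapter 1\nIntro.\n## Section 1.1\nDetails.\n"
flat = "## Chapter 1\nIntro.\n## Section 1.1\nDetails.\n"

print(split_by_headers(nested)[-1]["headers"])  # ['Chapter 1', 'Section 1.1']
print(split_by_headers(flat)[-1]["headers"])    # ['Section 1.1']
```

With nested headings, the "Details." chunk still knows it belongs to Chapter 1. With everything flattened to `##`, that parent context is lost.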
Thank you again for your excellent work and for considering this feature request!