Replies: 17 comments 4 replies
-
The current notebook covers splitting oversized chunks so that they fit within a maximum number of tokens. If one wants to also merge "undersized" chunks, so that they better approach that limit, the most straightforward case to consider is merging consecutive chunks of the same context, i.e. the same "headings" and "captions" metadata in our example case. Wrapping any such implementation as a …
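For concreteness, here is a minimal sketch of such a merge step, assuming chunk objects with a `text` field and `meta.headings`/`meta.captions` metadata, plus a `count_tokens` helper (these names are illustrative assumptions, not the notebook's actual API):

```python
from copy import deepcopy

def merge_undersized_peers(chunks, count_tokens, max_tokens):
    # greedily merge consecutive chunks that share the same headings/captions
    # metadata, as long as the merged text stays within the token budget
    merged = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.meta.headings == chunk.meta.headings
            and prev.meta.captions == chunk.meta.captions
            and count_tokens(prev.text + "\n" + chunk.text) <= max_tokens
        ):
            prev.text += "\n" + chunk.text  # same context and still fits: merge
        else:
            merged.append(deepcopy(chunk))  # new context or would overflow: keep separate
    return merged
```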
-
I made a PR with an alternative notebook that does address the issue of merging within sections. As noted in the PR description, the major differences in the version I produced are:
Of these, I think 1, 2, and 4 are probably improvements over the version that @vagenas proposed. However, 3 should probably be undone, since the titles will be in the headers list soon. Also, 5 is clearly an area where my version is worse than the one @vagenas proposed, but I am not sure whether it is important enough to address. There are also lots of other minor technical differences (e.g., I have my own subclasses of BaseChunk and BaseMeta because I couldn't find a way to construct instances of the ones in the product) that would be good to get resolved.
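For illustration, such stand-in models can be as simple as the following sketch (assuming plain pydantic models; the field names are illustrative, not the product's actual schema):

```python
from typing import Optional
from pydantic import BaseModel

class MyChunkMeta(BaseModel):
    # illustrative fields only, not the product's actual schema
    headings: Optional[list[str]] = None
    captions: Optional[list[str]] = None

class MyChunk(BaseModel):
    text: str
    meta: MyChunkMeta
```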
-
To illustrate my point 2 above, I went ahead and updated the notebook in my PR with new … You can see the latest draft here. The latest draft was rebased on the one that @vagenas put in his PR and includes that addition. Also, I dropped the pip install of …
-
I removed the use of DoclingDocument.name (which I had assumed to be the title) from my version of the notebook. As discussed above, that did not turn out to be a good way to get a document title after all.
-
@jwm4 (cc @ceberam) Regarding semchunk: it looks like too thin a layer to warrant an additional dependency. Perhaps we can instead look at how to incorporate some of its basic ideas? Regarding the final outcome: …
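For example, semchunk's core idea (recursively splitting on progressively finer separators until every piece fits the token budget) can be sketched in a few lines; the `count_tokens` helper and the separator hierarchy below are illustrative assumptions, not semchunk's actual implementation:

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # progressively finer split points

def split_recursively(text, count_tokens, max_tokens, level=0):
    # base case: the text fits, or we have run out of separators to try
    if count_tokens(text) <= max_tokens or level >= len(SEPARATORS):
        return [text]
    sep = SEPARATORS[level]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # still fits: keep accumulating
        else:
            if current:
                pieces.append(current)
            if count_tokens(part) <= max_tokens:
                current = part  # start a fresh piece
            else:
                # the part alone is oversized: recurse with a finer separator
                pieces.extend(split_recursively(part, count_tokens, max_tokens, level + 1))
                current = ""
    if current:
        pieces.append(current)
    return pieces
```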
-
We can leverage the frameworks, since they have implementations of semantic chunking using embedding models.
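As a rough sketch of what such embedding-based semantic chunking does (the model name and similarity threshold below are illustrative assumptions, not any framework's actual implementation):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, threshold=0.6):
    # embed each sentence; split wherever similarity between neighbors drops,
    # i.e. where the topic appears to shift (assumes a non-empty sentence list)
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(np.dot(emb[i - 1], emb[i]))  # cosine, since embeddings are normalized
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```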
-
@vagenas writes:
I acknowledge that it is only 300 lines of code, and the explanation of what it does is fairly simple. However, if you look closely at the benchmark results for semchunk (in their README), they really are remarkable, which suggests that this is very well-optimized code. I am not sure I understand @ceberam's proposal to leverage LlamaIndex and/or LangChain. Dragging in either or both of those packages seems like a very heavy dependency to add just to get a generic text chunker. Do we already have those dependencies? I thought the dependencies flowed the other way in the existing integrations, but I don't really know.
-
I have provided a new iteration on top of your last changes with #310 — the notebook can be directly viewed here. As you can see there, the main improvements are in:
🐛 I think a bug has been (and still is) there though: perhaps some of the processing/merging/splitting steps are not properly taking into account the headings/captions — or the tokens added with …
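If that is the cause, one hedged sketch of a fix is to count tokens on the contextualized text rather than the chunk body alone; the `tokenizer` and the serialization format below are assumptions, not the notebook's actual code:

```python
def contextualized_token_count(chunk, tokenizer):
    # count tokens on what actually goes to the embed model, i.e. including
    # headings/captions, not just the chunk body
    parts = (chunk.meta.headings or []) + (chunk.meta.captions or []) + [chunk.text]
    return len(tokenizer.tokenize("\n".join(parts)))
```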
-
Hi. Sorry for the slow reply. It has been a crazy week. Here are my thoughts:
-
Thanks for the feedback! I have now merged all changes into a single branch to simplify things:
Some more details:
Actually, I do agree we should keep …
I am taking a step beyond (1) and expanding the chunks with explicit … FYI: this differentiation between what is passed to the embedding model vs. what is passed to the generation model is also present in frameworks like LlamaIndex — albeit in a different fashion (ref), so integration may also require some further changes.
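To make the distinction concrete, a minimal sketch of the two serializations (function names and metadata layout are illustrative assumptions, not the merged branch's actual API):

```python
def text_for_embedding(chunk):
    # the embed model sees the contextualized text: headings prepended to the body
    context = "\n".join(chunk.meta.headings or [])
    return f"{context}\n{chunk.text}" if context else chunk.text

def text_for_generation(chunk):
    # the gen model can receive body and provenance separately, e.g. for grounded citations
    return {"text": chunk.text, "headings": chunk.meta.headings}
```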
-
I would like to point to the semantic chunkers from Aurelio AI. More details: https://github.com/aurelio-labs/semantic-chunkers/blob/main/docs/02-chunkers-async.ipynb
-
📣 Happy to announce that, as a result of this discussion, starting with docling 2.9.0 (or docling-core 2.8.0), Docling provides an additional chunker implementation, called HybridChunker …
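A minimal usage sketch, based on the docling docs at the time (the model ID, sample URL, and parameter values are illustrative; check the docs for the current signature):

```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("https://arxiv.org/pdf/2408.09869").document
chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",  # or a HF tokenizer instance
    max_tokens=512,    # defaults to the tokenizer's limit if omitted
    merge_peers=True,  # merge undersized sibling chunks back together
)
for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text)
```

This combines the hierarchical, metadata-aware pass discussed in this thread with the tokenization-aware splitting and peer merging from the notebooks.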
-
In terms of chunking approaches, there are various options one can consider, e.g. fixed-size chunking, document-based chunking, and others (example outline here).
Docling currently provides the HierarchicalChunker, which follows a document-based approach, i.e. it splits as dictated by the upstream document format. At the same time, it exposes various metadata that the user can include as additional context for the embedding or generation model — and also use as a source of grounding.
The exact metadata to be included in the final text that is input to the (embedding or generation) model is application-dependent and is therefore not prescribed by the HierarchicalChunker.
As an illustrative example of such post-processing steps, we have prepared an example that shows how to introduce a max-token limit — and split chunks that exceed it:
👉 notebook here
Of course, if the user is already using an LLM application framework like LlamaIndex or LangChain, they can also tap into the wide range of node parsers/splitters and post-processing components already available in those libraries — as already showcased in our examples (LlamaIndex here & LangChain here).
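For reference, a minimal sketch of the document-based approach with HierarchicalChunker and its metadata (the sample URL is illustrative; how headings are serialized into the final text is left to the application):

```python
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

doc = DocumentConverter().convert("https://arxiv.org/pdf/2408.09869").document
for chunk in HierarchicalChunker().chunk(dl_doc=doc):
    headings = chunk.meta.headings or []  # metadata usable as context and grounding
    print(" / ".join(headings), "->", chunk.text[:80])
```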