Replies: 17 comments 4 replies
-
The current notebook covers splitting oversized chunks so that they fit within a maximum number of tokens. If one wants to also merge "undersized" chunks, so that they better approach that limit, the most straightforward case to consider is merging consecutive chunks of the same context, i.e. the same "headings" and "captions" metadata in our example case. Wrapping any such implementation as a …
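For concreteness, here is a minimal sketch of such a merge step, assuming chunk objects with a `text` field and `meta.headings`/`meta.captions` metadata, plus a `count_tokens` helper (these names are illustrative assumptions, not the notebook's actual API):

```python
from copy import deepcopy

def merge_undersized_peers(chunks, count_tokens, max_tokens):
    # greedily merge consecutive chunks that share the same headings/captions
    # metadata, as long as the merged text stays within the token budget
    merged = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.meta.headings == chunk.meta.headings
            and prev.meta.captions == chunk.meta.captions
            and count_tokens(prev.text + "\n" + chunk.text) <= max_tokens
        ):
            prev.text += "\n" + chunk.text  # same context and still fits: merge
        else:
            merged.append(deepcopy(chunk))  # new context or would overflow: keep separate
    return merged
```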
-
I made a PR with an alternative notebook that does address the issue of merging within sections. As noted in the PR description, the major differences in the version I produced are:
Of these, I think 1, 2, and 4 are probably improvements over the version that @vagenas proposed. However, 3 should probably be undone, since the titles will be in the headers list soon. Also, 5 is clearly an area where my version is worse than the one @vagenas proposed, but I am not sure whether it is important enough to address. There are also lots of other minor technical differences (e.g., I have my own subclasses of BaseChunk and BaseMeta because I couldn't find a way to construct instances of the ones in the product) that would be good to get resolved.
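For illustration, such stand-in models can be as simple as the following sketch (assuming plain pydantic models; the field names are illustrative, not the product's actual schema):

```python
from typing import Optional
from pydantic import BaseModel

class MyChunkMeta(BaseModel):
    # illustrative fields only, not the product's actual schema
    headings: Optional[list[str]] = None
    captions: Optional[list[str]] = None

class MyChunk(BaseModel):
    text: str
    meta: MyChunkMeta
```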
-
To illustrate my point 2 above, I went ahead and updated the notebook in my PR with new … You can see the latest draft here. The latest draft was rebased on the one that @vagenas put in his PR and includes that addition. Also, I dropped the pip install of …
-
I removed the use of DoclingDocument.name (which I had assumed to be the title) from my version of the notebook. As discussed above, that did not turn out to be a good way to get a document title after all.
-
@jwm4 (cc @ceberam) Regarding semchunk: it looks like too thin a layer to warrant an additional dependency. Perhaps we can instead look at how to incorporate some of its basic ideas? Regarding the final outcome: …
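For example, semchunk's core idea (recursively splitting on progressively finer separators until every piece fits the token budget) can be sketched in a few lines; the `count_tokens` helper and the separator hierarchy below are illustrative assumptions, not semchunk's actual implementation:

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # progressively finer split points

def split_recursively(text, count_tokens, max_tokens, level=0):
    # base case: the text fits, or we have run out of separators to try
    if count_tokens(text) <= max_tokens or level >= len(SEPARATORS):
        return [text]
    sep = SEPARATORS[level]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # still fits: keep accumulating
        else:
            if current:
                pieces.append(current)
            if count_tokens(part) <= max_tokens:
                current = part  # start a fresh piece
            else:
                # the part alone is oversized: recurse with a finer separator
                pieces.extend(split_recursively(part, count_tokens, max_tokens, level + 1))
                current = ""
    if current:
        pieces.append(current)
    return pieces
```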
-
We can leverage the frameworks, since they have implementations of semantic chunking using embedding models.
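As a rough sketch of what such embedding-based semantic chunking does (the model name and similarity threshold below are illustrative assumptions, not any framework's actual implementation):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, threshold=0.6):
    # embed each sentence; split wherever similarity between neighbors drops,
    # i.e. where the topic appears to shift (assumes a non-empty sentence list)
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(np.dot(emb[i - 1], emb[i]))  # cosine, since embeddings are normalized
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```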
-
@vagenas writes:
I acknowledge that it is only 300 lines of code, and the explanation of what it does is fairly simple. However, if you look closely at the benchmark results for semchunk (in their README), they really are remarkable, which suggests that this is very well-optimized code. I am not sure I understand @ceberam's proposal to leverage LlamaIndex and/or LangChain. Dragging in either or both of those packages seems like a very heavy dependency to add just to get a generic text chunker. Do we already have those dependencies? I thought the dependencies flowed the other way in the existing integrations, but I don't really know.
-
I have provided a new iteration on top of your last changes with #310 — the notebook can be directly viewed here. As you can see there, the main improvements are in:
🐛 I think a bug has been (and still is) there though: perhaps some of the processing/merging/splitting steps are not properly taking into account the headings/captions — or the tokens added with …
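If that is the cause, one hedged sketch of a fix is to count tokens on the contextualized text rather than the chunk body alone; the `tokenizer` and the serialization format below are assumptions, not the notebook's actual code:

```python
def contextualized_token_count(chunk, tokenizer):
    # count tokens on what actually goes to the embed model, i.e. including
    # headings/captions, not just the chunk body
    parts = (chunk.meta.headings or []) + (chunk.meta.captions or []) + [chunk.text]
    return len(tokenizer.tokenize("\n".join(parts)))
```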
-
Hi. Sorry for the slow reply. It has been a crazy week. Here are my thoughts:
-
Thanks for the feedback! I have now merged all changes into a single branch to simplify things:
Some more details:
Actually, I do agree we should keep …
I am taking a step beyond (1) and expanding the chunks with explicit … FYI: this differentiation between what is passed to the embedding model vs. what is passed to the generation model is also present in frameworks like LlamaIndex — albeit in a different fashion (ref), so integration may also require some further changes.
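To make the distinction concrete, a minimal sketch of the two serializations (function names and metadata layout are illustrative assumptions, not the merged branch's actual API):

```python
def text_for_embedding(chunk):
    # the embed model sees the contextualized text: headings prepended to the body
    context = "\n".join(chunk.meta.headings or [])
    return f"{context}\n{chunk.text}" if context else chunk.text

def text_for_generation(chunk):
    # the gen model can receive body and provenance separately, e.g. for grounded citations
    return {"text": chunk.text, "headings": chunk.meta.headings}
```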
-
I would like to point to the semantic chunkers from Aurelio AI. More details: https://github.com/aurelio-labs/semantic-chunkers/blob/main/docs/02-chunkers-async.ipynb
-
📣 Happy to announce that, as a result of this discussion, starting with docling 2.9.0 (or docling-core 2.8.0), Docling provides an additional chunker implementation, called HybridChunker …
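A minimal usage sketch, based on the docling docs at the time (the model ID, sample URL, and parameter values are illustrative; check the docs for the current signature):

```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("https://arxiv.org/pdf/2408.09869").document
chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",  # or a HF tokenizer instance
    max_tokens=512,    # defaults to the tokenizer's limit if omitted
    merge_peers=True,  # merge undersized sibling chunks back together
)
for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text)
```

This combines the hierarchical, metadata-aware pass discussed in this thread with the tokenization-aware splitting and peer merging from the notebooks.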
-
In terms of chunking approaches, there are various options one can consider, e.g. fixed-size chunking, document-based chunking, and others (example outline here).
Docling currently provides the HierarchicalChunker, which follows a document-based approach, i.e. it splits as dictated by the upstream document format. At the same time, it exposes various metadata that the user can include as additional context for the embedding or generation model — and also use as a source of grounding.
The exact metadata to be included in the final text that is input to the (embedding or generation) model is application-dependent and is therefore not prescribed by the HierarchicalChunker.
As an illustrative example of such post-processing steps, we have prepared an example that shows how to introduce a max-token limit — and split chunks that exceed it:
👉 notebook here
Of course, if the user is already using an LLM application framework like LlamaIndex or LangChain, they can also tap into the wide range of node parsers/splitters and post-processing components already available in those libraries — as already showcased in our examples (LlamaIndex here & LangChain here).
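For reference, a minimal sketch of the document-based approach with HierarchicalChunker and its metadata (the sample URL is illustrative; how headings are serialized into the final text is left to the application):

```python
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

doc = DocumentConverter().convert("https://arxiv.org/pdf/2408.09869").document
for chunk in HierarchicalChunker().chunk(dl_doc=doc):
    headings = chunk.meta.headings or []  # metadata usable as context and grounding
    print(" / ".join(headings), "->", chunk.text[:80])
```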