Commit 237d8fa

Chunking overview: remove usage note about not setting combine text >= new after n characters (#662)
1 parent 3750df0 commit 237d8fa

1 file changed: +21 -18 lines changed

ui/chunking.mdx

Lines changed: 21 additions & 18 deletions
@@ -79,17 +79,23 @@ This strategy does not use section boundaries, page boundaries, or content simil
 the chunks' contents.
 
 The following diagram illustrates conceptually how a candidate element is chunked to fit within the max characters (hard) and
-new after n characters (soft) limits:
+new after n characters (soft) limits.
+
+- In scenario 1, the candidate element exceeds the hard limit, and so the candidate element will become the first element in the next chunk.
+- In scenario 2, the first candidate element exceeds the soft limit but remains within the hard limit. Because the second candidate element begins
+after the soft limit has been reached, the second candidate element will become the first element in the next chunk.
+- In scenario 3, the first two candidate elements exceed the soft limit but remain within the hard limit. Even though the third candidate element
+remains within the hard limit, because it begins after the soft limit has been reached, the third candidate element will become the first element in the next chunk.
 
 ![Chunking with hard and soft limits](/img/chunking/Chunking_Soft_Hard_Limits.png)
 
-The following two diagrams show how a basic chunking strategy with a max characters setting of 200 would chunk the following text and table elements.
+The following two conceptual diagrams show how a basic chunking strategy with a max characters setting of 200 would chunk the following text and table elements.
 
-In this first diagram, each chunk of text gets as close as possible to the 200-character hard limit without going over, and lexical constructs such as sentence endings are not recognized:
+In this first conceptual diagram, each chunk of text gets as close as possible to the 200-character hard limit without going over, and lexical constructs such as sentence endings are not recognized:
 
 ![Basic chunking of text with a 200-character hard limit](/img/chunking/Chunk-By-Character-200-Paragraph.png)
 
-In this second diagram, each chunk for the table also gets as close as possible to the 200-character hard limit without going over. However, for tables,
+In this second conceptual diagram, each chunk for the table also gets as close as possible to the 200-character hard limit without going over. However, for tables,
 row endings are also considered in determining chunk boundaries. For this table, the first chunk is close to the 200-character hard limit and also a row ending.
 The second chunk is well short of the 200-character hard limit because of a row (and, in this case, also the table) ending:
 
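
Reviewer note: the three scenarios added in this hunk can be read as a greedy packing rule. The following is a minimal sketch of that rule, not the library's implementation. The function name `pack_elements` and its parameters are hypothetical, elements are modeled as plain strings, and separator characters between elements are ignored for simplicity.

```python
from typing import List


def pack_elements(
    texts: List[str], max_characters: int, new_after_n_chars: int
) -> List[List[str]]:
    """Greedily pack element texts into chunks (hypothetical sketch).

    A chunk is closed before a candidate element when either:
    - the chunk has already reached the soft limit (new_after_n_chars), or
    - adding the candidate would exceed the hard limit (max_characters).
    Elements longer than the hard limit are split purely by character count.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    size = 0
    for text in texts:
        soft_reached = size >= new_after_n_chars
        hard_exceeded = size + len(text) > max_characters
        if current and (soft_reached or hard_exceeded):
            chunks.append(current)
            current, size = [], 0
        if len(text) > max_characters:
            # Oversized element: cut at the hard limit, ignoring sentence ends.
            for start in range(0, len(text), max_characters):
                chunks.append([text[start:start + max_characters]])
            continue
        current.append(text)
        size += len(text)
    if current:
        chunks.append(current)
    return chunks


# Scenario 3 from the text: two elements pass the soft limit (60) together,
# so the third starts a new chunk even though it would fit under 100.
scenario_3 = pack_elements(["a" * 30, "b" * 35, "c" * 10], 100, 60)
```

Under these assumptions, `scenario_3` holds two chunks: the first with the 30- and 35-character elements, the second with the 10-character element.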

@@ -102,10 +108,10 @@ By default, overlap all is applied only to relatively large elements. If overlap
 The overlap setting is based on the number of characters, so words might be split.
 The overlap setting's character count is included in the chunk size; nonetheless, the chunk's total size must not exceed the max characters setting.
 
-The following diagram illustrates conceptually how chunks are calculated by setting overlap all to true or false. In this diagram,
-setting overlap all to true results in a portion at the end of each chunk always being copied over to the beginning of the next chunk.
-By default (or by setting overlap all to false) results in only a portion at the end of Element 6 Part 1 in Chunk 2 being copied over
-to the beginning of Element 6 Part 2 in Chunk 3, because Element 6 is larger than the max characters setting:
+The following conceptual diagram illustrates how chunks are calculated by setting overlap all to true or false. In this diagram, setting overlap all to true results in a portion at the end of each chunk always being copied over to the beginning of the next chunk.
+
+By default (or by setting overlap all to false), only a portion at the end of Element 6 Part 1 in Chunk 2 is copied over
+to the beginning of Element 6 Part 2 in Chunk 3, because Element 6 is larger than the max characters setting.
 
 ![Chunking with overlap all set to true or false](/img/chunking/Chunking_Overlap_All.png)
 
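
Reviewer note: the default overlap behavior described in this hunk (overlap applied only between the parts of an element that is larger than max characters) might be sketched as follows. This is an illustration under stated assumptions, not the library's code: `split_with_overlap` is a hypothetical helper, and it assumes `overlap` is smaller than `max_characters`.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int) -> List[str]:
    """Split one oversized text into parts of at most max_characters.

    The last `overlap` characters of each part are repeated at the start of
    the next part, so the overlap counts toward each part's size. Splitting
    is by character count only, so words might be cut.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and below max_characters")
    step = max_characters - overlap
    parts: List[str] = []
    start = 0
    while start < len(text):
        parts.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this part already reaches the end of the text
        start += step
    return parts


parts = split_with_overlap("abcdefghij", max_characters=4, overlap=1)
# parts == ["abcd", "defg", "ghij"]
```

Each part stays within the hard limit, and each part after the first begins with the final character of the part before it. Applying the same tail-copy between every pair of adjacent chunks would correspond to overlap all set to true.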

@@ -122,30 +128,27 @@ The by-title chunking strategy attempts to preserve section boundaries when dete
 a **Title** element is encountered. The title is used as the section header for the chunk. The max characters and new after n
 characters settings are still respected.
 
-The following diagram illustrates conceptually how elements are chunked when **Title** elements are encountered (see
+The following conceptual diagram illustrates how elements are chunked when **Title** elements are encountered (see
 Chunks 1, 4, and 6), while still respecting the max characters and new after n characters settings (see Chunks 2 and 3):
 
 ![Chunking by title](/img/chunking/Chunking_By_Title.png)
 
 A single chunk should not contain text that occurred in two different sections. When a new section starts, the existing
 chunk is closed and a new one is started, even if the next element would fit in the prior chunk.
 
-The following conceptual diagram illustrates this point, in that many **Title** elements can produce many relatively small chunks:
+The following conceptual diagram illustrates this point, in that many **Title** elements can produce many relatively small chunks.
 
 ![Many titles can lead to many chunks by title](/img/chunking/Chunking_By_Title_Segmentation.png)
 
 To reduce the number of chunks, you can use the [combine text under n characters](#combine-text-under-n-characters-setting) setting. This
-settings attempts to combine elements into a single chunk until the combine text under n characters limit is reached, as shown in the
-following conceptual diagram:
+setting attempts to combine elements into a single chunk until the combine text under n characters limit is reached, as shown in the
+following conceptual diagram. In this case, multiple **Title** elements are combined into a single chunk. However, when the
+combine text under n characters limit is reached, the chunk is closed and a new one is started. In any case, the new chunk must start with a **Title** element.
+For instance, if Element 3 exceeded the combine text under n characters limit, the chunk would be closed and a new one would be started, beginning
+with Title 2, followed by Element 3.
 
 ![Chunking with combine text under n characters](/img/chunking/Chunking_Combine_Text.png)
 
-Setting combine text under n characters to a value equal to or greater than the new after n characters setting is not recommended, as it
-can result in substantially longer chunks overall and also pushing titles by themselves into previous chunks. The following conceptual
-diagram illustrates this point:
-
-![Chunking with combine text under n characters issue](/img/chunking/Chunking_Combine_Text_Limits.png)
-
 The following diagram shows how a chunk by title strategy with a max characters setting of 200 would chunk the following text.
 Although the first chunk is close to the 200-character hard limit, the second chunk is well short of this limit due to encountering the
 title immediately after it, which starts a new chunk:
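
Reviewer note: the by-title behavior and the combine text under n characters setting described in this hunk could be sketched roughly as below. This is a simplified model, not the library's algorithm. Elements are hypothetical `(kind, text)` tuples, the names `split_sections` and `sketch_by_title` are made up for illustration, the hard and soft limits are left out, and the merge rule (merge a whole section only while the combined size stays within the limit) is an assumption chosen so that every new chunk begins with a **Title**.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (kind, text), where kind is "Title" or "Text"


def split_sections(elements: List[Element]) -> List[List[Element]]:
    """Start a new section at every Title element."""
    sections: List[List[Element]] = []
    for kind, text in elements:
        if kind == "Title" or not sections:
            sections.append([])
        sections[-1].append((kind, text))
    return sections


def sketch_by_title(
    elements: List[Element], combine_text_under_n_chars: int = 0
) -> List[List[Element]]:
    """Emit one chunk per section, optionally merging small sections.

    A section is appended to the previous chunk only while the combined
    character count stays within combine_text_under_n_chars; otherwise a
    new chunk is started at that section's Title.
    """
    chunks: List[List[Element]] = []
    for section in split_sections(elements):
        section_size = sum(len(text) for _, text in section)
        if chunks and combine_text_under_n_chars > 0:
            previous_size = sum(len(text) for _, text in chunks[-1])
            if previous_size + section_size <= combine_text_under_n_chars:
                chunks[-1].extend(section)
                continue
        chunks.append(list(section))
    return chunks


doc = [
    ("Title", "T1"), ("Text", "aaaa"),
    ("Title", "T2"), ("Text", "bbbb"),
    ("Title", "T3"), ("Text", "c" * 20),
]
separate = sketch_by_title(doc)                               # 3 chunks
combined = sketch_by_title(doc, combine_text_under_n_chars=15)  # 2 chunks
```

With the setting at 15, the first two sections (12 characters in total) share a chunk, while the third section is too large to merge and opens a new chunk starting at its title.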
