ui/chunking.mdx
Lines changed: 21 additions & 18 deletions
Original file line number
Diff line number
Diff line change
@@ -79,17 +79,23 @@ This strategy does not use section boundaries, page boundaries, or content simil
79
79
the chunks' contents.
80
80
81
81
The following diagram illustrates conceptually how a candidate element is chunked to fit within the max characters (hard) and
82
-
new after n characters (soft) limits:
82
+
new after n characters (soft) limits.
83
+
84
+
- In scenario 1, the candidate element exceeds the hard limit, and so the candidate element will become the first element in the next chunk.
85
+
- In scenario 2, the first candidate element exceeds the soft limit but remains within the hard limit. Because the second candidate element begins
86
+
after the soft limit has been reached, the second candidate element will become the first element in the next chunk.
87
+
- In scenario 3, the first two candidate elements exceed the soft limit but remain within the hard limit. Even though the third candidate element
88
+
remains within the hard limit, because it begins after the soft limit has been reached, the third candidate element will become the first element in the next chunk.
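
The three scenarios above can be sketched in a few lines of Python. This is only an illustration of the soft-limit and hard-limit rule as described here, not the product's implementation: the function name `chunk_elements_sketch`, its parameters, and the `"\n\n"` separator are assumptions, and the sketch assumes that no single element is larger than the hard limit.

```python
def chunk_elements_sketch(elements, max_characters, new_after_n_chars, separator="\n\n"):
    """Group element texts into chunks using a soft and a hard limit (hypothetical helper)."""
    chunks = []
    current = []      # element texts in the chunk being built
    current_len = 0   # character count of the chunk being built, separators included

    for text in elements:
        added_len = len(text) + (len(separator) if current else 0)
        # Soft limit: once the open chunk has reached it, the next element starts a new chunk.
        soft_reached = bool(current) and current_len >= new_after_n_chars
        # Hard limit: the next element must not push the open chunk past it.
        would_exceed_hard = bool(current) and current_len + added_len > max_characters

        if soft_reached or would_exceed_hard:
            chunks.append(separator.join(current))
            current, current_len = [], 0
            added_len = len(text)

        current.append(text)
        current_len += added_len

    if current:
        chunks.append(separator.join(current))
    return chunks
```

For example, with a hard limit of 200 and a soft limit of 100, three 60-character elements yield two chunks: the first two elements together pass the soft limit, so the third starts the next chunk, as in scenario 3.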
83
89
84
90

85
91
86
-
The following two diagrams show how a basic chunking strategy with a max characters setting of 200 would chunk the following text and table elements.
92
+
The following two conceptual diagrams show how a basic chunking strategy with a max characters setting of 200 would chunk the following text and table elements.
87
93
88
-
In this first diagram, each chunk of text gets as close as possible to the 200-character hard limit without going over, and lexical constructs such as sentence endings are not recognized:
94
+
In this first conceptual diagram, each chunk of text gets as close as possible to the 200-character hard limit without going over, and lexical constructs such as sentence endings are not recognized:
89
95
90
96

91
97
92
-
In this second diagram, each chunk for the table also gets as close as possible to the 200-character hard limit without going over. However, for tables,
98
+
In this second conceptual diagram, each chunk for the table also gets as close as possible to the 200-character hard limit without going over. However, for tables,
93
99
row endings are also considered in determining chunk boundaries. For this table, the first chunk is close to the 200-character hard limit and also a row ending.
94
100
The second chunk is well short of the 200-character hard limit because of a row (and, in this case, also the table) ending:
95
101
@@ -102,10 +108,10 @@ By default, overlap all is applied only to relatively large elements. If overlap
102
108
The overlap setting is based on the number of characters, so words might be split.
103
109
The overlap setting's character count is included in the chunk size; nonetheless, the chunk's total size must not exceed the max characters setting.
104
110
105
-
The following diagram illustrates conceptually how chunks are calculated by setting overlap all to true or false. In this diagram,
106
-
setting overlap all to true results in a portion at the end of each chunk always being copied over to the beginning of the next chunk.
107
-
By default (or by setting overalp all to false) results in only a portion at the end of Element 6 Part 1 in Chunk 2 being copied over
108
-
to the beginning of Element 6 Part 2 in Chunk 3, because Element 6 is larger than the max characters setting:
111
+
The following conceptual diagram illustrates how chunks are calculated by setting overlap all to true or false. In this diagram, setting overlap all to true results in a portion at the end of each chunk always being copied over to the beginning of the next chunk.
112
+
113
+
By default (or by setting overlap all to false), only a portion at the end of Element 6 Part 1 in Chunk 2 is copied over
114
+
to the beginning of Element 6 Part 2 in Chunk 3, because Element 6 is larger than the max characters setting.
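
The default behavior, where an element larger than the max characters setting is split and each part begins with the tail of the previous part, can be sketched as follows. This is a simplified illustration rather than the actual implementation: the function name `split_with_overlap` and its parameters are assumptions, and the split is purely character-based, so words might be cut.

```python
def split_with_overlap(text, max_characters, overlap):
    """Split one oversized text into parts of at most max_characters (hypothetical helper).

    Each part after the first starts with the last `overlap` characters of the
    previous part, so the overlap counts toward the part's size.
    """
    if overlap < 0 or overlap >= max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    if not text:
        return []

    step = max_characters - overlap  # how far the window advances each time
    parts = []
    start = 0
    while True:
        parts.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break
        start += step
    return parts
```

Because the window advances by `max_characters - overlap`, every part stays within the max characters setting even though it repeats the end of the part before it.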
109
115
110
116

111
117
@@ -122,30 +128,27 @@ The by-title chunking strategy attempts to preserve section boundaries when dete
122
128
a **Title** element is encountered. The title is used as the section header for the chunk. The max characters and new after n
123
129
characters settings are still respected.
124
130
125
-
The following diagram illustrates conceptually how elements are chunked when **Title** elements are encountered (see
131
+
The following conceptual diagram illustrates how elements are chunked when **Title** elements are encountered (see
126
132
Chunks 1, 4, and 6), while still respecting the max characters and new after n characters settings (see Chunks 2 and 3):
127
133
128
134

129
135
130
136
A single chunk should not contain text that occurred in two different sections. When a new section starts, the existing
131
137
chunk is closed and a new one is started, even if the next element would fit in the prior chunk.
132
138
133
-
The following conceptual diagram illustrates this point, in that many **Title** elements can produce many relatively small chunks:
139
+
The following conceptual diagram illustrates this point, in that many **Title** elements can produce many relatively small chunks.
134
140
135
141

136
142
137
143
To reduce the number of chunks, you can use the [combine text under n characters](#combine-text-under-n-characters-setting) setting. This
138
-
settings attempts to combine elements into a single chunk until the combine text under n characters limit is reached, as shown in the
139
-
following conceptual diagram:
144
+
setting attempts to combine elements into a single chunk until the combine text under n characters limit is reached, as shown in the
145
+
following conceptual diagram. In this case, multiple **Title** elements are combined into a single chunk. However, when the
146
+
combine text under n characters limit is reached, the chunk is closed and a new one is started. In any case, the new chunk must start with a **Title** element.
147
+
For instance, if Element 3 exceeded the combine text under n characters limit, the chunk would be closed and a new one would be started, beginning
148
+
with Title 2, followed by Element 3.
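
The combining behavior described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the product's implementation: elements are modeled as hypothetical `(category, text)` tuples, the helper names and the `"\n\n"` separator are invented for this example, and the max characters and new after n characters limits are left out for brevity.

```python
def sections_by_title(elements):
    """Group (category, text) tuples into sections, starting a new one at each Title."""
    sections = []
    for category, text in elements:
        if category == "Title" or not sections:
            sections.append([])
        sections[-1].append(text)
    return sections


def combine_small_sections(sections, combine_under_n_chars, separator="\n\n"):
    """Merge consecutive sections while the merged text stays within the limit."""
    chunks = []
    current = ""
    for section in sections:
        section_text = separator.join(section)
        candidate = section_text if not current else current + separator + section_text
        if current and len(candidate) > combine_under_n_chars:
            # Limit reached: close the open chunk; the new chunk starts with this
            # section, and therefore with its Title element.
            chunks.append(current)
            current = section_text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, sections are only ever combined whole, which is why a new chunk always begins at a **Title** element. A section that is itself longer than the limit still becomes its own chunk.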
140
149
141
150

142
151
143
-
Setting combine text under n characters to a value equal to or greater than the new after n characters setting is not recommended, as it
144
-
can result in substantially longer chunks overall and also pushing titles by themselves into previous chunks. The following conceptual
145
-
diagram illustrates this point:
146
-
147
-

148
-
149
152
The following diagram shows how a chunk by title strategy with a max characters setting of 200 would chunk the following text.
150
153
Although the first chunk is close to the 200-character hard limit, the second chunk is well short of this limit due to encountering the
151
154
title immediately after it, which starts a new chunk: