Question Validation
I have searched both the documentation and Discord for an answer.
Question
My goal is to embed a short metadata string for the index while providing a longer metadata string visible to the LLM. I'm trying to achieve this using excluded_llm_metadata_keys and excluded_embed_metadata_keys. However, I'm struggling to implement this because all metadata-aware splitters currently prioritize the longest metadata string for splitting.
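For context, the intended split can be illustrated in plain Python (a sketch: `SimpleNamespace` stands in for a real llama_index `Document`, the metadata values are invented, and `metadata_str` only roughly mirrors what `get_metadata_str` does):

```python
from types import SimpleNamespace

# Hypothetical stand-in for a llama_index Document; the attribute names
# mirror the real ones, but the metadata values here are made up.
doc = SimpleNamespace(
    metadata={
        "file_name": "report.pdf",
        "summary": "A long, LLM-only synopsis of the document. " * 5,
    },
    excluded_embed_metadata_keys=["summary"],  # keep the embed metadata short
    excluded_llm_metadata_keys=[],             # the LLM still sees everything
)

def metadata_str(doc, excluded):
    # Roughly what get_metadata_str does: join the non-excluded pairs.
    return "\n".join(
        f"{key}: {value}"
        for key, value in doc.metadata.items()
        if key not in excluded
    )

embed_str = metadata_str(doc, doc.excluded_embed_metadata_keys)
llm_str = metadata_str(doc, doc.excluded_llm_metadata_keys)
# embed_str omits "summary", so it is much shorter than llm_str
```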
Here's the relevant implementation from the metadata-aware splitter:
```python
# from core.node_parser.interface

def _get_metadata_str(self, node: BaseNode) -> str:
    """Helper function to get the proper metadata str for splitting."""
    embed_metadata_str = node.get_metadata_str(mode=MetadataMode.EMBED)
    llm_metadata_str = node.get_metadata_str(mode=MetadataMode.LLM)

    # use the longest metadata str for splitting
    if len(embed_metadata_str) > len(llm_metadata_str):
        metadata_str = embed_metadata_str
    else:
        metadata_str = llm_metadata_str

    return metadata_str
```
As a result, when I use a splitter with a small chunk size, I encounter the following error:
```
ValueError: Metadata length (1720) is longer than chunk size (512).
Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
```
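The check that raises this error can be illustrated in plain Python (a simplification: the real splitter measures metadata length in tokens via its tokenizer rather than characters, and `effective_chunk_size` is an invented name):

```python
def effective_chunk_size(chunk_size: int, metadata_str: str) -> int:
    # The splitter reserves room for the metadata inside every chunk,
    # so the metadata itself must fit within chunk_size.
    metadata_len = len(metadata_str)
    if metadata_len > chunk_size:
        raise ValueError(
            f"Metadata length ({metadata_len}) is longer than chunk size "
            f"({chunk_size}). Consider increasing the chunk size or "
            "decreasing the size of your metadata to avoid this."
        )
    # Whatever is left over is the budget available for the actual text.
    return chunk_size - metadata_len
```

Because the longer of the two metadata strings is the one measured, a 1720-character LLM metadata string trips this check even when the embed metadata alone would fit comfortably.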
Question:
How can I modify the implementation to allow the splitter to prioritize the shorter metadata string (e.g., the embed metadata) for chunk size calculations, while still using the longer metadata string for the LLM? Any guidance on handling this more effectively would be appreciated.
@SaarBarak Hello there! I'm here to help you with any bugs, questions, or contributions. Let's tackle this issue together.
To modify the implementation so that the splitter prioritizes the shorter metadata string for chunk size calculations while still using the longer metadata string for the LLM, you can adjust the _get_metadata_str method in the MetadataAwareTextSplitter class. Here's how you can do it:
```python
def _get_metadata_str(self, node: BaseNode) -> str:
    """Helper function to get the proper metadata str for splitting."""
    embed_metadata_str = node.get_metadata_str(mode=MetadataMode.EMBED)
    llm_metadata_str = node.get_metadata_str(mode=MetadataMode.LLM)

    # use the shortest metadata str for splitting
    if len(embed_metadata_str) < len(llm_metadata_str):
        metadata_str = embed_metadata_str
    else:
        metadata_str = llm_metadata_str

    return metadata_str
```
This change ensures that the shorter metadata string is used for chunk size calculations, which should help avoid the error you're encountering [1].
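If you would rather not patch the installed package, the same override can live in a subclass of your splitter that redefines `_get_metadata_str`. The selection logic itself is easy to verify in isolation (a sketch: `DummyNode` and the standalone function below are illustrative stand-ins, not llama_index APIs):

```python
class DummyNode:
    """Minimal stand-in for BaseNode, just enough for this demo."""

    def __init__(self, embed_str: str, llm_str: str):
        self._strs = {"embed": embed_str, "llm": llm_str}

    def get_metadata_str(self, mode: str) -> str:
        return self._strs[mode]


def get_shortest_metadata_str(node) -> str:
    # The proposed override: pick the shorter string for splitting,
    # so the long LLM-only metadata never triggers the chunk-size check.
    embed_metadata_str = node.get_metadata_str("embed")
    llm_metadata_str = node.get_metadata_str("llm")
    if len(embed_metadata_str) < len(llm_metadata_str):
        return embed_metadata_str
    return llm_metadata_str


node = DummyNode(
    embed_str="title: Q3 report",
    llm_str="title: Q3 report\nsummary: " + "x" * 1700,
)
# The short embed metadata is chosen for splitting.
```

One caveat worth noting: chunks sized against the short embed metadata may still exceed the chunk size once the longer LLM metadata is prepended at query time, so the LLM prompt can end up larger than `chunk_size`.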