[Question]: Adjusting Metadata Handling for Chunk Splitting in Node Parsers #17373

Open
1 task done
SaarBarak opened this issue Dec 26, 2024 · 1 comment
Labels
question Further information is requested

Comments

@SaarBarak

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

My goal is to embed a short metadata string for the index while providing a longer metadata string visible to the LLM. I'm trying to achieve this using excluded_llm_metadata_keys and excluded_embed_metadata_keys. However, I'm struggling to implement this because all metadata-aware splitters currently prioritize the longest metadata string for splitting.

Here's the relevant implementation from the metadata-aware splitter:

# Source: core/node_parser/interface.py (method of MetadataAwareTextSplitter)

def _get_metadata_str(self, node: BaseNode) -> str:
    """Helper function to get the proper metadata str for splitting."""
    embed_metadata_str = node.get_metadata_str(mode=MetadataMode.EMBED)
    llm_metadata_str = node.get_metadata_str(mode=MetadataMode.LLM)

    # use the longest metadata str for splitting
    if len(embed_metadata_str) > len(llm_metadata_str):
        metadata_str = embed_metadata_str
    else:
        metadata_str = llm_metadata_str

    return metadata_str

As a result, when I use a splitter with a small chunk size, I encounter the following error:

ValueError: Metadata length (1720) is longer than chunk size (512). 
Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
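
For context, this error appears to come from the splitter subtracting the metadata length from the chunk size before it splits the text. A rough sketch of that arithmetic follows, assuming both lengths are measured in the same units and ignoring any formatting overhead the library may add. `effective_chunk_size` is a hypothetical helper written for illustration, not a LlamaIndex function.

```python
def effective_chunk_size(chunk_size: int, metadata_len: int) -> int:
    """Return the room left for text once the metadata is accounted for.

    Hypothetical helper: it only mirrors the check implied by the error
    message above and is not part of the library.
    """
    remaining = chunk_size - metadata_len
    if remaining <= 0:
        raise ValueError(
            f"Metadata length ({metadata_len}) is longer than chunk size "
            f"({chunk_size}). Consider increasing the chunk size or "
            "decreasing the size of your metadata to avoid this."
        )
    return remaining


# A short metadata string leaves most of the chunk for text.
room_with_short_metadata = effective_chunk_size(512, 100)
```

With the values from the error above (chunk size 512, metadata length 1720), the helper raises the same `ValueError`, because no room is left for text.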

Question:
How can I modify the implementation to allow the splitter to prioritize the shorter metadata string (e.g., the embed metadata) for chunk size calculations, while still using the longer metadata string for the LLM? Any guidance on handling this more effectively would be appreciated.

SaarBarak added the question label Dec 26, 2024

dosubot bot commented Dec 26, 2024

@SaarBarak Hello there! I'm here to help you with any bugs, questions, or contributions. Let's tackle this issue together.

To modify the implementation so that the splitter prioritizes the shorter metadata string for chunk size calculations while still using the longer metadata string for the LLM, you can adjust the _get_metadata_str method in the MetadataAwareTextSplitter class. Here's how you can do it:

def _get_metadata_str(self, node: BaseNode) -> str:
    """Helper function to get the proper metadata str for splitting."""
    embed_metadata_str = node.get_metadata_str(mode=MetadataMode.EMBED)
    llm_metadata_str = node.get_metadata_str(mode=MetadataMode.LLM)

    # use the shortest metadata str for splitting
    if len(embed_metadata_str) < len(llm_metadata_str):
        metadata_str = embed_metadata_str
    else:
        metadata_str = llm_metadata_str

    return metadata_str

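This change ensures that the shorter metadata string is used for chunk size calculations, which should help avoid the error you're encountering [1]. Note the trade-off: chunks sized against the shorter string can exceed the configured chunk size once the longer LLM metadata is attached.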
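
The override above can be exercised without LlamaIndex installed. The sketch below is a minimal illustration under stated assumptions: `MetadataMode` and `StubNode` are hypothetical stand-ins that only imitate the `get_metadata_str(mode=...)` call used in the snippet, and `ShortestMetadataMixin` is an illustrative name, not a library class.

```python
from dataclasses import dataclass, field
from enum import Enum


class MetadataMode(Enum):
    # Stand-in for the library's MetadataMode; only the two modes used here.
    EMBED = "embed"
    LLM = "llm"


@dataclass
class StubNode:
    # Hypothetical stand-in for BaseNode, just enough to build metadata strings.
    metadata: dict
    excluded_embed_metadata_keys: list = field(default_factory=list)
    excluded_llm_metadata_keys: list = field(default_factory=list)

    def get_metadata_str(self, mode: MetadataMode) -> str:
        excluded = (
            self.excluded_embed_metadata_keys
            if mode is MetadataMode.EMBED
            else self.excluded_llm_metadata_keys
        )
        return "\n".join(
            f"{key}: {value}"
            for key, value in self.metadata.items()
            if key not in excluded
        )


class ShortestMetadataMixin:
    """Illustrative override that sizes chunks against the shorter string."""

    def _get_metadata_str(self, node) -> str:
        embed_str = node.get_metadata_str(mode=MetadataMode.EMBED)
        llm_str = node.get_metadata_str(mode=MetadataMode.LLM)
        # min() keeps the first argument on a tie, so equal lengths
        # resolve to the embed string here.
        return min(embed_str, llm_str, key=len)


# The long "summary" is hidden from the embedding but visible to the LLM.
node = StubNode(
    metadata={"title": "Q3 report", "summary": "x" * 1700},
    excluded_embed_metadata_keys=["summary"],
)
chosen = ShortestMetadataMixin()._get_metadata_str(node)
```

With the real library, the same method body would presumably go on a subclass of one of the metadata-aware splitters rather than being edited into the installed package, so the change survives upgrades.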
