[Question]: Adjusting Metadata Handling for Chunk Splitting in Node Parsers #17373

SaarBarak · 2024-12-26T12:03:50Z

Question Validation

I have searched both the documentation and discord for an answer.

Question

My goal is to embed a short metadata string for the index while providing a longer metadata string visible to the LLM. I'm trying to achieve this using excluded_llm_metadata_keys and excluded_embed_metadata_keys. However, I'm struggling to implement this because all metadata-aware splitters currently prioritize the longest metadata string for splitting.

Here's the relevant implementation from the metadata-aware splitter:

# from core.node_parser.interface import _get_metadata_str

def _get_metadata_str(self, node: BaseNode) -> str:
    """Helper function to get the proper metadata str for splitting."""
    embed_metadata_str = node.get_metadata_str(mode=MetadataMode.EMBED)
    llm_metadata_str = node.get_metadata_str(mode=MetadataMode.LLM)

    # use the longest metadata str for splitting
    if len(embed_metadata_str) > len(llm_metadata_str):
        metadata_str = embed_metadata_str
    else:
        metadata_str = llm_metadata_str

    return metadata_str

As a result, when I use a splitter with a small chunk size, I encounter the following error:

ValueError: Metadata length (1720) is longer than chunk size (512). 
Consider increasing the chunk size or decreasing the size of your metadata to avoid this.

Question:
How can I modify the implementation to allow the splitter to prioritize the shorter metadata string (e.g., the embed metadata) for chunk size calculations, while still using the longer metadata string for the LLM? Any guidance on handling this more effectively would be appreciated.


dosubot · 2024-12-26T12:05:04Z

@SaarBarak Hello there! I'm here to help you with any bugs, questions, or contributions. Let's tackle this issue together.

To modify the implementation so that the splitter prioritizes the shorter metadata string for chunk size calculations while still using the longer metadata string for the LLM, you can adjust the _get_metadata_str method in the MetadataAwareTextSplitter class. Here's how you can do it:

def _get_metadata_str(self, node: BaseNode) -> str:
    """Helper function to get the proper metadata str for splitting."""
    embed_metadata_str = node.get_metadata_str(mode=MetadataMode.EMBED)
    llm_metadata_str = node.get_metadata_str(mode=MetadataMode.LLM)

    # use the shortest metadata str for splitting
    if len(embed_metadata_str) < len(llm_metadata_str):
        metadata_str = embed_metadata_str
    else:
        metadata_str = llm_metadata_str

    return metadata_str

This change ensures that the shorter metadata string is used for chunk size calculations, which should help avoid the error you're encountering [1].
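To make the effect of that change concrete, here is a minimal, self-contained sketch of the arithmetic involved. It does not use the library: `SketchNode`, `metadata_str_for_splitting`, and `effective_chunk_size` are hypothetical stand-ins, the mode names are plain strings rather than `MetadataMode`, and length is counted in whitespace-separated words, which is an assumption and may differ from how the real splitter measures metadata.

```python
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    """Hypothetical stand-in for a node; NOT the library's BaseNode."""

    metadata: dict
    excluded_embed_metadata_keys: list = field(default_factory=list)
    excluded_llm_metadata_keys: list = field(default_factory=list)

    def get_metadata_str(self, mode: str) -> str:
        # Drop the keys excluded for the requested mode ("embed" or "llm").
        excluded = (
            self.excluded_embed_metadata_keys
            if mode == "embed"
            else self.excluded_llm_metadata_keys
        )
        return "\n".join(
            f"{key}: {value}"
            for key, value in self.metadata.items()
            if key not in excluded
        )


def metadata_str_for_splitting(node: SketchNode) -> str:
    """Return the shorter of the embed and LLM metadata strings."""
    embed_str = node.get_metadata_str("embed")
    llm_str = node.get_metadata_str("llm")
    return min(embed_str, llm_str, key=len)


def effective_chunk_size(node: SketchNode, chunk_size: int) -> int:
    """Space left for text after reserving room for the chosen metadata."""
    # Assumption: one "token" is one whitespace-separated word.
    metadata_len = len(metadata_str_for_splitting(node).split())
    remaining = chunk_size - metadata_len
    if remaining <= 0:
        raise ValueError(
            f"Metadata length ({metadata_len}) is longer than "
            f"chunk size ({chunk_size})."
        )
    return remaining


if __name__ == "__main__":
    node = SketchNode(
        metadata={"title": "Report", "summary": "word " * 600},
        excluded_embed_metadata_keys=["summary"],  # keep embed metadata short
    )
    print(metadata_str_for_splitting(node))
    print(effective_chunk_size(node, chunk_size=512))
```

One consequence worth keeping in mind: because the budget is computed from the shorter string, a chunk combined with the longer LLM metadata can end up larger than the configured chunk size. If you try this against the real splitter, overriding `_get_metadata_str` in your own subclass rather than editing the library source is one possible route, though that approach is not confirmed in this thread.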


SaarBarak added the question Further information is requested label Dec 26, 2024
