Ensuring original text is preserved in CHUNKING_REGEX #18054

tylerganter · 2025-03-08T02:34:35Z

Description

The original default CHUNKING_REGEX did not preserve the original text: runs of repeated characters from ,.;。？！ were reduced to a single instance. For example, [text](../path/to/file.md) could be reconstructed as [text](./path/to/file.md), which no longer matches the original text. As a result, SentenceSplitter.start_char_idx ends up as None, because the chunk text cannot be found in the source document text.
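A minimal sketch of the failure mode described above, using only the standard library. It assumes the chunk's start offset is located by a plain substring search; `str.find` stands in for that lookup here and is not SentenceSplitter's actual code.

```python
import re

# The original default pattern quoted in this PR.
ORIGINAL_PATTERN = r"[^,.;。？！]+[,.;。？！]?"

source = "[text](../path/to/file.md)"

# Rebuild the text from the regex splits, as a chunker would when merging pieces.
rebuilt = "".join(re.findall(ORIGINAL_PATTERN, source))
print(rebuilt)  # [text](./path/to/file.md)

# Assumption: the start offset is found via substring search.
# The second "." was dropped, so the rebuilt chunk is no longer a substring.
print(source.find(rebuilt))  # -1
```

Because the lookup fails, no start offset can be reported for the chunk, which matches the None value described above.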

Fixes # (issue)

Version Bump?

Yes
No

Type of Change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

I added new unit tests to cover this change
I believe this change is already covered by existing unit tests

Simple Regex Text Preservation Test

import re

text = '[**this is a link](.........../path/to/file.md) some\nmore;;text'

# Original pattern
original_pattern = "[^,.;。？！]+[,.;。？！]?"
original_splits = re.findall(original_pattern, text)
print(f"Original pattern: {original_pattern}")
print(f"Original splits: {original_splits}")
print(f"Joined result: {''.join(original_splits)}")
print(f"Matches original? {''.join(original_splits) == text}")
# Original pattern: [^,.;。？！]+[,.;。？！]?
# Original splits: ['[**this is a link](.', '/path/to/file.', 'md) some\nmore;', 'text']
# Joined result: [**this is a link](./path/to/file.md) some
# more;text
# Matches original? False

# Enhanced pattern that preserves the text while maintaining similar behavior
# This pattern: 
# 1. Captures sequences of non-punctuation followed by optional punctuation: [^,.;。？！]+[,.;。？！]?
# 2. OR captures single punctuation characters by themselves: [,.;。？！]
enhanced_pattern = r"[^,.;。？！]+[,.;。？！]?|[,.;。？！]"
enhanced_splits = re.findall(enhanced_pattern, text)
print(f"\nEnhanced pattern: {enhanced_pattern}")
print(f"Enhanced splits: {enhanced_splits}")
print(f"Joined result: {''.join(enhanced_splits)}")
print(f"Matches original? {''.join(enhanced_splits) == text}")
# Enhanced pattern: [^,.;。？！]+[,.;。？！]?|[,.;。？！]
# Enhanced splits: ['[**this is a link](.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '/path/to/file.', 'md) some\nmore;', ';', 'text']
# Joined result: [**this is a link](.........../path/to/file.md) some
# more;;text
# Matches original? True

# Test cases to verify the solution
test_cases = [
    "Simple text with no punctuation",
    "Text with, some; punctuation.",
    "Text with......multiple punctuation",
    "Text with\nnewlines\npreserved",
    "Text with,.;。？！ all specified punctuation",
    "Text with consecutive....periods",
    "Text with;;semicolons",
    "[**Markdown-style](links) preserved"
]

print("\n--- Testing with various cases ---")
for case in test_cases:
    splits = re.findall(enhanced_pattern, case)
    reconstructed = ''.join(splits)
    print(f"Case: {case!r}")
    print(f"Splits: {splits}")
    print(f"Preserved: {reconstructed == case}")
    if reconstructed != case:
        print(f"  Expected: {case!r}")
        print(f"  Got: {reconstructed!r}")
    print()
# --- Testing with various cases ---
# Case: 'Simple text with no punctuation'
# Splits: ['Simple text with no punctuation']
# Preserved: True

# Case: 'Text with, some; punctuation.'
# Splits: ['Text with,', ' some;', ' punctuation.']
# Preserved: True

# Case: 'Text with......multiple punctuation'
# Splits: ['Text with.', '.', '.', '.', '.', '.', 'multiple punctuation']
# Preserved: True

# Case: 'Text with\nnewlines\npreserved'
# Splits: ['Text with\nnewlines\npreserved']
# Preserved: True

# Case: 'Text with,.;。？！ all specified punctuation'
# Splits: ['Text with,', '.', ';', '。', '？', '！', ' all specified punctuation']
# Preserved: True

# Case: 'Text with consecutive....periods'
# Splits: ['Text with consecutive.', '.', '.', '.', 'periods']
# Preserved: True

# Case: 'Text with;;semicolons'
# Splits: ['Text with;', ';', 'semicolons']
# Preserved: True

# Case: '[**Markdown-style](links) preserved'
# Splits: ['[**Markdown-style](links) preserved']
# Preserved: True

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have added Google Colab support for the newly added notebooks.~~
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran make format; make lint to appease the lint gods

logan-markewich · 2025-03-09T22:17:38Z

llama-index-core/tests/text_splitter/test_sentence_splitter.py

@@ -27,6 +27,19 @@ def test_start_end_char_idx() -> None:
        )


+def test_start_end_char_idx_repeating_regex_chars() -> None:
+    """Test case of a string with repeating characters in [,.;。？！]."""
+    document = Document(text="[this is a link](../path/to/file.md) " * 12)


This test isn't really testing how the content is split, right? Shouldn't it check whether the text still has the repeated .. or not?
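One way the check suggested here could look, sketched against the regex alone rather than SentenceSplitter. The helper name `check_repeated_chars_preserved` is hypothetical, and the pattern is the enhanced one quoted in the PR description.

```python
import re

# Enhanced pattern from the PR description.
ENHANCED_PATTERN = r"[^,.;。？！]+[,.;。？！]?|[,.;。？！]"


def check_repeated_chars_preserved(text: str) -> None:
    """Hypothetical helper: verify the splits rebuild the text exactly."""
    splits = re.findall(ENHANCED_PATTERN, text)
    rebuilt = "".join(splits)
    # The rebuilt text must be identical to the input ...
    assert rebuilt == text
    # ... and in particular must keep every repeated ".." sequence.
    assert rebuilt.count("..") == text.count("..")


# Same document text as the test under review.
text = "[this is a link](../path/to/file.md) " * 12
check_repeated_chars_preserved(text)
```

A test on the splitter itself could apply the same idea to each chunk's text, asserting that every chunk is found verbatim in the source document.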

tylerganter added 2 commits March 7, 2025 18:57

ensuring original text is preserved in CHUNKING_REGEX

1869de8

adding test for chunking regex bugfix

8e15245

tylerganter marked this pull request as ready for review March 8, 2025 02:56

dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Mar 8, 2025

typo

82cd45b

logan-markewich reviewed Mar 9, 2025
