⚡️ Speed up function extract_text_from_spans by 31% #40

Open: codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-extract_text_from_spans-mkot6ym0

@codeflash-ai codeflash-ai bot commented Jan 22, 2026

📄 31% (0.31x) speedup for extract_text_from_spans in unstructured_inference/models/table_postprocess.py

⏱️ Runtime : 7.78 milliseconds → 5.95 milliseconds (best of 95 runs)

📝 Explanation and details

The optimized code achieves a 31% speedup (7.78ms → 5.95ms) by replacing three consecutive sorts with a single sort using a tuple key.

Key Optimization

Original approach:

spans_copy.sort(key=lambda span: span["span_num"])
spans_copy.sort(key=lambda span: span["line_num"])
spans_copy.sort(key=lambda span: span["block_num"])

Optimized approach:

spans_copy.sort(key=lambda span: (span["block_num"], span["line_num"], span["span_num"]))

Why This Works

Python's sort is stable, so the original code sorts by span_num, then re-sorts by line_num (preserving span_num order within each line), then re-sorts by block_num (preserving the previous ordering). However, this executes the sorting algorithm three times.

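The optimized version relies on Python's tuple comparison: when sorting by (block_num, line_num, span_num), tuples are compared element by element, so block_num decides first, line_num breaks ties within a block, and span_num breaks ties within a line. Because the sort is stable in both versions, spans with identical keys keep their input order either way, so the final ordering is identical but requires only one sort call.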
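
The equivalence argued above can be checked empirically. The following is a minimal sketch, not code from the PR: the helper names (`sort_three_passes`, `sort_tuple_key`, `make_spans`) and the random span data are assumptions made for illustration. Keys are drawn from a small range on purpose so that many spans share the same (block, line, span) triple, which exercises the stability argument as well.

```python
import random


def sort_three_passes(spans):
    """Original approach: three stable sorts, least significant key first."""
    out = list(spans)
    out.sort(key=lambda s: s["span_num"])
    out.sort(key=lambda s: s["line_num"])
    out.sort(key=lambda s: s["block_num"])
    return out


def sort_tuple_key(spans):
    """Optimized approach: one sort on a (block, line, span) tuple key."""
    out = list(spans)
    out.sort(key=lambda s: (s["block_num"], s["line_num"], s["span_num"]))
    return out


def make_spans(n, seed=0):
    """Hypothetical test data: n spans with random keys in range(4).

    Each span gets a unique text, so comparing the sorted lists also
    compares the relative order of spans whose keys are identical.
    """
    rng = random.Random(seed)
    return [
        {
            "text": f"w{i}",
            "block_num": rng.randrange(4),
            "line_num": rng.randrange(4),
            "span_num": rng.randrange(4),
        }
        for i in range(n)
    ]


spans = make_spans(200)
# Both strategies must yield exactly the same sequence of spans.
assert sort_three_passes(spans) == sort_tuple_key(spans)
```

With 200 spans spread over at most 64 distinct key triples, ties are guaranteed, so the assertion would fail if either strategy reordered equal-key spans differently.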

Performance Impact

From the line profiler data:

  • Original: Three sorts take ~14.9ms total (9.5% + 9.8% + 9.5% of 51.68ms)
  • Optimized: Single sort takes ~6.6ms (15.7% of 42.14ms)

The single sort is ~2.2x faster than the three separate sorts, which accounts for most of the reduction in profiled time (51.68ms → 42.14ms) and of the overall 31% speedup.
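
The figures above follow from simple arithmetic on the reported numbers. As a sketch (the helper name `share_to_ms` is made up for illustration; all inputs are the values quoted in this PR description):

```python
def share_to_ms(percent_of_total, total_ms):
    """Convert a line-profiler percentage into milliseconds."""
    return percent_of_total / 100.0 * total_ms


# Line-profiler shares reported in the PR description.
original_sort_ms = share_to_ms(9.5 + 9.8 + 9.5, 51.68)
optimized_sort_ms = share_to_ms(15.7, 42.14)
sort_ratio = original_sort_ms / optimized_sort_ms

# Overall speedup from the reported best-of-95 runtimes (7.78ms -> 5.95ms).
overall_speedup = 7.78 / 5.95 - 1.0

print(f"original sorts:  {original_sort_ms:.2f} ms")
print(f"optimized sort:  {optimized_sort_ms:.2f} ms")
print(f"sort ratio:      {sort_ratio:.2f}x")
print(f"overall speedup: {overall_speedup:.1%}")
```

Note that the profiled totals (51.68ms and 42.14ms) include line-profiler overhead, so they are not directly comparable to the 7.78ms and 5.95ms wall-clock runtimes.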

Test Results Show Strong Gains at Scale

The optimization particularly shines with larger datasets:

  • test_large_scale_many_spans_sorted_and_joined_correctly: 50.6% faster (344μs → 228μs)
  • test_large_scale_unsorted_spans: 44.2% faster (249μs → 172μs)
  • test_large_scale_with_unicode_spans: 55.8% faster (224μs → 144μs)
  • test_large_scale_mixed_block_line_spans: 37.7% faster (3.70ms → 2.69ms)

Smaller test cases show modest gains, mostly in the 5-15% range (the smallest is 3.48%), and the empty-list case is 6.94% slower (1.40μs → 1.50μs). The optimization's value scales with input size.
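
To see how the gap changes with input size on a given machine, a small `timeit` micro-benchmark along these lines could be used. This is a sketch under stated assumptions: it times only the sorting step, not the full `extract_text_from_spans` function, and the function names and synthetic span layout (10 spans per line, 10 lines per block, shuffled) are hypothetical. Absolute timings will differ from the figures reported above.

```python
import random
import timeit


def three_sorts(spans):
    """Original approach: three stable sorts, least significant key first."""
    out = list(spans)
    out.sort(key=lambda s: s["span_num"])
    out.sort(key=lambda s: s["line_num"])
    out.sort(key=lambda s: s["block_num"])
    return out


def one_sort(spans):
    """Optimized approach: a single sort with a tuple key."""
    out = list(spans)
    out.sort(key=lambda s: (s["block_num"], s["line_num"], s["span_num"]))
    return out


def make_shuffled_spans(n, seed=0):
    """Hypothetical input: n spans with unique keys, in shuffled order."""
    spans = [
        {
            "text": f"w{i}",
            "block_num": i // 100,
            "line_num": (i % 100) // 10,
            "span_num": i % 10,
        }
        for i in range(n)
    ]
    random.Random(seed).shuffle(spans)
    return spans


def bench(n, repeats=5, number=20):
    """Return per-call seconds (three_sorts, one_sort), best of `repeats`."""
    spans = make_shuffled_spans(n)
    t3 = min(timeit.repeat(lambda: three_sorts(spans), repeat=repeats, number=number))
    t1 = min(timeit.repeat(lambda: one_sort(spans), repeat=repeats, number=number))
    return t3 / number, t1 / number


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        t3, t1 = bench(n)
        print(f"n={n:>6}: three sorts {t3 * 1e6:9.1f} us, one sort {t1 * 1e6:9.1f} us")
```

Taking the minimum over several repeats mirrors the "best of N runs" convention used for the runtimes in this PR.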

Workload Context

Based on function_references, this function is called by extract_text_inside_bbox() during table text extraction. Table processing often involves:

  • Many spans per table cell
  • Repeated calls for multiple bounding boxes
  • Potentially large documents with numerous tables

In such workloads, the 31% speedup should reduce latency in document parsing pipelines, especially for documents with complex table structures containing many text spans that need proper ordering.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 51 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests

import pytest  # used for our unit tests
from unstructured_inference.models.table_postprocess import \
    extract_text_from_spans

def test_basic_join_with_space_single_line():
    # Two spans on the same block/line should be joined with a space by default.
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 1},
    ]
    # Expect "Hello World" when join_with_space=True (default).
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 7.20μs -> 6.50μs (10.9% faster)

def test_join_without_space_single_line():
    # When join_with_space is False, spans in the same line should be concatenated without spaces.
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 1},
    ]
    # Expect "HelloWorld" when join_with_space=False.
    codeflash_output = extract_text_from_spans(spans, join_with_space=False); result = codeflash_output # 7.50μs -> 6.95μs (7.88% faster)

def test_hyphen_at_line_end_prevents_extra_space_when_join_without_space():
    # The function should not add an extra space at a line break if the line ends with a hyphen
    # that is not preceded by a space when join_with_space=False.
    spans = [
        # first line ends with hyphen
        {"text": "ex-", "block_num": 0, "line_num": 0, "span_num": 0},
        # new line begins; different line_num forces the end-of-line handling
        {"text": "ample", "block_num": 0, "line_num": 1, "span_num": 0},
    ]
    # No space should be introduced between "ex-" and "ample" because of the hyphen rule.
    codeflash_output = extract_text_from_spans(spans, join_with_space=False); result = codeflash_output # 8.80μs -> 8.26μs (6.49% faster)

def test_line_end_space_is_handled_correctly_when_join_without_space():
    # If end of a line would cause the function to append a space when join_with_space=False,
    # it should do so when appropriate to separate consecutive lines.
    spans = [
        {"text": "First", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "Line", "block_num": 0, "line_num": 1, "span_num": 0},
    ]
    # With join_with_space=False the algorithm adds exactly one space at the end of the first line
    # (unless hyphen rules apply). The final output should be "First Line".
    codeflash_output = extract_text_from_spans(spans, join_with_space=False); result = codeflash_output # 8.64μs -> 8.21μs (5.25% faster)

def test_remove_numeric_superscripts_and_keep_non_numeric_marked():
    # Create three spans: normal text, numeric superscript flagged, non-numeric superscript flagged.
    spans = [
        {"text": "Item", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "2", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 1},  # numeric superscript -> removed
        {"text": "th", "block_num": 0, "line_num": 0, "span_num": 2, "flags": 1},  # non-numeric -> kept & marked
    ]
    # After extraction, numeric "2" should be removed, "th" should remain and also be marked in the original dict.
    codeflash_output = extract_text_from_spans(spans, join_with_space=True, remove_integer_superscripts=True); result = codeflash_output # 9.86μs -> 9.06μs (8.80% faster)

def test_all_superscripts_removed_returns_empty_string():
    # If all spans are numeric and flagged as superscripts, they should be removed and function returns "".
    spans = [
        {"text": "1", "block_num": 0, "line_num": 0, "span_num": 0, "flags": 1},
        {"text": "2", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 1},
    ]
    codeflash_output = extract_text_from_spans(spans, remove_integer_superscripts=True); result = codeflash_output # 4.04μs -> 3.79μs (6.38% faster)

def test_sorting_by_block_line_and_span_numbers_reorders_unordered_input():
    # Provide spans in a shuffled order with distinct block/line/span numbers.
    spans = [
        {"text": "b1", "block_num": 1, "line_num": 0, "span_num": 0},
        {"text": "a1", "block_num": 0, "line_num": 1, "span_num": 0},
        {"text": "a0", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "b0", "block_num": 1, "line_num": 0, "span_num": 1},
    ]
    # Sorted order is block 0 first: a0 (line 0), then a1 (line 1).
    # Block 1 follows: b1 (span 0), then b0 (span 1), both on line 0.
    # With the default join_with_space=True, spans, lines, and blocks are all separated by single spaces.
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 11.0μs -> 10.0μs (10.2% faster)
    # Build expected by stable sort: sort by block,line,span
    expected_order = sorted(spans, key=lambda s: (s["block_num"], s["line_num"], s["span_num"]))
    expected_text = " ".join(s["text"] for s in expected_order)

def test_large_scale_many_spans_sorted_and_joined_correctly():
    # Create a moderately large collection of spans (500) across multiple lines and one block.
    # We intentionally provide them in reverse order so the sorting must reorder them.
    n_lines = 50
    spans_per_line = 10  # 50 * 10 = 500 spans total
    total_spans = n_lines * spans_per_line

    spans = []
    # Create spans in reverse sorted order to ensure the sort changes the order.
    for line in range(n_lines - 1, -1, -1):
        for span_num in range(spans_per_line - 1, -1, -1):
            idx = line * spans_per_line + span_num
            spans.append(
                {
                    "text": f"w{idx}",
                    "block_num": 0,
                    "line_num": line,
                    "span_num": span_num,
                }
            )

    # Compute expected result by sorting spans by block,line,span and joining with spaces.
    expected_sorted = sorted(spans, key=lambda s: (s["block_num"], s["line_num"], s["span_num"]))
    expected_text = " ".join(s["text"] for s in expected_sorted)

    # Run the function and assert equality.
    codeflash_output = extract_text_from_spans(spans, join_with_space=True); result = codeflash_output # 344μs -> 228μs (50.6% faster)
    # Sanity checks: length of tokens in result should match number of spans when splitting by space.
    result_tokens = result.split(" ")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
from unstructured_inference.models.table_postprocess import \
    extract_text_from_spans

def test_single_span_single_block_single_line():
    """Test basic functionality with a single span in a single block and line."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 5.99μs -> 5.28μs (13.5% faster)

def test_multiple_spans_same_line_same_block():
    """Test multiple spans on the same line within the same block joined with space."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 1}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 7.23μs -> 6.66μs (8.51% faster)

def test_multiple_lines_same_block():
    """Test multiple lines within the same block."""
    spans = [
        {"text": "Line", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "one", "block_num": 0, "line_num": 0, "span_num": 1},
        {"text": "Line", "block_num": 0, "line_num": 1, "span_num": 0},
        {"text": "two", "block_num": 0, "line_num": 1, "span_num": 1}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 9.94μs -> 9.29μs (7.04% faster)

def test_multiple_blocks():
    """Test spans across multiple blocks."""
    spans = [
        {"text": "Block", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "one", "block_num": 0, "line_num": 0, "span_num": 1},
        {"text": "Block", "block_num": 1, "line_num": 0, "span_num": 0},
        {"text": "two", "block_num": 1, "line_num": 0, "span_num": 1}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 9.88μs -> 9.14μs (8.05% faster)

def test_join_without_space():
    """Test joining spans without spaces using join_with_space=False."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 1}
    ]
    codeflash_output = extract_text_from_spans(spans, join_with_space=False); result = codeflash_output # 7.40μs -> 6.96μs (6.37% faster)

def test_integer_superscript_removal():
    """Test removal of integer superscripts when remove_integer_superscripts=True."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "1", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 1},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans, remove_integer_superscripts=True); result = codeflash_output # 8.87μs -> 8.22μs (7.88% faster)

def test_non_integer_superscript_not_removed():
    """Test that non-integer superscripts are not removed but marked."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "a", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 1},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans, remove_integer_superscripts=True); result = codeflash_output # 9.71μs -> 9.38μs (3.48% faster)

def test_superscript_removal_disabled():
    """Test that superscripts are not removed when remove_integer_superscripts=False."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "1", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 1},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans, remove_integer_superscripts=False); result = codeflash_output # 7.67μs -> 7.35μs (4.42% faster)

def test_empty_spans_list():
    """Test behavior with an empty spans list."""
    spans = []
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 1.40μs -> 1.50μs (6.94% slower)

def test_all_spans_are_integer_superscripts():
    """Test when all spans are integer superscripts that get removed."""
    spans = [
        {"text": "1", "block_num": 0, "line_num": 0, "span_num": 0, "flags": 1},
        {"text": "2", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 1},
        {"text": "3", "block_num": 0, "line_num": 0, "span_num": 2, "flags": 1}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 4.16μs -> 3.83μs (8.64% faster)

def test_span_without_flags_key():
    """Test spans that don't have a 'flags' key are not processed for superscript removal."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 1}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 7.09μs -> 6.62μs (7.26% faster)

def test_span_with_whitespace_only():
    """Test handling of spans containing only whitespace."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "   ", "block_num": 0, "line_num": 0, "span_num": 1},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 7.99μs -> 7.46μs (7.09% faster)

def test_span_ending_with_hyphen_no_space_before():
    """Test line ending with hyphen without space before it, with join_with_space=False."""
    spans = [
        {"text": "Hello-", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "world", "block_num": 0, "line_num": 1, "span_num": 0}
    ]
    codeflash_output = extract_text_from_spans(spans, join_with_space=False); result = codeflash_output # 8.79μs -> 8.02μs (9.60% faster)

def test_span_ending_with_space():
    """Test line ending with a space."""
    spans = [
        {"text": "Hello ", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "World", "block_num": 0, "line_num": 1, "span_num": 0}
    ]
    codeflash_output = extract_text_from_spans(spans, join_with_space=False); result = codeflash_output # 8.98μs -> 8.49μs (5.77% faster)

def test_superscript_with_leading_trailing_spaces():
    """Test superscript with spaces around the digit."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "  1  ", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 1},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans, remove_integer_superscripts=True); result = codeflash_output # 9.08μs -> 8.17μs (11.1% faster)

def test_unsorted_spans_by_span_num():
    """Test that spans are correctly sorted by span_num within a line."""
    spans = [
        {"text": "first", "block_num": 0, "line_num": 0, "span_num": 2},
        {"text": "second", "block_num": 0, "line_num": 0, "span_num": 1},
        {"text": "third", "block_num": 0, "line_num": 0, "span_num": 0}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 8.04μs -> 7.49μs (7.40% faster)

def test_unsorted_spans_by_line_num():
    """Test that spans are correctly sorted by line_num within a block."""
    spans = [
        {"text": "line2", "block_num": 0, "line_num": 1, "span_num": 0},
        {"text": "line1", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "line3", "block_num": 0, "line_num": 2, "span_num": 0}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 9.71μs -> 9.22μs (5.25% faster)

def test_unsorted_spans_by_block_num():
    """Test that spans are correctly sorted by block_num."""
    spans = [
        {"text": "block2", "block_num": 1, "line_num": 0, "span_num": 0},
        {"text": "block1", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "block3", "block_num": 2, "line_num": 0, "span_num": 0}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 9.58μs -> 8.94μs (7.19% faster)

def test_single_character_spans():
    """Test single character spans."""
    spans = [
        {"text": "a", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "b", "block_num": 0, "line_num": 0, "span_num": 1},
        {"text": "c", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 7.93μs -> 7.54μs (5.15% faster)

def test_empty_string_spans():
    """Test empty string spans."""
    spans = [
        {"text": "", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 1},
        {"text": "", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 8.05μs -> 7.58μs (6.22% faster)

def test_spans_with_special_characters():
    """Test spans with special characters."""
    spans = [
        {"text": "Hello!", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "@#$%", "block_num": 0, "line_num": 0, "span_num": 1},
        {"text": "World.", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 7.95μs -> 7.32μs (8.62% faster)

def test_spans_with_unicode_characters():
    """Test spans with unicode characters."""
    spans = [
        {"text": "Hëllo", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "Wørld", "block_num": 0, "line_num": 0, "span_num": 1},
        {"text": "日本", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 8.43μs -> 8.01μs (5.27% faster)

def test_span_hyphen_with_space_before():
    """Test line ending with hyphen that has a space before it."""
    spans = [
        {"text": "test -", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "word", "block_num": 0, "line_num": 1, "span_num": 0}
    ]
    codeflash_output = extract_text_from_spans(spans, join_with_space=False); result = codeflash_output # 9.01μs -> 8.48μs (6.23% faster)

def test_zero_flags_value():
    """Test spans with flags set to 0 (no superscript)."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0, "flags": 0},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 0}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 7.74μs -> 7.06μs (9.64% faster)

def test_flags_with_multiple_bits_set():
    """Test spans with multiple flag bits set."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "1", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 3},  # bits 0 and 1 set
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans, remove_integer_superscripts=True); result = codeflash_output # 8.84μs -> 8.17μs (8.19% faster)

def test_multidigit_integer_superscript():
    """Test removal of multi-digit integer superscripts."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "123", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 1},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 8.35μs -> 7.85μs (6.40% faster)

def test_integer_with_leading_zeros():
    """Test integer superscript with leading zeros."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "001", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 1},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 8.63μs -> 8.08μs (6.74% faster)

def test_negative_integer_superscript():
    """Test that negative integers are not treated as superscripts (contain hyphen)."""
    spans = [
        {"text": "Hello", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "-1", "block_num": 0, "line_num": 0, "span_num": 1, "flags": 1},
        {"text": "World", "block_num": 0, "line_num": 0, "span_num": 2}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 9.55μs -> 9.04μs (5.63% faster)

def test_line_ending_with_hyphen_and_space_before_join_with_space():
    """Test line ending with hyphen (space before) using join_with_space=True."""
    spans = [
        {"text": "test -", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "word", "block_num": 0, "line_num": 1, "span_num": 0}
    ]
    codeflash_output = extract_text_from_spans(spans, join_with_space=True); result = codeflash_output # 8.82μs -> 8.21μs (7.52% faster)

def test_very_long_text_span():
    """Test handling of very long text in a single span."""
    long_text = "a" * 10000
    spans = [
        {"text": long_text, "block_num": 0, "line_num": 0, "span_num": 0}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 5.81μs -> 5.09μs (14.2% faster)

def test_complex_sorting_scenario():
    """Test complex scenario with multiple blocks, lines, and spans."""
    spans = [
        {"text": "B2L2S2", "block_num": 1, "line_num": 1, "span_num": 1},
        {"text": "B2L2S1", "block_num": 1, "line_num": 1, "span_num": 0},
        {"text": "B1L2S1", "block_num": 0, "line_num": 1, "span_num": 0},
        {"text": "B1L1S2", "block_num": 0, "line_num": 0, "span_num": 1},
        {"text": "B1L1S1", "block_num": 0, "line_num": 0, "span_num": 0},
        {"text": "B2L1S1", "block_num": 1, "line_num": 0, "span_num": 0}
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 12.7μs -> 11.8μs (7.55% faster)

def test_large_number_of_spans_single_block_single_line():
    """Test with a large number of spans on a single line in a single block."""
    # Create 500 spans on the same line and block
    spans = [
        {"text": f"word{i}", "block_num": 0, "line_num": 0, "span_num": i}
        for i in range(500)
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 246μs -> 204μs (20.3% faster)

def test_large_number_of_lines():
    """Test with a large number of lines within a single block."""
    # Create 500 lines with 2 spans each
    spans = []
    for line_num in range(500):
        spans.append({"text": f"line{line_num}word1", "block_num": 0, "line_num": line_num, "span_num": 0})
        spans.append({"text": f"line{line_num}word2", "block_num": 0, "line_num": line_num, "span_num": 1})
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 713μs -> 612μs (16.5% faster)

def test_large_number_of_blocks():
    """Test with a large number of blocks."""
    # Create 500 blocks with 1 span each
    spans = [
        {"text": f"block{i}text", "block_num": i, "line_num": 0, "span_num": 0}
        for i in range(500)
    ]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 365μs -> 332μs (10.2% faster)

def test_large_scale_with_superscript_removal():
    """Test superscript removal at large scale."""
    # Create 500 spans with every 5th span being an integer superscript
    spans = []
    for i in range(500):
        if i % 5 == 0:
            spans.append({"text": str(i // 5), "block_num": 0, "line_num": 0, "span_num": i, "flags": 1})
        else:
            spans.append({"text": f"word{i}", "block_num": 0, "line_num": 0, "span_num": i})
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 497μs -> 457μs (8.81% faster)
    # Should not contain just "0" as a word (it was removed as superscript)
    words = result.split()
    for word in words:
        if word.isdigit():
            pytest.fail(f"Found digit word {word} which should have been removed")

def test_large_scale_mixed_block_line_spans():
    """Test with a large number of blocks, lines, and spans mixed together."""
    # Create 200 blocks, each with 5 lines, each with 5 spans
    spans = []
    span_counter = 0
    for block_num in range(200):
        for line_num in range(5):
            for span_num in range(5):
                spans.append({
                    "text": f"B{block_num}L{line_num}S{span_num}",
                    "block_num": block_num,
                    "line_num": line_num,
                    "span_num": span_num
                })
                span_counter += 1
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 3.70ms -> 2.69ms (37.7% faster)

def test_large_scale_unsorted_spans():
    """Test large scale with unsorted spans to verify sorting performance."""
    # Create 300 spans in reverse order
    spans = []
    for i in range(300, 0, -1):
        spans.append({
            "text": f"span{i}",
            "block_num": (i - 1) // 30,  # 10 blocks
            "line_num": (i - 1) % 30 // 3,  # 10 lines per block
            "span_num": (i - 1) % 3
        })
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 249μs -> 172μs (44.2% faster)

def test_large_scale_with_unicode_spans():
    """Test large scale with unicode characters."""
    # Create 300 spans with various unicode characters
    unicode_chars = ["日", "中", "文", "テ", "スト", "한", "글", "Ñ", "Ü", "É"]
    spans = []
    for i in range(300):
        char = unicode_chars[i % len(unicode_chars)]
        spans.append({
            "text": f"{char}{i}",
            "block_num": i // 30,
            "line_num": (i % 30) // 10,
            "span_num": i % 10
        })
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 224μs -> 144μs (55.8% faster)

def test_large_scale_join_without_space():
    """Test large scale with join_with_space=False."""
    spans = [
        {"text": f"word{i}", "block_num": 0, "line_num": 0, "span_num": i}
        for i in range(300)
    ]
    codeflash_output = extract_text_from_spans(spans, join_with_space=False); result = codeflash_output # 150μs -> 124μs (20.7% faster)

def test_large_scale_complex_superscripts():
    """Test large scale with complex superscript patterns."""
    spans = []
    for i in range(400):
        if i % 7 == 0:  # Every 7th span is a superscript
            spans.append({
                "text": str(i // 7),
                "block_num": i // 100,
                "line_num": (i % 100) // 10,
                "span_num": i % 10,
                "flags": 1
            })
        else:
            spans.append({
                "text": f"text{i}",
                "block_num": i // 100,
                "line_num": (i % 100) // 10,
                "span_num": i % 10
            })
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 412μs -> 315μs (30.7% faster)

def test_edge_case_single_span_large_scale():
    """Test a single very large span (edge case at large scale)."""
    large_text = "A" * 50000
    spans = [{"text": large_text, "block_num": 0, "line_num": 0, "span_num": 0}]
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 5.63μs -> 5.18μs (8.73% faster)

def test_large_scale_many_empty_spans():
    """Test with many empty spans mixed with text spans."""
    spans = []
    for i in range(400):
        if i % 2 == 0:
            spans.append({
                "text": "",
                "block_num": i // 40,
                "line_num": (i % 40) // 4,
                "span_num": i % 4
            })
        else:
            spans.append({
                "text": f"word{i}",
                "block_num": i // 40,
                "line_num": (i % 40) // 4,
                "span_num": i % 4
            })
    codeflash_output = extract_text_from_spans(spans); result = codeflash_output # 297μs -> 212μs (39.9% faster)

def test_large_scale_spans_with_flags_no_removal():
    """Test large scale with flags set but removal disabled."""
    spans = []
    for i in range(350):
        if i % 10 == 0:
            spans.append({
                "text": str(i // 10),
                "block_num": i // 50,
                "line_num": (i % 50) // 5,
                "span_num": i % 5,
                "flags": 1
            })
        else:
            spans.append({
                "text": f"word{i}",
                "block_num": i // 50,
                "line_num": (i % 50) // 5,
                "span_num": i % 5
            })
    codeflash_output = extract_text_from_spans(spans, remove_integer_superscripts=False); result = codeflash_output # 261μs -> 168μs (55.2% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, git checkout codeflash/optimize-extract_text_from_spans-mkot6ym0 and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 on January 22, 2026, 02:04
@codeflash-ai codeflash-ai bot added the labels "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) on Jan 22, 2026