
⚡️ Speed up function extract_text_inside_bbox by 187% #37

Open
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-extract_text_inside_bbox-mkosi66t

codeflash-ai bot commented Jan 22, 2026

📄 187% (1.87x) speedup for extract_text_inside_bbox in unstructured_inference/models/table_postprocess.py

⏱️ Runtime: 14.9 milliseconds → 5.19 milliseconds (best of 155 runs)
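
The headline percentage can be reproduced from the two runtimes. The sketch below assumes the speedup is defined as time saved divided by the optimized runtime; the PR does not state this formula, but it matches both the 187% and the 1.87x figures. The function name `speedup_percent` is hypothetical.

```python
def speedup_percent(original_ms: float, optimized_ms: float) -> float:
    """Relative speedup: time saved divided by the optimized runtime.

    Assumed definition, inferred from the figures quoted in the report.
    """
    return (original_ms - optimized_ms) / optimized_ms * 100.0


original_ms, optimized_ms = 14.9, 5.19
pct = speedup_percent(original_ms, optimized_ms)
ratio = original_ms / optimized_ms  # how many times as fast the new code is

print(f"{pct:.0f}% speedup, {ratio:.2f}x as fast")  # 187% speedup, 2.87x as fast
```

Under this definition a 187% speedup means the optimized code runs about 2.87 times as fast, not 1.87 times.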

📝 Explanation and details

The optimized code achieves a 187% speedup (14.9ms → 5.19ms) through three key optimizations:

1. Inlined Bbox Intersection Test (~43% faster in get_bbox_span_subset)

The original code calls overlaps() for every span, which constructs two Rect objects, computes areas, and performs intersection calculations. The optimized version inlines the numeric intersection test directly, avoiding object construction overhead. For the common case where bbox is a 4-element sequence, it:

  • Unpacks coordinates once for the search bbox
  • Performs direct float arithmetic for intersection area
  • Only falls back to overlaps() for malformed inputs

This is particularly effective since get_bbox_span_subset is the hottest function (74.5% of total time in original), and the test cases show 67-277% speedup when filtering large span sets.
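
A minimal sketch of what an inlined intersection test can look like, assuming the overlap criterion is "intersection area divided by span area is at least the threshold" and that zero-area spans never match, as the generated tests suggest. The names `overlaps_inline` and `filter_spans` are hypothetical and are not the functions in the PR.

```python
from typing import Dict, List, Sequence


def overlaps_inline(
    span_bbox: Sequence[float],
    search_bbox: Sequence[float],
    threshold: float = 0.5,
) -> bool:
    """Return True if at least `threshold` of the span's area lies inside search_bbox."""
    sx0, sy0, sx1, sy1 = span_bbox
    bx0, by0, bx1, by1 = search_bbox

    span_area = (sx1 - sx0) * (sy1 - sy0)
    if span_area <= 0:
        # Degenerate span: treated as non-overlapping (assumption).
        return False

    # Width and height of the intersection rectangle; non-positive means no overlap.
    inter_w = min(sx1, bx1) - max(sx0, bx0)
    inter_h = min(sy1, by1) - max(sy0, by0)
    if inter_w <= 0 or inter_h <= 0:
        return False

    return (inter_w * inter_h) / span_area >= threshold


def filter_spans(spans: List[Dict], search_bbox: Sequence[float]) -> List[Dict]:
    """Keep spans that pass the inline test; no rectangle objects are allocated."""
    return [s for s in spans if overlaps_inline(s["bbox"], search_bbox)]


spans = [
    {"text": "inside", "bbox": [10, 10, 50, 30]},
    {"text": "partial", "bbox": [80, 80, 110, 110]},  # 400/900 of its area overlaps
    {"text": "outside", "bbox": [150, 150, 200, 180]},
]
kept = filter_spans(spans, [0, 0, 100, 100])
```

The early returns are what make the mostly-outside case cheap: a span that misses the search bbox is rejected after a handful of comparisons.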

2. Single Tuple-Key Sort (~27ms → ~2ms in extract_text_from_spans)

The original performs three separate stable sorts on spans_copy:

spans_copy.sort(key=lambda span: span["span_num"])
spans_copy.sort(key=lambda span: span["line_num"])  
spans_copy.sort(key=lambda span: span["block_num"])

Each sort is O(n log n), totaling ~27% of runtime. The optimized version uses a single sort with a composite tuple key:

spans_copy.sort(key=lambda span: (span["block_num"], span["line_num"], span["span_num"]))

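Because list.sort is stable, the three successive sorts leave the spans ordered by block_num, then line_num, then span_num. Python compares tuples element by element, so the composite key produces the same ordering with a single sort call, reducing sort overhead by ~3x.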
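
The equivalence can be checked directly. This is an illustrative sketch on synthetic data, not code from the PR; the helper names `three_stable_sorts` and `single_tuple_sort` are hypothetical, and the keys are assumed to be plain integers.

```python
import random


def three_stable_sorts(spans):
    """Original approach: least-significant key first, relying on sort stability."""
    out = list(spans)
    out.sort(key=lambda s: s["span_num"])
    out.sort(key=lambda s: s["line_num"])
    out.sort(key=lambda s: s["block_num"])
    return out


def single_tuple_sort(spans):
    """Optimized approach: one sort with a composite key, most-significant key first."""
    return sorted(spans, key=lambda s: (s["block_num"], s["line_num"], s["span_num"]))


rng = random.Random(0)
spans = [
    {
        "block_num": rng.randrange(3),
        "line_num": rng.randrange(4),
        "span_num": rng.randrange(5),
        "text": f"t{i}",  # unique text makes tie-breaking order observable
    }
    for i in range(200)
]

same_order = three_stable_sorts(spans) == single_tuple_sort(spans)
print(same_order)  # True
```

Spans that tie on all three keys keep their input order in both versions, since both rely on the same stable sort.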

3. Eliminated O(n²) list.remove() (~127μs → ~34μs for superscript removal)

The original builds spans_copy = spans[:] and then calls spans_copy.remove(span) for each integer superscript. Each removal is O(n) because it searches the list and shifts the remaining elements, so k removals cost O(n·k), which is quadratic when a large share of the spans are removed (102 removals in the profile). The optimized version builds the filtered list in one pass by appending only non-removed spans, making it O(n).
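
The two removal strategies can be compared side by side. This sketch assumes an integer superscript is a span whose `flags` has bit 0 set and whose text consists only of digits, which is what the generated tests imply; `is_integer_superscript`, `drop_with_remove`, and `drop_one_pass` are hypothetical names, and the sketch omits any marking of non-integer superscripts that the real function may perform.

```python
def is_integer_superscript(span) -> bool:
    """Assumed rule: superscript bit (2**0) set and text made only of digits."""
    return bool(span.get("flags", 0) & 1) and span["text"].isdigit()


def drop_with_remove(spans):
    """Copy, then remove one by one: each remove() scans and shifts the list."""
    kept = spans[:]
    for span in spans:
        if is_integer_superscript(span):
            kept.remove(span)
    return kept


def drop_one_pass(spans):
    """Build the result in a single pass: O(n) regardless of how many are dropped."""
    return [span for span in spans if not is_integer_superscript(span)]


spans = []
for i in range(100):
    spans.append({"text": f"Text{i}", "span_num": 2 * i})
    spans.append({"text": str(i % 10), "span_num": 2 * i + 1, "flags": 1})

kept = drop_one_pass(spans)
```

Both functions return the spans in their original relative order, so the later sort sees the same input either way.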

Impact on Workloads

Based on function_references, extract_text_inside_bbox is called from remove_objects_without_content() in a loop over objects. Since table extraction likely involves many cells/rows/columns, this optimization compounds across repeated calls. Test cases with 500-800 spans show 151-277% speedup, making batch processing of tables significantly faster. The optimization is particularly valuable when:

  • Extracting text from many table cells (large-scale tests show 2-3x gains)
  • Processing documents with dense span data (the 800-span test improved by 213%)
  • Filtering spans with high rejection rates (performance test with mostly-outside spans: 277% faster)
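
To get a feel for the high-rejection case on your own machine, a small `timeit` harness along these lines may help. It is a sketch under stated assumptions: `Box` is a hypothetical stand-in for an object-based rectangle, both filters are illustrative rather than the PR's code, and the data mirrors the 800-span test where only the first 10 spans fall inside the search bbox. No timings are claimed here.

```python
import timeit


class Box:
    """Hypothetical object-based rectangle, used only for comparison."""

    def __init__(self, bbox):
        self.x0, self.y0, self.x1, self.y1 = map(float, bbox)

    def area(self):
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersect(self, other):
        return Box([max(self.x0, other.x0), max(self.y0, other.y0),
                    min(self.x1, other.x1), min(self.y1, other.y1)])


def filter_with_objects(spans, bbox, threshold=0.5):
    kept = []
    for s in spans:
        rect = Box(s["bbox"])  # two object constructions per span
        area = rect.area()
        if area > 0 and rect.intersect(Box(bbox)).area() / area >= threshold:
            kept.append(s)
    return kept


def filter_inline(spans, bbox, threshold=0.5):
    bx0, by0, bx1, by1 = bbox  # unpack the search bbox once
    kept = []
    for s in spans:
        x0, y0, x1, y1 = s["bbox"]
        area = (x1 - x0) * (y1 - y0)
        if area <= 0:
            continue
        w = min(x1, bx1) - max(x0, bx0)
        h = min(y1, by1) - max(y0, by0)
        if w > 0 and h > 0 and (w * h) / area >= threshold:
            kept.append(s)
    return kept


spans = []
for i in range(800):
    if i < 10:
        box = [i * 2, i * 2, i * 2 + 1, i * 2 + 1]  # inside the search bbox
    else:
        box = [2000 + i * 10, 2000 + i * 10, 2010 + i * 10, 2010 + i * 10]  # far away
    spans.append({"text": f"Text{i}", "bbox": box})

search = [0, 0, 100, 100]
t_obj = timeit.timeit(lambda: filter_with_objects(spans, search), number=20)
t_inl = timeit.timeit(lambda: filter_inline(spans, search), number=20)
print(f"objects: {t_obj:.4f}s  inline: {t_inl:.4f}s")
```

Because the caller invokes the function once per table object, any per-call difference measured this way is multiplied by the number of cells, rows, and columns processed.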

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 40 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests
from typing import Dict, List, Tuple

# imports
import pytest  # used for our unit tests
from unstructured_inference.models.table_postprocess import (
    extract_text_from_spans, extract_text_inside_bbox, overlaps)

# Minimal Rect class to support the original functions' geometry calculations.
# This is necessary because the original extract_text_inside_bbox function
# refers to Rect. We implement only the methods used by the original code:
# constructor(Rect(list)), get_area(), intersect(Rect) -> Rect.
class Rect:
    def __init__(self, bbox: List[float]):
        # Expect bbox as [x0, y0, x1, y1]
        self.x0 = float(bbox[0])
        self.y0 = float(bbox[1])
        self.x1 = float(bbox[2])
        self.y1 = float(bbox[3])

    def get_area(self) -> float:
        width = max(0.0, self.x1 - self.x0)
        height = max(0.0, self.y1 - self.y0)
        return width * height

    def intersect(self, other: "Rect") -> "Rect":
        ix0 = max(self.x0, other.x0)
        iy0 = max(self.y0, other.y0)
        ix1 = min(self.x1, other.x1)
        iy1 = min(self.y1, other.y1)
        # If no intersection, return a rect with zero area (x0 >= x1 or y0 >= y1)
        if ix0 >= ix1 or iy0 >= iy1:
            return Rect([0.0, 0.0, 0.0, 0.0])
        return Rect([ix0, iy0, ix1, iy1])

def test_single_span_inside_bbox_returns_text_and_span_list():
    # Single span with bounding box matching search bbox exactly
    spans = [
        {
            "bbox": [0.0, 0.0, 10.0, 10.0],
            "text": "hello",
            "span_num": 0,
            "line_num": 0,
            "block_num": 0,
        }
    ]
    bbox = [0.0, 0.0, 10.0, 10.0]

    text, subset = extract_text_inside_bbox(spans, bbox) # 13.7μs -> 9.92μs (37.9% faster)

def test_integer_superscript_removed_from_text_but_span_still_returned_in_subset():
    # Two spans: one normal, one superscript integer "2"
    normal_span = {
        "bbox": [0.0, 0.0, 10.0, 10.0],
        "text": "A",
        "span_num": 0,
        "line_num": 0,
        "block_num": 0,
    }
    superscript_span = {
        "bbox": [0.0, 0.0, 10.0, 10.0],
        "text": "2",
        "flags": 1,  # 2**0 flag set
        "span_num": 1,
        "line_num": 0,
        "block_num": 0,
    }
    spans = [normal_span, superscript_span]
    bbox = [0.0, 0.0, 10.0, 10.0]

    text, subset = extract_text_inside_bbox(spans, bbox) # 18.8μs -> 12.1μs (54.8% faster)

def test_non_digit_superscript_marked_and_present_in_text():
    # Span with superscript flag but non-digit content
    span = {
        "bbox": [0.0, 0.0, 10.0, 10.0],
        "text": "th",  # not digit
        "flags": 1,
        "span_num": 0,
        "line_num": 0,
        "block_num": 0,
    }
    spans = [span]
    bbox = [0.0, 0.0, 10.0, 10.0]

    text, subset = extract_text_inside_bbox(spans, bbox) # 14.7μs -> 10.7μs (36.5% faster)

def test_span_overlap_below_threshold_excluded():
    # Span area is 100 (10x10). Intersection with search bbox is 2x10 = 20 => 0.2 fraction.
    span = {
        "bbox": [0.0, 0.0, 10.0, 10.0],
        "text": "partial",
        "span_num": 0,
        "line_num": 0,
        "block_num": 0,
    }
    spans = [span]
    # Search bbox only intersects rightmost 2 units of span -> overlap fraction 0.2
    bbox = [8.0, 0.0, 20.0, 10.0]

    text, subset = extract_text_inside_bbox(spans, bbox) # 9.64μs -> 5.75μs (67.7% faster)

def test_zero_area_span_excluded():
    # Span with zero width (x0 == x1) => area 0
    span = {
        "bbox": [5.0, 0.0, 5.0, 10.0],  # zero width
        "text": "degenerate",
        "span_num": 0,
        "line_num": 0,
        "block_num": 0,
    }
    spans = [span]
    bbox = [0.0, 0.0, 10.0, 10.0]

    text, subset = extract_text_inside_bbox(spans, bbox) # 5.04μs -> 4.89μs (3.09% faster)

def test_multiple_spans_are_sorted_and_spaced_correctly():
    # Unordered spans across two lines but same block. The function should sort by
    # block_num, line_num, span_num and assemble text with spaces.
    spans = [
        {"bbox": [0, 0, 5, 5], "text": "world", "span_num": 1, "line_num": 0, "block_num": 0},
        {"bbox": [0, 5, 5, 10], "text": "New", "span_num": 2, "line_num": 1, "block_num": 0},
        {"bbox": [0, 5, 5, 10], "text": "line", "span_num": 3, "line_num": 1, "block_num": 0},
        {"bbox": [0, 0, 5, 5], "text": "Hello", "span_num": 0, "line_num": 0, "block_num": 0},
    ]
    bbox = [0.0, 0.0, 10.0, 10.0]

    text, subset = extract_text_inside_bbox(spans, bbox) # 28.4μs -> 18.2μs (55.9% faster)

def test_extract_text_from_spans_hyphen_line_end_no_space():
    # Two lines: first line ends with hyphen and previous char is not space -> no extra space is added
    spans = [
        # First line, two pieces making end with 'end-'
        {"bbox": [0, 0, 5, 5], "text": "end", "span_num": 0, "line_num": 0, "block_num": 0},
        {"bbox": [5, 0, 10, 5], "text": "-", "span_num": 1, "line_num": 0, "block_num": 0},
        # Second line, starts with "next"
        {"bbox": [0, 5, 5, 10], "text": "next", "span_num": 2, "line_num": 1, "block_num": 0},
    ]

    # Use join_with_space=False to test the special-case that inserts exactly one space
    # except when the line ends with hyphen with non-space preceding character.
    assembled = extract_text_from_spans(spans, join_with_space=False, remove_integer_superscripts=False)

def test_large_scale_many_spans_half_inside_bbox():
    # Create 800 spans (under the 1000-element guideline). Alternate bounding boxes
    # so every even-indexed span is inside the search bbox and odd-indexed is outside.
    n = 800
    spans = []
    for i in range(n):
        if i % 2 == 0:
            # inside bbox
            bbox_span = [0.0, float(i), 1.0, float(i + 1)]
            text = f"t{i}"
        else:
            # outside bbox (shifted far to the right)
            bbox_span = [100.0, float(i), 101.0, float(i + 1)]
            text = f"u{i}"
        spans.append(
            {
                "bbox": bbox_span,
                "text": text,
                "span_num": i,
                "line_num": i // 5,  # group into lines of 5 spans for variety
                "block_num": 0 if i < n // 2 else 1,  # some in block 0, some in block 1
            }
        )

    # Search bbox includes x in [0,2] so only even-indexed spans included
    search_bbox = [0.0, 0.0, 2.0, float(n + 10)]

    text, subset = extract_text_inside_bbox(spans, search_bbox) # 2.75ms -> 878μs (213% faster)

    # n is even, so exactly n // 2 spans (the even-indexed ones) should be included
    expected_included = n // 2

    # Verify assembled text includes only the 't{even}' tokens in the correct sorted order.
    # Because block_num and line_num vary, we construct expected assembled text by filtering
    # the original spans and assembling using the same rules used by extract_text_from_spans.
    included_spans = [s for s in spans if overlaps(s["bbox"], search_bbox, 0.5)]
    expected_text = extract_text_from_spans(included_spans, remove_integer_superscripts=True)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
from unstructured_inference.models.table_postprocess import extract_text_inside_bbox

class TestExtractTextInsideBboxBasic:
    """Basic functionality tests for extract_text_inside_bbox."""

    def test_single_span_fully_inside_bbox(self):
        """Test extraction with a single span fully contained within the bbox."""
        spans = [
            {
                "text": "Hello",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            }
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 14.1μs -> 10.5μs (33.3% faster)

    def test_multiple_spans_all_inside_bbox(self):
        """Test extraction with multiple spans all within the bbox."""
        spans = [
            {
                "text": "Hello",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": "world",
                "bbox": [55, 10, 90, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 1,
            },
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 18.8μs -> 13.4μs (40.8% faster)

    def test_span_outside_bbox(self):
        """Test that spans outside the bbox are not included."""
        spans = [
            {
                "text": "Hello",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": "Outside",
                "bbox": [150, 150, 200, 180],
                "block_num": 0,
                "line_num": 0,
                "span_num": 1,
            },
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 18.2μs -> 12.0μs (52.1% faster)

    def test_empty_spans_list(self):
        """Test extraction with an empty spans list."""
        spans = []
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 2.28μs -> 3.73μs (38.8% slower)

    def test_no_spans_inside_bbox(self):
        """Test when no spans fall within the bbox."""
        spans = [
            {
                "text": "Far away",
                "bbox": [200, 200, 300, 250],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            }
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 10.1μs -> 6.31μs (60.2% faster)

    def test_partial_overlap_above_threshold(self):
        """Test span with partial overlap above the default 50% threshold."""
        spans = [
            {
                "text": "Test",
                "bbox": [40, 40, 90, 90],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            }
        ]
        # Bbox overlap is 50x50 out of 50x50 = 100%, which is >= 50%
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 14.1μs -> 10.5μs (34.1% faster)

    def test_return_tuple_structure(self):
        """Test that the function returns a tuple of (text, spans)."""
        spans = [
            {
                "text": "Test",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            }
        ]
        bbox = [0, 0, 100, 100]
        codeflash_output = extract_text_inside_bbox(spans, bbox) # 14.1μs -> 10.3μs (37.1% faster)
        result = codeflash_output

class TestExtractTextInsideBboxEdge:
    """Edge case tests for extract_text_inside_bbox."""

    def test_bbox_with_zero_area(self):
        """Test when bbox has zero area (collapsed bbox)."""
        spans = [
            {
                "text": "Text",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            }
        ]
        # Collapsed bbox: x0==x1
        bbox = [10, 10, 10, 30]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 10.6μs -> 3.22μs (230% faster)

    def test_span_with_zero_area(self):
        """Test when a span has zero area."""
        spans = [
            {
                "text": "Point",
                "bbox": [25, 20, 25, 40],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            }
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 5.68μs -> 5.39μs (5.41% faster)

    def test_span_at_bbox_boundary(self):
        """Test spans exactly at the boundaries of the bbox."""
        spans = [
            {
                "text": "Edge",
                "bbox": [0, 0, 50, 50],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            }
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 14.3μs -> 10.6μs (35.9% faster)

    def test_multiple_blocks_and_lines(self):
        """Test spans from multiple blocks and lines."""
        spans = [
            {
                "text": "Block0Line0",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": "Block0Line1",
                "bbox": [10, 40, 50, 60],
                "block_num": 0,
                "line_num": 1,
                "span_num": 1,
            },
            {
                "text": "Block1Line0",
                "bbox": [10, 80, 50, 100],
                "block_num": 1,
                "line_num": 0,
                "span_num": 2,
            },
        ]
        bbox = [0, 0, 100, 120]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 24.9μs -> 16.8μs (48.1% faster)

    def test_span_with_integer_superscript_flag(self):
        """Test that integer superscripts are removed when flag is set."""
        spans = [
            {
                "text": "Hello",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": "2",
                "bbox": [55, 5, 65, 15],
                "block_num": 0,
                "line_num": 0,
                "span_num": 1,
                "flags": 1,  # superscript flag (2**0 = 1)
            },
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 19.4μs -> 12.9μs (49.9% faster)

    def test_span_with_superscript_non_integer(self):
        """Test that non-integer superscripts are kept."""
        spans = [
            {
                "text": "Test",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": "a",
                "bbox": [55, 5, 65, 15],
                "block_num": 0,
                "line_num": 0,
                "span_num": 1,
                "flags": 1,  # superscript flag
            },
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 20.5μs -> 14.5μs (41.4% faster)

    def test_span_without_flags_key(self):
        """Test that spans without 'flags' key are handled correctly."""
        spans = [
            {
                "text": "NoFlags",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": "WithFlags",
                "bbox": [55, 10, 90, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 1,
                "flags": 0,
            },
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 19.6μs -> 13.8μs (42.2% faster)

    def test_negative_coordinates(self):
        """Test with negative bbox coordinates."""
        spans = [
            {
                "text": "Negative",
                "bbox": [-50, -50, -10, -10],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            }
        ]
        bbox = [-100, -100, 0, 0]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 14.0μs -> 10.4μs (33.9% faster)

    def test_very_large_coordinates(self):
        """Test with very large bbox coordinates."""
        spans = [
            {
                "text": "Large",
                "bbox": [1000000, 1000000, 2000000, 3000000],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            }
        ]
        bbox = [0, 0, 3000000, 4000000]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 14.4μs -> 10.7μs (34.5% faster)

    def test_empty_text_span(self):
        """Test with a span containing empty text."""
        spans = [
            {
                "text": "",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": "Content",
                "bbox": [55, 10, 90, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 1,
            },
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 19.5μs -> 13.4μs (45.5% faster)

    def test_span_with_whitespace_only(self):
        """Test with spans containing only whitespace."""
        spans = [
            {
                "text": "   ",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": "Text",
                "bbox": [55, 10, 90, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 1,
            },
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 19.3μs -> 13.5μs (43.7% faster)

    def test_span_with_special_characters(self):
        """Test spans containing special characters and unicode."""
        spans = [
            {
                "text": "Hello\u2019s",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": "\u00e9\u00e8\u00ea",
                "bbox": [55, 10, 90, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 1,
            },
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 20.0μs -> 14.1μs (41.6% faster)

    def test_partial_overlap_below_threshold(self):
        """Test span with partial overlap below the 50% threshold."""
        # The 30x30 span (area 900) intersects the bbox in a 20x20 region (area 400),
        # so the overlap fraction is 400/900 ≈ 0.44, below the 0.5 threshold.
        spans = [
            {
                "text": "Barely",
                "bbox": [80, 80, 110, 110],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            }
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 10.0μs -> 6.76μs (48.3% faster)

    def test_span_numbers_not_sequential(self):
        """Test spans with non-sequential span numbers."""
        spans = [
            {
                "text": "First",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 5,
            },
            {
                "text": "Second",
                "bbox": [55, 10, 90, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 2,
            },
            {
                "text": "Third",
                "bbox": [95, 10, 130, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 10,
            },
        ]
        bbox = [0, 0, 150, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 23.8μs -> 15.6μs (52.4% faster)

    def test_overlapping_spans_same_position(self):
        """Test multiple spans at the same position."""
        spans = [
            {
                "text": "First",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": "Second",
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 1,
            },
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 19.3μs -> 13.4μs (44.6% faster)

class TestExtractTextInsideBboxLargeScale:
    """Large-scale tests for extract_text_inside_bbox."""

    def test_many_spans_all_inside_bbox(self):
        """Test with a large number of spans all inside the bbox."""
        # Create 500 spans arranged in a grid
        spans = []
        for i in range(500):
            row = i // 10
            col = i % 10
            spans.append({
                "text": f"Text{i}",
                "bbox": [col * 10, row * 10, col * 10 + 8, row * 10 + 8],
                "block_num": row // 5,
                "line_num": row % 5,
                "span_num": i,
            })
        
        bbox = [0, 0, 200, 500]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 1.88ms -> 689μs (172% faster)

    def test_many_spans_mixed_inside_outside(self):
        """Test with many spans, some inside and some outside the bbox."""
        # Create 500 spans, half inside bbox, half outside
        spans = []
        for i in range(500):
            if i < 250:
                # Inside bbox
                x = (i % 20) * 5
                y = (i // 20) * 5
                bbox_val = [x, y, x + 4, y + 4]
            else:
                # Outside bbox
                bbox_val = [500 + i * 2, 500 + i * 2, 510 + i * 2, 510 + i * 2]
            
            spans.append({
                "text": f"Text{i}",
                "bbox": bbox_val,
                "block_num": i // 100,
                "line_num": (i % 100) // 10,
                "span_num": i,
            })
        
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 1.66ms -> 541μs (206% faster)

    def test_large_number_of_blocks_and_lines(self):
        """Test with spans distributed across many blocks and lines."""
        # Create 300 spans across 10 blocks with 30 lines each
        spans = []
        span_num = 0
        for block in range(10):
            for line in range(30):
                spans.append({
                    "text": f"B{block}L{line}",
                    "bbox": [10 + line * 5, 10 + block * 50, 14 + line * 5, 40 + block * 50],
                    "block_num": block,
                    "line_num": line,
                    "span_num": span_num,
                })
                span_num += 1
        
        bbox = [0, 0, 500, 600]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 1.22ms -> 485μs (151% faster)

    def test_performance_with_many_spans_outside_bbox(self):
        """Test performance when most spans are outside the bbox."""
        # Create 800 spans, only a few inside the bbox
        spans = []
        for i in range(800):
            if i < 10:
                # Small region inside
                bbox_val = [i * 2, i * 2, i * 2 + 1, i * 2 + 1]
            else:
                # Far away
                bbox_val = [2000 + i * 10, 2000 + i * 10, 2010 + i * 10, 2010 + i * 10]
            
            spans.append({
                "text": f"Text{i}",
                "bbox": bbox_val,
                "block_num": i // 100,
                "line_num": (i % 100) // 10,
                "span_num": i,
            })
        
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 2.36ms -> 626μs (277% faster)

    def test_large_text_content(self):
        """Test with spans containing large text content."""
        # Create spans with long text
        long_text = "A" * 1000
        spans = [
            {
                "text": long_text,
                "bbox": [10, 10, 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 0,
            },
            {
                "text": long_text,
                "bbox": [55, 10, 90, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": 1,
            },
        ]
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 19.7μs -> 13.8μs (43.3% faster)

    def test_wide_range_of_span_numbers(self):
        """Test with spans having a wide range of span_num values."""
        spans = []
        span_nums = [0, 100, 1000, 10000, 100000]
        for i, span_num in enumerate(span_nums):
            spans.append({
                "text": f"Text{i}",
                "bbox": [i * 20, 10, i * 20 + 15, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": span_num,
            })
        
        bbox = [0, 0, 200, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 31.0μs -> 18.3μs (69.3% faster)

    def test_many_superscript_integer_removals(self):
        """Test removal of many integer superscripts."""
        spans = []
        for i in range(100):
            # Add regular text
            spans.append({
                "text": f"Text{i}",
                "bbox": [i * 2, 10, i * 2 + 1, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": i * 2,
            })
            # Add integer superscript
            spans.append({
                "text": str(i % 10),
                "bbox": [i * 2 + 0.5, 5, i * 2 + 1.5, 15],
                "block_num": 0,
                "line_num": 0,
                "span_num": i * 2 + 1,
                "flags": 1,  # superscript
            })
        
        bbox = [0, 0, 500, 50]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 795μs -> 250μs (217% faster)

    def test_many_different_blocks_same_line(self):
        """Test with many blocks all on the same line."""
        spans = []
        for block in range(100):
            for i in range(5):
                spans.append({
                    "text": f"B{block}T{i}",
                    "bbox": [block * 50 + i * 8, 10, block * 50 + i * 8 + 7, 30],
                    "block_num": block,
                    "line_num": 0,
                    "span_num": block * 5 + i,
                })
        
        bbox = [0, 0, 6000, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 1.84ms -> 696μs (164% faster)

    def test_mixed_overlap_thresholds(self):
        """Test with spans having varying degrees of overlap."""
        spans = []
        # Create spans with different overlap percentages
        for i in range(100):
            # Span that overlaps progressively less
            overlap_factor = (100 - i) / 100.0
            x_start = 50 - (50 * overlap_factor)
            spans.append({
                "text": f"Text{i}",
                "bbox": [x_start, 10, x_start + 50, 30],
                "block_num": 0,
                "line_num": 0,
                "span_num": i,
            })
        
        bbox = [0, 0, 100, 100]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 383μs -> 141μs (170% faster)

    def test_stress_test_sorting_performance(self):
        """Stress test the sorting logic with many spans needing sort."""
        spans = []
        # Create 400 spans in reverse order across multiple blocks and lines
        for i in range(400, 0, -1):
            block = (400 - i) // 40
            line = ((400 - i) % 40) // 8
            span_num = (400 - i) % 8
            spans.append({
                "text": f"Text{i}",
                "bbox": [10 + span_num * 10, 10 + block * 50 + line * 5, 18 + span_num * 10, 25 + block * 50 + line * 5],
                "block_num": block,
                "line_num": line,
                "span_num": span_num,
            })
        
        bbox = [0, 0, 500, 500]
        text, returned_spans = extract_text_inside_bbox(spans, bbox) # 1.52ms -> 545μs (180% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-extract_text_inside_bbox-mkosi66t and push.


codeflash-ai bot requested a review from aseembits93 January 22, 2026 01:45
codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Jan 22, 2026