⚡️ Speed up function `extract_text_from_spans` by 31% #40
Open
codeflash-ai[bot] wants to merge 1 commit into main from
Conversation
The optimized code achieves a **30% speedup** by replacing three consecutive sorts with a single sort using a tuple key.

## Key Optimization

**Original approach:**

```python
spans_copy.sort(key=lambda span: span["span_num"])
spans_copy.sort(key=lambda span: span["line_num"])
spans_copy.sort(key=lambda span: span["block_num"])
```

**Optimized approach:**

```python
spans_copy.sort(key=lambda span: (span["block_num"], span["line_num"], span["span_num"]))
```

## Why This Works

Python's sort is stable, so the original code sorts by `span_num`, then re-sorts by `line_num` (preserving `span_num` order within each line), then re-sorts by `block_num` (preserving the previous ordering). However, this executes the sorting algorithm **three times**.

The optimized version leverages Python's native tuple comparison: when sorting by `(block_num, line_num, span_num)`, Python automatically compares block first, then line within the same block, then span within the same line. This achieves the identical final ordering in **a single pass**.

## Performance Impact

From the line profiler data:

- **Original:** three sorts take ~14.8ms total (9.5% + 9.8% + 9.5% of 51.68ms)
- **Optimized:** the single sort takes ~6.6ms (15.7% of 42.14ms)

The single-pass sort is **~2.2x faster** than three separate sorts, directly contributing to the overall 30% speedup.

## Test Results Show Strong Gains at Scale

The optimization particularly shines with larger datasets:

- `test_large_scale_many_spans_sorted_and_joined_correctly`: **50.6% faster** (344μs → 228μs)
- `test_large_scale_unsorted_spans`: **44.2% faster** (249μs → 172μs)
- `test_large_scale_with_unicode_spans`: **55.8% faster** (224μs → 144μs)
- `test_large_scale_mixed_block_line_spans`: **37.7% faster** (3.70ms → 2.69ms)

Smaller test cases show modest 5-15% improvements, confirming that the optimization's value scales with input size.
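The stable-sort equivalence argued above can be checked directly. The sketch below uses hypothetical span dicts carrying only the three sort keys plus a `text` tag, so any reordering of tied elements would be visible:

```python
# Verify that a single tuple-key sort reproduces the three-pass ordering
# exactly, including the relative order of fully tied elements.
import random

spans = [
    {"block_num": random.randrange(4),
     "line_num": random.randrange(4),
     "span_num": random.randrange(4),
     "text": f"t{i}"}
    for i in range(200)
]

three_pass = list(spans)
three_pass.sort(key=lambda s: s["span_num"])   # innermost key first
three_pass.sort(key=lambda s: s["line_num"])   # stable: keeps span order per line
three_pass.sort(key=lambda s: s["block_num"])  # stable: keeps line order per block

single_pass = sorted(
    spans, key=lambda s: (s["block_num"], s["line_num"], s["span_num"])
)

assert three_pass == single_pass  # identical final ordering
```

Both strategies yield lexicographic `(block_num, line_num, span_num)` order, with original order preserved among fully tied items, so the refactor is behavior-preserving.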
## Workload Context

Based on `function_references`, this function is called by `extract_text_inside_bbox()` during table text extraction. Since table processing often involves:

- many spans per table cell,
- repeated calls for multiple bounding boxes,
- potentially large documents with numerous tables,

the 30% speedup directly reduces latency in document parsing pipelines, especially for documents with complex table structures containing many text spans that need proper ordering.
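The profiler numbers above can be sanity-checked with a small standalone benchmark. This is only a sketch with synthetic spans and arbitrary sizes; absolute timings are machine-dependent, but the single tuple-key sort should typically come out ahead:

```python
# Rough timing comparison of three stable sorts vs. one tuple-key sort.
import random
import timeit

base = [
    {"block_num": random.randrange(50),
     "line_num": random.randrange(50),
     "span_num": random.randrange(50)}
    for _ in range(5000)
]

def three_sorts():
    spans = list(base)
    spans.sort(key=lambda s: s["span_num"])
    spans.sort(key=lambda s: s["line_num"])
    spans.sort(key=lambda s: s["block_num"])
    return spans

def one_sort():
    spans = list(base)
    spans.sort(key=lambda s: (s["block_num"], s["line_num"], s["span_num"]))
    return spans

print("three sorts:", timeit.timeit(three_sorts, number=100))
print("one sort:   ", timeit.timeit(one_sort, number=100))
```

Note that the three-pass version invokes the full sorting machinery (and a key call per element) three times, while the tuple version pays a slightly more expensive key once, which is where the saving comes from.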
📄 **31% (0.31x) speedup** for `extract_text_from_spans` in `unstructured_inference/models/table_postprocess.py`

⏱️ Runtime: 7.78 milliseconds → 5.95 milliseconds (best of 95 runs)
✅ Correctness verification report:
🌀 Generated Regression Tests
To edit these changes, run `git checkout codeflash/optimize-extract_text_from_spans-mkot6ym0` and push.