The optimized code achieves a **12% speedup** by eliminating redundant array indexing operations within the NMS loop.

**Key Optimization:**

Instead of repeatedly indexing `order[1:]` multiple times per iteration (5 times in the original code for coordinate comparisons plus area lookups), the optimized version extracts this slice once into a variable `rest`. This single extraction is then reused for all subsequent operations.

**Why This Works:**

1. **Reduced Array Slicing Overhead**: Each `order[1:]` operation creates a new array view and involves pointer arithmetic and bounds checking. Doing this once instead of 5+ times per iteration (across 2,237 loop iterations, based on profiler data) saves significant overhead.
2. **Improved Memory Access Pattern**: Reusing the same `rest` slice reference, rather than recreating it multiple times, gives better cache locality.
3. **Boolean Indexing vs `np.where`**: Replacing `np.where(ovr <= nms_thr)[0]` with direct boolean masking (`mask = ovr <= nms_thr; order = rest[mask]`) eliminates the function-call overhead of `np.where` and the subsequent integer addition `inds + 1`. The line profiler shows this reduces time from ~14.6 ms to ~10.5 ms across the loop iterations.

**Performance Impact:**

Based on the line profiler results:

- The coordinate extraction lines (`xx1`, `yy1`, `xx2`, `yy2`) show modest improvements (~0.5-0.6 ms total saved)
- The masking operation shows the biggest win: from 8.7 ms (`np.where`) + 6.0 ms (indexing) = 14.7 ms down to 4.1 ms + 3.6 ms = 7.7 ms, a **~7 ms savings**

**Context & Impact:**

The `nms` function is called from `multiclass_nms_class_agnostic`, which processes detection results after filtering by score threshold. Since this is in a post-processing path for object detection (YOLOX model), even small speedups compound when processing multiple images or video frames.
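The shape of the optimized loop can be sketched as follows. This is a minimal reconstruction based on the description above, not the exact code from `unstructured_inference/models/yolox.py`; the standard `+ 1` pixel-area convention is an assumption.

```python
import numpy as np

def nms(boxes, scores, nms_thr):
    """Single-class NMS sketch: greedily keep the highest-scoring box,
    then drop remaining boxes whose IoU with it exceeds nms_thr."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]  # indices sorted by descending score

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]  # extracted once and reused (the key optimization)

        # Intersection of box i with every remaining candidate.
        xx1 = np.maximum(x1[i], x1[rest])
        yy1 = np.maximum(y1[i], y1[rest])
        xx2 = np.minimum(x2[i], x2[rest])
        yy2 = np.minimum(y2[i], y2[rest])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        ovr = inter / (areas[i] + areas[rest] - inter)

        # Boolean mask on `rest` replaces np.where(...)[0] plus `inds + 1`.
        order = rest[ovr <= nms_thr]
    return keep
```

Note that because the mask is applied to `rest` rather than `order`, the off-by-one index shift (`inds + 1`) disappears entirely.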
The optimization is most beneficial for:

- Cases with many overlapping boxes (as shown by the 22-26% speedup in tests like `test_one_threshold` and `test_boxes_with_zero_area`)
- Large-scale scenarios with 100+ boxes (11-14% improvements consistently observed)
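The `np.where`-to-boolean-mask rewrite described above can be checked for equivalence in isolation. The overlap values and indices below are hypothetical, chosen only to illustrate the two patterns side by side.

```python
import numpy as np

# Hypothetical overlaps of the kept box with the 5 remaining candidates.
ovr = np.array([0.1, 0.7, 0.3, 0.9, 0.2])
nms_thr = 0.5
order = np.array([40, 12, 7, 33, 5, 21])  # order[0] is the box just kept

# Original pattern: integer indices from np.where, shifted by 1
# because `ovr` is aligned with order[1:], not order.
inds = np.where(ovr <= nms_thr)[0]
kept_original = order[inds + 1]

# Optimized pattern: slice once, then boolean-mask directly.
rest = order[1:]
kept_optimized = rest[ovr <= nms_thr]

assert np.array_equal(kept_original, kept_optimized)
```

Both forms select the same surviving indices; the optimized one skips the `np.where` call and the extra integer-add over the index array.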
📄 **12% (0.12x) speedup** for `nms` in `unstructured_inference/models/yolox.py`

⏱️ **Runtime:** 57.5 milliseconds → 51.3 milliseconds (best of 61 runs)
To edit these changes, run `git checkout codeflash/optimize-nms-mkowazdu` and push.