⚡️ Speed up function multiclass_nms_class_agnostic by 15% #50
Open
codeflash-ai[bot] wants to merge 1 commit into main
The optimized code achieves a **15% speedup** by introducing three key optimizations to the NMS (Non-Maximum Suppression) algorithm:

## What Changed

1. **Early exit for empty inputs**: Added a check `if valid_scores.size == 0` in `multiclass_nms_class_agnostic` and `if boxes.size == 0` in `nms` to avoid unnecessary processing when no valid detections exist.
2. **Early loop termination**: Added `if order.size == 1: break` to exit immediately when only one box remains, avoiding redundant slicing and intersection calculations on the final iteration.
3. **Reduced array indexing and in-place operations**:
   - Introduced `rem = order[1:]` to compute the slice once instead of repeating `order[1:]` six times per iteration
   - Changed `w = np.maximum(0.0, xx2 - xx1 + 1)` to separate computation and clamping: `w = xx2 - xx1 + 1; np.maximum(w, 0.0, out=w)` to reuse memory
   - Applied the same pattern for height calculations
   - Replaced `np.where(ovr <= nms_thr)[0]` with direct boolean indexing: `keep_mask = ovr <= nms_thr; order = rem[keep_mask]`

## Why It's Faster

**Memory efficiency**: The in-place operations (`out=w`, `out=h`) avoid allocating a separate output array for each clamp. In the original code, `np.maximum(0.0, xx2 - xx1 + 1)` materializes `xx2 - xx1 + 1` as a temporary and then allocates a second array for the result of the maximum. The optimized version writes the clamped values back into the buffer that already holds `xx2 - xx1 + 1`, saving one allocation per clamp.

**Reduced indexing overhead**: Computing `rem = order[1:]` once and reusing it eliminates five redundant slice operations per loop iteration. With 703 iterations in the profiler results, this saves ~3,500 array slicing operations.

**Early exits**: The empty-input checks are particularly effective for test cases where all scores fall below the threshold (showing a 63-71% speedup in those specific tests). The single-box early exit saves work on the final iteration of every NMS run.
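The changes listed above can be put together in one small function. The following is a minimal sketch, not the code from this PR: the name `nms_sketch`, the `[x1, y1, x2, y2]` box layout, and the cast to `float64` are assumptions made for illustration. The cast matters because `np.maximum(w, 0.0, out=w)` cannot write a floating-point result into an integer buffer.

```python
import numpy as np


def nms_sketch(boxes, scores, nms_thr):
    """Illustrative single-class NMS using the three changes described above.

    Assumes boxes has shape (N, 4) in [x1, y1, x2, y2] order.
    """
    # Assumption: work in float64 so the in-place clamp below is valid.
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)

    if boxes.size == 0:  # early exit: nothing to suppress
        return []

    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]  # indices sorted by descending score

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:  # last box: no overlaps left to compute
            break

        rem = order[1:]  # slice once, reuse for every lookup below
        xx1 = np.maximum(x1[i], x1[rem])
        yy1 = np.maximum(y1[i], y1[rem])
        xx2 = np.minimum(x2[i], x2[rem])
        yy2 = np.minimum(y2[i], y2[rem])

        w = xx2 - xx1 + 1
        np.maximum(w, 0.0, out=w)  # clamp negative widths to 0 in place
        h = yy2 - yy1 + 1
        np.maximum(h, 0.0, out=h)  # same pattern for heights

        inter = w * h
        ovr = inter / (areas[i] + areas[rem] - inter)

        keep_mask = ovr <= nms_thr  # boolean mask instead of np.where(...)[0]
        order = rem[keep_mask]

    return keep
```

For example, with two heavily overlapping boxes and one distant box, `nms_sketch([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], [0.9, 0.8, 0.7], 0.5)` keeps the first and third boxes and suppresses the second.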
## Performance Impact

The annotated tests show:

- **Largest gains (63-71%)** on empty/filtered inputs where early exits trigger
- **Moderate gains (13-25%)** on typical workloads with 100-300 detections
- **Minimal overhead (~1-4%)** on cases with minimal overlap where NMS is already fast

The optimization is most effective when the NMS loop runs many iterations or when input filtering produces empty results, making it well-suited for production object detection pipelines where score thresholds frequently eliminate weak detections.
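The empty-input case that shows the largest gains can be illustrated with a short sketch of the score filter that runs before NMS. This is a hedged example under stated assumptions: `filter_detections` is a hypothetical helper, the per-class score matrix of shape `(N, num_classes)` is assumed, and returning `None` on an empty result is an assumption rather than the PR's documented behavior.

```python
import numpy as np


def filter_detections(boxes, scores, score_thr):
    """Hypothetical helper: class-agnostic score filter with an early exit."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)

    cls_inds = scores.argmax(axis=1)  # best class per box
    cls_scores = scores[np.arange(len(cls_inds)), cls_inds]
    valid_mask = cls_scores > score_thr
    valid_scores = cls_scores[valid_mask]

    if valid_scores.size == 0:  # early exit: skip NMS entirely
        return None  # assumed return value for "no detections"

    return boxes[valid_mask], valid_scores, cls_inds[valid_mask]
```

When every score falls below `score_thr`, the function returns before any box indexing or NMS work happens, which is the situation the 63-71% figures above refer to.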
📄 15% (0.15x) speedup for `multiclass_nms_class_agnostic` in `unstructured_inference/models/yolox.py`

⏱️ Runtime: 19.0 milliseconds → 16.5 milliseconds (best of 103 runs)
✅ Correctness verification report:
To edit these changes, `git checkout codeflash/optimize-multiclass_nms_class_agnostic-mkow354n` and push.