⚡️ Speed up function multiclass_nms_class_agnostic by 15% #50

Open
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-multiclass_nms_class_agnostic-mkow354n

@codeflash-ai codeflash-ai bot commented Jan 22, 2026

📄 15% (0.15x) speedup for multiclass_nms_class_agnostic in unstructured_inference/models/yolox.py

⏱️ Runtime : 19.0 milliseconds → 16.5 milliseconds (best of 103 runs)

📝 Explanation and details

The optimized code achieves a 15% speedup by introducing three key optimizations to the NMS (Non-Maximum Suppression) algorithm:

What Changed

  1. Early exit for empty inputs: Added a check if valid_scores.size == 0 in multiclass_nms_class_agnostic and if boxes.size == 0 in nms to avoid unnecessary processing when no valid detections exist.

  2. Early loop termination: Added if order.size == 1: break to exit immediately when only one box remains, avoiding redundant slicing and intersection calculations on the final iteration.

  3. Reduced array indexing and in-place operations:

    • Introduced rem = order[1:] to compute the slice once instead of repeating order[1:] six times per iteration
    • Changed w = np.maximum(0.0, xx2 - xx1 + 1) to separate computation and clamping: w = xx2 - xx1 + 1; np.maximum(w, 0.0, out=w) to reuse memory
    • Applied the same pattern for height calculations
    • Replaced np.where(ovr <= nms_thr)[0] with direct boolean indexing: keep_mask = ovr <= nms_thr; order = rem[keep_mask]
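
The three changes above can be sketched as follows. This is a minimal reconstruction from the description, not the PR's actual diff: the function name `nms_sketch` is hypothetical, and the `[x1, y1, x2, y2]` box layout, the `+ 1` pixel convention, and floating-point inputs are assumptions taken from the snippets quoted above.

```python
import numpy as np


def nms_sketch(boxes, scores, nms_thr):
    """Greedy single-class NMS; returns kept indices (hypothetical helper)."""
    # Change 1: early exit for empty inputs.
    if boxes.size == 0:
        return []

    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]  # highest score first

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Change 2: with one box left there is nothing to compare against.
        if order.size == 1:
            break

        rem = order[1:]  # Change 3: slice once, reuse below
        xx1 = np.maximum(x1[i], x1[rem])
        yy1 = np.maximum(y1[i], y1[rem])
        xx2 = np.minimum(x2[i], x2[rem])
        yy2 = np.minimum(y2[i], y2[rem])

        # Compute the extent, then clamp it in place (assumes float arrays).
        w = xx2 - xx1 + 1
        np.maximum(w, 0.0, out=w)
        h = yy2 - yy1 + 1
        np.maximum(h, 0.0, out=h)

        inter = w * h
        ovr = inter / (areas[i] + areas[rem] - inter)

        # Boolean mask on the slice instead of np.where(...)[0] + 1 on `order`.
        keep_mask = ovr <= nms_thr
        order = rem[keep_mask]
    return keep
```

With integer box arrays the in-place clamp would need a cast, since `np.maximum(w, 0.0, out=w)` cannot write a float result into an integer buffer under NumPy's default casting rule.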

Why It's Faster

Memory efficiency: The in-place clamps (out=w, out=h) avoid one array allocation each per loop iteration. In the original code, np.maximum(0.0, xx2 - xx1 + 1) allocates temporaries for the arithmetic and then a further array for the result of the maximum. The optimized version writes the clamped values back into the buffer that already holds xx2 - xx1 + 1.
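
A small illustration of the in-place clamp, using made-up coordinate values. It only shows that the `out=` form reuses the existing buffer and produces the same values; it does not measure the speed difference.

```python
import numpy as np

# Illustrative intersection edges (hypothetical values, not from the PR).
xx1 = np.array([1.0, 50.0, 3.0])
xx2 = np.array([10.0, 10.0, 8.0])

# Original form: the clamp allocates a fresh result array.
w_alloc = np.maximum(0.0, xx2 - xx1 + 1)

# In-place form: the clamp writes back into the array that holds the widths.
w = xx2 - xx1 + 1
result = np.maximum(w, 0.0, out=w)

same_buffer = result is w                 # a ufunc returns its `out` array
same_values = np.array_equal(w, w_alloc)  # both forms clamp negatives to 0
```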

Reduced indexing overhead: Computing rem = order[1:] once and reusing it eliminates 5 redundant slice operations per loop iteration. With 703 iterations in the profiler results, this saves ~3,500 array slicing operations.
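
The equivalence between the two indexing forms, and the slice count quoted above, can be checked with a short sketch. The array values here are invented for illustration.

```python
import numpy as np

order = np.array([4, 2, 7, 1, 5])         # candidate indices, best first
ovr = np.array([0.10, 0.80, 0.25, 0.60])  # IoU of order[0] against order[1:]
nms_thr = 0.3

# Original form: positions from np.where, shifted by one to index `order`.
inds = np.where(ovr <= nms_thr)[0]
survivors_where = order[inds + 1]

# Optimized form: slice once, then index the slice with the boolean mask.
rem = order[1:]
survivors_mask = rem[ovr <= nms_thr]

# Slice arithmetic from the text: 6 slices per iteration become 1,
# over the 703 iterations reported by the profiler.
slices_saved = 703 * (6 - 1)
```

Both forms select the same survivors in the same order, so the change affects only how the selection is computed.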

Early exits: The empty input checks are particularly effective for test cases where all scores fall below threshold (showing 63-71% speedup in those specific tests). The single-box early exit saves work on the final iteration of every NMS run.
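
A rough sketch of where the empty-input check sits in the class-agnostic wrapper. Everything here is an assumption for illustration: the name `multiclass_nms_sketch`, the injected `nms_fn` parameter, the strict `>` comparison against `score_thr`, and the `(0, 6)` empty return shape (which follows the expectation stated in the generated tests, not the PR's code).

```python
import numpy as np


def multiclass_nms_sketch(boxes, scores, nms_thr, score_thr, nms_fn):
    """Class-agnostic NMS wrapper with an early exit (hypothetical helper)."""
    # Best class and its score for every box.
    cls_inds = scores.argmax(1)
    cls_scores = scores[np.arange(len(cls_inds)), cls_inds]

    valid = cls_scores > score_thr
    valid_scores = cls_scores[valid]

    # Early exit: nothing passed the score threshold, so skip NMS entirely.
    if valid_scores.size == 0:
        return np.empty((0, 6), dtype=boxes.dtype)

    valid_boxes = boxes[valid]
    valid_cls = cls_inds[valid]
    keep = nms_fn(valid_boxes, valid_scores, nms_thr)

    # Columns: x1, y1, x2, y2, score, class index (as float).
    return np.concatenate(
        [valid_boxes[keep], valid_scores[keep, None], valid_cls[keep, None]],
        axis=1,
    )
```

Passing the NMS routine in as `nms_fn` is only a device to keep the sketch self-contained; it also makes it easy to confirm that the early exit never reaches NMS.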

Performance Impact

The annotated tests show:

  • Largest gains (63-71%) on empty/filtered inputs where early exits trigger
  • Moderate gains (13-25%) on typical workloads with 100-300 detections
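  • Minimal change (from about 2% slower to about 4% faster) on small inputs with heavily overlapping boxes, where most boxes are suppressed in the first iteration and the loop is already short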

The optimization is most effective when the NMS loop runs many iterations or when input filtering produces empty results, making it well-suited for production object detection pipelines where score thresholds frequently eliminate weak detections.

Correctness verification report:

Test | Status
⚙️ Existing Unit Tests | 🔘 None Found
🌀 Generated Regression Tests | ✅ 34 Passed
⏪ Replay Tests | 🔘 None Found
🔎 Concolic Coverage Tests | 🔘 None Found
📊 Tests Coverage | 100.0%
🌀 Generated Regression Tests
import numba
import numpy as np
# imports
import pytest  # used for our unit tests
from unstructured_inference.models.yolox import multiclass_nms_class_agnostic

def test_empty_input_returns_empty():
    # Empty input: boxes shape (0, 4). Expect empty (0,6) output.
    boxes = np.empty((0, 4), dtype=np.float32)  # empty float32 boxes
    scores = np.empty((0, 3), dtype=np.float32)  # zero rows, 3 classes
    # Call function with any thresholds; should return empty array
    codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.1); dets = codeflash_output # 59.7μs -> 36.0μs (66.0% faster)

def test_all_scores_below_threshold_returns_empty():
    # Two boxes but all class scores below threshold -> no detections
    boxes = np.array([[0.0, 0.0, 5.0, 5.0], [10.0, 10.0, 15.0, 15.0]], dtype=np.float64)
    # all scores are small
    scores = np.array([[0.01, 0.02], [0.03, 0.04]], dtype=np.float64)
    # threshold above any score
    codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.1); dets = codeflash_output # 62.4μs -> 36.4μs (71.5% faster)

def test_basic_non_overlapping_keeps_all_and_correct_class_assignment():
    # Three non-overlapping boxes; each has a clear best class
    boxes = np.array([
        [0.0, 0.0, 10.0, 10.0],
        [20.0, 20.0, 30.0, 30.0],
        [40.0, 40.0, 50.0, 50.0]
    ], dtype=np.float64)
    # per-box class scores; maxima are unambiguous
    scores = np.array([
        [0.1, 0.9],  # class 1 best
        [0.8, 0.2],  # class 0 best
        [0.3, 0.4]   # class 1 best
    ], dtype=np.float64)
    # use score threshold allowing all to pass
    codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.2); dets = codeflash_output # 153μs -> 135μs (13.3% faster)
    # dets columns: x1,y1,x2,y2,score,class_index(as float)
    # Check that scores roughly equal the expected per-box maxima
    expected_scores = [0.9, 0.8, 0.4]
    for i in range(3):
        assert dets[i, 4] == pytest.approx(expected_scores[i])
    # Check class indices (cast to float in output)
    expected_classes = [1.0, 0.0, 1.0]
    for i in range(3):
        assert dets[i, 5] == expected_classes[i]

def test_tie_between_classes_prefers_lower_index():
    # Single box where two classes tie for best score -> code should pick the first (lower index)
    boxes = np.array([[0.0, 0.0, 5.0, 5.0]], dtype=np.float64)
    scores = np.array([[0.5, 0.5, 0.2]], dtype=np.float64)  # tie for class 0 and class 1
    codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.1); dets = codeflash_output # 101μs -> 64.0μs (58.1% faster)

def test_nms_suppresses_overlapping_boxes_class_agnostic():
    # Two heavily overlapping boxes; higher scored one should be kept
    boxes = np.array([
        [0.0, 0.0, 10.0, 10.0],
        [1.0, 1.0, 11.0, 11.0]
    ], dtype=np.float64)
    # set different class distributions but same top class; second box has lower max score
    scores = np.array([
        [0.9, 0.1],  # box0 best = 0.9
        [0.8, 0.2]   # box1 best = 0.8
    ], dtype=np.float64)
    # low nms_thr ensures overlapping boxes are suppressed
    codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.3, score_thr=0.0); dets = codeflash_output # 107μs -> 104μs (3.33% faster)

def test_large_scale_consistency_and_properties():
    # Large but bounded number of boxes to check scalability and deterministic behavior
    rng = np.random.default_rng(seed=42)  # deterministic RNG
    n = 300  # within the specified upper bound (<1000)
    # Sample centers uniformly in a 100x100 area so that many boxes overlap
    xs = rng.uniform(0, 100, size=n)
    ys = rng.uniform(0, 100, size=n)
    # small boxes around each center
    w = 8.0
    h = 8.0
    boxes = np.empty((n, 4), dtype=np.float64)
    for i in range(n):
        x = xs[i]
        y = ys[i]
        boxes[i, 0] = x - w / 2.0
        boxes[i, 1] = y - h / 2.0
        boxes[i, 2] = x + w / 2.0
        boxes[i, 3] = y + h / 2.0
    # Create three classes with random scores; the score threshold below filters some boxes out
    num_classes = 3
    scores = rng.uniform(0.0, 1.0, size=(n, num_classes)).astype(np.float64)
    score_thr = 0.5
    nms_thr = 0.45
    # First call (will compile the numba functions on first use)
    codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=nms_thr, score_thr=score_thr); dets1 = codeflash_output # 4.91ms -> 4.30ms (14.3% faster)
    # Second call should produce identical results (determinism)
    codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=nms_thr, score_thr=score_thr); dets2 = codeflash_output # 4.82ms -> 4.25ms (13.4% faster)
    assert dets1.shape == dets2.shape
    assert np.array_equal(dets1, dets2)
    # All returned scores must be above threshold
    assert np.all(dets1[:, 4] > score_thr)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import numba
import numpy as np
import pytest
from unstructured_inference.models.yolox import multiclass_nms_class_agnostic

class TestMulticlassNmsClassAgnosticBasic:
    """Basic test cases for multiclass_nms_class_agnostic function."""
    
    def test_empty_input(self):
        """Test with empty input arrays - should return empty result."""
        boxes = np.empty((0, 4), dtype=np.float32)
        scores = np.empty((0, 3), dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 58.2μs -> 34.8μs (67.1% faster)
    
    def test_single_detection_above_threshold(self):
        """Test with a single detection that is above the score threshold."""
        boxes = np.array([[10.0, 20.0, 30.0, 40.0]], dtype=np.float32)
        scores = np.array([[0.1, 0.8, 0.1]], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 103μs -> 66.4μs (56.2% faster)
    
    def test_single_detection_below_threshold(self):
        """Test with a single detection that is below the score threshold."""
        boxes = np.array([[10.0, 20.0, 30.0, 40.0]], dtype=np.float32)
        scores = np.array([[0.1, 0.3, 0.1]], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 59.0μs -> 36.2μs (63.2% faster)
    
    def test_multiple_detections_no_overlap(self):
        """Test with multiple detections that do not overlap."""
        boxes = np.array([
            [10.0, 20.0, 30.0, 40.0],
            [50.0, 60.0, 70.0, 80.0],
            [100.0, 120.0, 130.0, 140.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.9, 0.0],
            [0.2, 0.8, 0.0],
            [0.3, 0.7, 0.0]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 154μs -> 129μs (19.8% faster)
    
    def test_multiple_detections_with_overlap(self):
        """Test with overlapping detections - NMS should remove some."""
        boxes = np.array([
            [10.0, 20.0, 50.0, 60.0],
            [15.0, 25.0, 55.0, 65.0],
            [100.0, 120.0, 130.0, 140.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.9, 0.0],
            [0.2, 0.85, 0.0],
            [0.3, 0.7, 0.0]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.3, score_thr=0.5); result = codeflash_output # 130μs -> 107μs (21.7% faster)
    
    def test_all_detections_below_threshold(self):
        """Test when all detections are below the score threshold."""
        boxes = np.array([
            [10.0, 20.0, 30.0, 40.0],
            [50.0, 60.0, 70.0, 80.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.4, 0.4, 0.2],
            [0.3, 0.3, 0.4]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 58.4μs -> 35.9μs (62.6% faster)
    
    def test_multiclass_correct_class_selection(self):
        """Test that the highest score class is correctly selected."""
        boxes = np.array([[10.0, 20.0, 30.0, 40.0]], dtype=np.float32)
        scores = np.array([[0.1, 0.3, 0.8]], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 103μs -> 63.5μs (62.3% faster)
    
    def test_float64_dtype(self):
        """Test that the function works with float64 dtype."""
        boxes = np.array([[10.0, 20.0, 30.0, 40.0]], dtype=np.float64)
        scores = np.array([[0.1, 0.8, 0.1]], dtype=np.float64)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 101μs -> 63.1μs (61.2% faster)

class TestMulticlassNmsClassAgnosticEdge:
    """Edge case test cases for multiclass_nms_class_agnostic function."""
    
    def test_zero_score_threshold(self):
        """Test with score threshold set to zero - should include all detections."""
        boxes = np.array([
            [10.0, 20.0, 30.0, 40.0],
            [50.0, 60.0, 70.0, 80.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.01, 0.0],
            [0.0, 0.001, 0.0]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.0); result = codeflash_output # 128μs -> 108μs (18.5% faster)
    
    def test_high_score_threshold(self):
        """Test with very high score threshold - may filter out all detections."""
        boxes = np.array([
            [10.0, 20.0, 30.0, 40.0],
            [50.0, 60.0, 70.0, 80.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.8, 0.1],
            [0.2, 0.7, 0.1]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.95); result = codeflash_output # 59.1μs -> 36.1μs (63.4% faster)
    
    def test_zero_nms_threshold(self):
        """Test with NMS threshold set to zero - very strict NMS."""
        boxes = np.array([
            [10.0, 20.0, 30.0, 40.0],
            [10.01, 20.01, 30.01, 40.01]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.9, 0.0],
            [0.1, 0.89, 0.0]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.0, score_thr=0.5); result = codeflash_output # 109μs -> 105μs (3.54% faster)
    
    def test_one_nms_threshold(self):
        """Test with NMS threshold set to one - no NMS suppression."""
        boxes = np.array([
            [10.0, 20.0, 30.0, 40.0],
            [10.01, 20.01, 30.01, 40.01],
            [50.0, 60.0, 70.0, 80.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.9, 0.0],
            [0.1, 0.89, 0.0],
            [0.1, 0.88, 0.0]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=1.0, score_thr=0.5); result = codeflash_output # 151μs -> 126μs (19.7% faster)
    
    def test_single_class(self):
        """Test with single class only."""
        boxes = np.array([
            [10.0, 20.0, 30.0, 40.0],
            [50.0, 60.0, 70.0, 80.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.8],
            [0.9]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 128μs -> 106μs (21.1% faster)
    
    def test_many_classes(self):
        """Test with many classes."""
        boxes = np.array([[10.0, 20.0, 30.0, 40.0]], dtype=np.float32)
        # Create scores with 100 classes
        scores = np.random.rand(1, 100).astype(np.float32)
        scores[0, 50] = 0.95  # Set a high score at index 50
        
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 100μs -> 71.8μs (40.6% faster)
    
    def test_identical_box_coordinates(self):
        """Test with identical box coordinates but different classes."""
        boxes = np.array([
            [10.0, 20.0, 30.0, 40.0],
            [10.0, 20.0, 30.0, 40.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.8, 0.1],
            [0.1, 0.7, 0.2]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 106μs -> 108μs (1.92% slower)
    
    def test_very_small_boxes(self):
        """Test with very small bounding boxes."""
        boxes = np.array([
            [10.0, 20.0, 10.1, 20.1],
            [30.0, 40.0, 30.1, 40.1]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.9, 0.0],
            [0.1, 0.8, 0.0]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 127μs -> 109μs (16.9% faster)
    
    def test_very_large_boxes(self):
        """Test with very large bounding boxes."""
        boxes = np.array([
            [0.0, 0.0, 1e6, 1e6],
            [1e5, 1e5, 9e5, 9e5]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.9, 0.0],
            [0.1, 0.85, 0.0]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.3, score_thr=0.5); result = codeflash_output # 108μs -> 108μs (0.592% faster)
    
    def test_negative_box_coordinates(self):
        """Test with negative bounding box coordinates."""
        boxes = np.array([
            [-50.0, -40.0, -10.0, -5.0],
            [-60.0, -50.0, -20.0, -10.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.9, 0.0],
            [0.1, 0.8, 0.0]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 129μs -> 109μs (18.6% faster)
    
    def test_mixed_positive_negative_coordinates(self):
        """Test with mixed positive and negative coordinates."""
        boxes = np.array([
            [-10.0, -20.0, 10.0, 20.0],
            [5.0, 10.0, 25.0, 30.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.85, 0.05],
            [0.1, 0.8, 0.1]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 128μs -> 106μs (20.6% faster)
    
    def test_equal_scores_same_class(self):
        """Test with equal scores for the same class."""
        boxes = np.array([
            [10.0, 20.0, 30.0, 40.0],
            [50.0, 60.0, 70.0, 80.0]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.8, 0.1],
            [0.1, 0.8, 0.1]
        ], dtype=np.float32)
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 129μs -> 108μs (18.8% faster)

class TestMulticlassNmsClassAgnosticLargeScale:
    """Large-scale test cases for multiclass_nms_class_agnostic function."""
    
    def test_many_detections(self):
        """Test with a large number of detections."""
        # Create 100 non-overlapping detections
        n_detections = 100
        boxes = np.zeros((n_detections, 4), dtype=np.float32)
        scores = np.zeros((n_detections, 10), dtype=np.float32)
        
        for i in range(n_detections):
            # Create non-overlapping boxes in a grid
            row = i // 10
            col = i % 10
            x1 = col * 100.0
            y1 = row * 100.0
            boxes[i] = [x1, y1, x1 + 50.0, y1 + 50.0]
            
            # Assign random scores with one high score per detection
            scores[i] = np.random.rand(10).astype(np.float32) * 0.4
            scores[i, np.random.randint(0, 10)] = np.random.rand(1).astype(np.float32) * 0.6 + 0.4
        
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 1.76ms -> 1.53ms (14.8% faster)
    
    def test_many_classes(self):
        """Test with a large number of classes."""
        # Create detections with 100 classes
        boxes = np.array([
            [10.0 + i * 100, 20.0, 30.0 + i * 100, 40.0]
            for i in range(50)
        ], dtype=np.float32)
        
        scores = np.random.rand(50, 100).astype(np.float32)
        # Set high scores for class selection
        for i in range(50):
            scores[i, np.random.randint(0, 100)] = np.random.rand(1).astype(np.float32) * 0.6 + 0.4
        
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 1.12ms -> 1.00ms (12.2% faster)
    
    def test_dense_overlapping_detections(self):
        """Test with many overlapping detections in a small area."""
        # Create 50 detections with heavy overlap
        n_detections = 50
        boxes = np.zeros((n_detections, 4), dtype=np.float32)
        scores = np.zeros((n_detections, 5), dtype=np.float32)
        
        for i in range(n_detections):
            # All boxes overlap heavily in the same region
            offset_x = np.random.rand(1).astype(np.float32)[0] * 20.0
            offset_y = np.random.rand(1).astype(np.float32)[0] * 20.0
            boxes[i] = [10.0 + offset_x, 20.0 + offset_y, 50.0 + offset_x, 60.0 + offset_y]
            
            # Assign random scores
            scores[i] = np.random.rand(5).astype(np.float32)
        
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.3, score_thr=0.5); result = codeflash_output # 110μs -> 105μs (4.27% faster)
    
    def test_mixed_overlapping_and_non_overlapping(self):
        """Test with a mix of overlapping and non-overlapping detections."""
        boxes = np.zeros((200, 4), dtype=np.float32)
        scores = np.zeros((200, 10), dtype=np.float32)
        
        # Create clusters of overlapping boxes
        for cluster in range(20):
            cluster_start = cluster * 10
            cluster_end = (cluster + 1) * 10
            
            base_x = cluster * 100.0
            base_y = (cluster % 10) * 100.0
            
            for i in range(cluster_start, cluster_end):
                offset_x = np.random.rand(1).astype(np.float32)[0] * 30.0
                offset_y = np.random.rand(1).astype(np.float32)[0] * 30.0
                boxes[i] = [
                    base_x + offset_x, base_y + offset_y,
                    base_x + 50.0 + offset_x, base_y + 50.0 + offset_y
                ]
                
                scores[i] = np.random.rand(10).astype(np.float32)
        
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.4, score_thr=0.5); result = codeflash_output # 1.14ms -> 1.02ms (11.8% faster)
    
    def test_large_batch_all_below_threshold(self):
        """Test with many detections but all below score threshold."""
        n_detections = 500
        boxes = np.random.rand(n_detections, 4).astype(np.float32) * 1000
        # Ensure x1 < x2 and y1 < y2
        boxes[:, 2] = boxes[:, 0] + 50
        boxes[:, 3] = boxes[:, 1] + 50
        
        scores = np.random.rand(n_detections, 20).astype(np.float32) * 0.3
        
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 89.0μs -> 67.2μs (32.4% faster)
    
    def test_high_resolution_coordinates(self):
        """Test with high-resolution (large value) coordinates."""
        boxes = np.array([
            [1e4, 2e4, 3e4, 4e4],
            [5e4, 6e4, 7e4, 8e4],
            [1.5e4, 2.5e4, 3.5e4, 4.5e4]
        ], dtype=np.float32)
        scores = np.array([
            [0.1, 0.9, 0.0],
            [0.1, 0.85, 0.0],
            [0.1, 0.8, 0.0]
        ], dtype=np.float32)
        
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.3, score_thr=0.5); result = codeflash_output # 130μs -> 104μs (25.6% faster)
    
    def test_precision_with_many_similar_scores(self):
        """Test numerical precision with many similar high scores."""
        n_detections = 100
        boxes = np.zeros((n_detections, 4), dtype=np.float32)
        scores = np.zeros((n_detections, 10), dtype=np.float32)
        
        for i in range(n_detections):
            # Create non-overlapping boxes
            row = i // 10
            col = i % 10
            boxes[i] = [col * 100, row * 100, col * 100 + 50, row * 100 + 50]
            
            # Assign similar high scores
            base_score = 0.7 + i * 0.0001
            for j in range(10):
                scores[i, j] = base_score + np.random.rand(1).astype(np.float32) * 0.05
        
        codeflash_output = multiclass_nms_class_agnostic(boxes, scores, nms_thr=0.5, score_thr=0.5); result = codeflash_output # 2.24ms -> 1.98ms (12.7% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, git checkout codeflash/optimize-multiclass_nms_class_agnostic-mkow354n and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 22, 2026 03:25
@codeflash-ai codeflash-ai bot added the labels "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) Jan 22, 2026