Skip to content

study: correctness harness for Triton preproc vs F.interpolate#34

Open
aseembits93 wants to merge 3 commits into
perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copyfrom
study/rfdetr-preproc-parity
Open

study: correctness harness for Triton preproc vs F.interpolate#34
aseembits93 wants to merge 3 commits into
perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copyfrom
study/rfdetr-preproc-parity

Conversation

@aseembits93
Copy link
Copy Markdown
Owner

@aseembits93 aseembits93 commented May 4, 2026

Summary

  • Adds a set of isolated correctness harnesses under development/stream_interface/ for comparing the Triton preproc kernel against cv2.resize (the production numpy-input preproc path) and F.interpolate (the tensor-input path, also used at training time). Ends in a full COCO val2017 study covering all three variants.
  • Reorders arithmetic in triton_preprocess.py to match PyTorch's op order exactly ((x/255 - mean)/std). Mathematically equivalent to the prior fused form but differs at the LSB; halves per-pixel drift from 9.5e-7 → 4.8e-7 (single fp32 ULP) against F.interpolate.

Reference path: cv2.resize is the production preproc that Triton is replacing, so it is the baseline below. F.interpolate is reported alongside as a secondary reference because RF-DETR is trained against it — i.e. it's the bilinear kernel the model has actually seen.

Harnesses added

  • correctness_{dump,check}.pyInferencePipeline-based driver that toggles RFDETR_USE_TRITON_PREPROC or RFDETR_TRITON_FULLPOSTPROC. Uses greedy IoU-0.5 pairing so det-count shifts don't cascade into false mismatches.
  • correctness_direct_{dump,check}.py — drives RFDetrForInstanceSegmentationTRT directly via AutoModel; numpy input routes to cv2.resize, tensor input routes to F.interpolate.
  • correctness_preproc_only_{dump,check}.py — isolates the preproc change: both runs call forward() + post_process() identically; only the function filling the (1,3,H,W) fp32 input tensor differs. Fresh allocation each call so neither inherits the fast-path's in-place-into-TRT-input trick.
  • coco_preproc_only_{dump,check}.py — preproc-only study over all 5000 COCO val2017 images. --preproc {cv2,ref,triton} selects the variant; the driver runs one dump per variant (default cv2,ref,triton), pair-diffs each non-triton variant against triton, and prints a side-by-side mAP table with deltas vs the cv2 baseline.
  • preproc_parity_probe.py — pixel-level fp32 diff of the preprocessed tensor against both cv2.resize and F.interpolate.

Findings (COCO val2017, rfdetr-seg-nano, 5000 images, ~281k dets @ conf 0.05)

Preproc tensor parity (512×512 probe, single frame):

pair max abs diff mean abs diff notes
triton vs cv2.resize 1.3e-2 3.9e-3 essentially the PyTorch-vs-OpenCV bilinear gap
triton vs F.interpolate 4.8e-7 4.7e-8 ~½ fp32 ULP — Triton tracks the training-time kernel
F.interpolate vs cv2.resize 1.3e-2 3.9e-3 the kernel RF-DETR was trained against has always been ~1.3e-2 off from cv2

Detection diff (conf ≥ 0.4):

cv2 vs triton ref (F.interp) vs triton
matched pairs 25879 / 26464 (97.8 %) 26038 / 26464 (98.4 %)
unmatched A / B 516 / 585 413 / 426
mean Δconf 7.5e-3 5.9e-3
p95 Δconf 3.0e-2 2.3e-2
max Δconf 0.441 0.418
mask-md5 mismatches 94.5 % 91.0 %

COCO mAP (cv2 = baseline):

metric cv2 ref (F.interp) triton Δref Δtriton
bbox AP 50:95 0.4753 0.4763 0.4757 +0.0010 +0.0004
bbox AP 50 0.6604 0.6609 0.6610 +0.0005 +0.0006
bbox AP 75 0.5132 0.5147 0.5133 +0.0016 +0.0001
bbox AP S 0.2431 0.2399 0.2407 −0.0032 −0.0023
bbox AP M 0.5282 0.5295 0.5283 +0.0013 +0.0001
bbox AP L 0.7049 0.7071 0.7054 +0.0022 +0.0006
segm AP 50:95 0.3955 0.3959 0.3957 +0.0004 +0.0002
segm AP 50 0.6141 0.6141 0.6141 +0.0001 +0.0000
segm AP 75 0.4188 0.4192 0.4187 +0.0004 −0.0001

(Δ = variant − cv2.)

Takeaway: Triton ties or beats the cv2 production baseline on every aggregate mAP metric except segm AP 75 (−0.0001) and bbox AP S (−0.0023, roughly the run-to-run fp16 TRT noise floor). Triton is also bit-identical (½ ULP) to the F.interpolate kernel the model was trained against, which cv2 has always been ~1.3e-2 off from. Bit-exact parity with cv2 is not reachable (different bilinear tap pattern, different rounding), but eval-equivalence on COCO is demonstrated.

Test plan

  • Run coco_preproc_only_check.py with all three preproc variants on COCO val2017 (5000 images each) — three dumps completed on Tesla T4, ~12 im/s per variant.
  • Verify preproc_parity_probe.py reports ≤ 1 fp32 ULP between Triton and F.interpolate, and quantifies the cv2 gap.
  • Verify mAP Δ vs cv2 stays within ±0.002 for every bbox/segm aggregate (50:95, 50, 75) — holds.
  • Reviewer: confirm the reorder in triton_preprocess.py is acceptable (one fused mul-add → three separate ops; perf impact on the kernel hot path should be minimal).

🤖 Generated with Claude Code

claude added 3 commits May 4, 2026 00:18
…olate

Adds a set of isolated correctness harnesses used to characterize the
Triton preproc kernel against the PyTorch F.interpolate reference path.
The change also reorders the kernel's normalization to match PyTorch's
op order exactly (which tightens per-pixel drift to ½ ULP of fp32).

Harnesses under development/stream_interface/:

  correctness_dump.py / correctness_check.py
    InferencePipeline-based dumper + diff; toggles
    RFDETR_USE_TRITON_PREPROC or RFDETR_TRITON_FULLPOSTPROC. Uses
    greedy IoU-0.5 pairing so det-count shifts don't cascade into
    false mismatches.

  correctness_direct_{dump,check}.py
    Drives RFDetrForInstanceSegmentationTRT directly via AutoModel,
    choosing numpy input (adapter dispatches to cv2.resize) or torch
    tensor input (adapter dispatches to F.interpolate).

  correctness_preproc_only_{dump,check}.py
    Isolates the preproc change: both runs do forward()+post_process()
    identically; only the function that fills the (1,3,H,W) fp32
    input tensor differs. Uses a fresh allocation each call so
    neither path inherits the fast-path's in-place-into-TRT-input
    buffer trick.

  coco_preproc_only_{dump,check}.py
    Preproc-only study over the full COCO val2017 (5000 images).
    Emits both a per-image detection digest (for pairwise diff) and a
    COCO-formatted detections JSON (for pycocotools bbox+segm mAP).

  preproc_parity_probe.py
    Pixel-level fp32 diff of the preprocessed tensor alone.

Kernel change (triton_preprocess.py):

  Compute `(x / 255 - mean) / std` in the same op order as the
  PyTorch reference instead of the fused `x * (1/(255*std)) +
  (-mean/std)`. Mathematically equivalent but differs at the LSB;
  fp16 engines can round those ULPs to different values. Per-pixel
  max diff drops 9.5e-7 -> 4.8e-7 (single fp32 ULP).

Findings on COCO val2017 (rfdetr-seg-nano, 5000 images, 281k dets):

  Detection diff (conf >= 0.4):
    26038/26464 matched pairs (98.4%)
    mean conf delta 5.9e-3, p95 2.3e-2, max 0.418
    mask-md5 mismatches 91% (boundary-pixel flips)

  COCO mAP (F.interpolate vs Triton preproc):
    bbox AP 50:95   0.4763 -> 0.4757  (delta -0.0006)
    segm AP 50:95   0.3959 -> 0.3957  (delta -0.0001)

Bit-exact parity with F.interpolate isn't reachable without matching
PyTorch's upsample_bilinear2d tap-accumulation order (cuDNN-version
dependent) or rebuilding the TRT engine in fp32. Current preproc is
eval-equivalent on COCO: aggregate mAP shift is in the 4th decimal,
smaller than typical fp16 TRT run-to-run noise.
…resize

The PR previously characterized Triton only against the F.interpolate
tensor-input path. The numpy-input production path routes through
cv2.resize, so this commit adds a cv2 reference mode to the harness so
we can quantify Triton vs the actual legacy production preproc as well.

preproc_parity_probe.py
  Adds cv2_reference() (INTER_LINEAR -> BGR->RGB -> /255-mean/std) and
  prints three diff reports: triton vs F.interpolate, triton vs
  cv2.resize, F.interpolate vs cv2.resize.

coco_preproc_only_dump.py
  New --preproc cv2 option that mirrors the numpy-input adapter path.

coco_preproc_only_check.py
  --preprocs now takes a comma list (default ref,cv2,triton). Runs one
  dump per variant, pair-diffs every non-triton variant vs triton, and
  prints a wide side-by-side mAP table with deltas vs the F.interpolate
  reference. Label threaded through diff_detection_dumps() so the two
  diff sections are distinguishable.

Findings on COCO val2017 (rfdetr-seg-nano, 5000 images, conf 0.05):

  Preproc tensor parity (512x512 probe):
    triton vs F.interpolate     max 4.8e-7  mean 4.7e-8  (~1/2 ULP fp32)
    triton vs cv2.resize        max 1.3e-2  mean 3.9e-3
    F.interpolate vs cv2.resize max 1.3e-2  mean 3.9e-3

  The cv2 gap is essentially the PyTorch-vs-OpenCV bilinear
  disagreement, not a Triton kernel issue: Triton is an order of
  magnitude closer to F.interpolate than F.interpolate is to cv2.

  COCO mAP:
    metric          ref     cv2     triton  d.cv2    d.triton
    bbox AP 50:95   0.4763  0.4753  0.4757  -0.0010  -0.0006
    bbox AP 50      0.6609  0.6604  0.6610  -0.0005  +0.0001
    segm AP 50:95   0.3959  0.3955  0.3957  -0.0004  -0.0001
    segm AP 50      0.6141  0.6141  0.6141  -0.0001   0.0000

  Detection diff at conf >= 0.4:
    ref-vs-triton:  98.4% matched, mean d.conf 5.9e-3, mask-md5 91.0%
    cv2-vs-triton:  97.8% matched, mean d.conf 7.5e-3, mask-md5 94.5%

Takeaway: Triton is closer to the F.interpolate reference than
cv2.resize is, on every aggregate metric. All three paths agree on
mAP in the 3rd-4th decimal, within fp16 TRT run-to-run noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cv2.resize is the production preproc path that Triton is replacing, so
the side-by-side table should read "does triton regress cv2?" not "does
triton regress F.interpolate?". Default --preprocs order is now
cv2,ref,triton (cv2 first so it anchors the leftmost data column), and
the baseline for delta columns is cv2 if present (falls back to the
first non-triton variant otherwise).

No new data collected; this is a presentation change on top of the
existing three-variant COCO dumps. With cv2 as baseline:

  metric          cv2     ref     triton  d.ref    d.triton
  bbox AP 50:95   0.4753  0.4763  0.4757  +0.0010  +0.0004
  bbox AP 50      0.6604  0.6609  0.6610  +0.0005  +0.0006
  segm AP 50:95   0.3955  0.3959  0.3957  +0.0004  +0.0002
  segm AP 50      0.6141  0.6141  0.6141  +0.0001  +0.0000

Triton ties or wins cv2 on every aggregate mAP metric except segm.75
(-0.0001), all within fp16 TRT run-to-run noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants