study: correctness harness for Triton preproc vs F.interpolate by aseembits93 · Pull Request #34 · aseembits93/inference

aseembits93 · 2026-05-04T00:21:10Z

Summary

Adds a set of isolated correctness harnesses under development/stream_interface/ for comparing the Triton preproc kernel against cv2.resize (the production numpy-input preproc path) and F.interpolate (the tensor-input path, also used at training time). Ends in a full COCO val2017 study covering all three variants.
Reorders arithmetic in triton_preprocess.py to match PyTorch's op order exactly ((x/255 - mean)/std). Mathematically equivalent to the prior fused form but differs at the LSB; halves per-pixel drift from 9.5e-7 → 4.8e-7 (single fp32 ULP) against F.interpolate.

Reference path: cv2.resize is the production preproc that Triton is replacing, so it is the baseline below. F.interpolate is reported alongside as a secondary reference because RF-DETR is trained against it — i.e. it's the bilinear kernel the model has actually seen.

Harnesses added

correctness_{dump,check}.py — InferencePipeline-based driver that toggles RFDETR_USE_TRITON_PREPROC or RFDETR_TRITON_FULLPOSTPROC. Uses greedy IoU-0.5 pairing so det-count shifts don't cascade into false mismatches.
correctness_direct_{dump,check}.py — drives RFDetrForInstanceSegmentationTRT directly via AutoModel; numpy input routes to cv2.resize, tensor input routes to F.interpolate.
correctness_preproc_only_{dump,check}.py — isolates the preproc change: both runs call forward() + post_process() identically; only the function filling the (1,3,H,W) fp32 input tensor differs. Fresh allocation each call so neither inherits the fast-path's in-place-into-TRT-input trick.
coco_preproc_only_{dump,check}.py — preproc-only study over all 5000 COCO val2017 images. --preproc {cv2,ref,triton} selects the variant; the driver runs one dump per variant (default cv2,ref,triton), pair-diffs each non-triton variant against triton, and prints a side-by-side mAP table with deltas vs the cv2 baseline.
preproc_parity_probe.py — pixel-level fp32 diff of the preprocessed tensor against both cv2.resize and F.interpolate.

Findings (COCO val2017, rfdetr-seg-nano, 5000 images, ~281k dets @ conf 0.05)

Preproc tensor parity (512×512 probe, single frame):

pair	max abs diff	mean abs diff	notes
triton vs cv2.resize	1.3e-2	3.9e-3	essentially the PyTorch-vs-OpenCV bilinear gap
triton vs F.interpolate	4.8e-7	4.7e-8	~½ fp32 ULP — Triton tracks the training-time kernel
F.interpolate vs cv2.resize	1.3e-2	3.9e-3	the kernel RF-DETR was trained against has always been ~1.3e-2 off from cv2

Detection diff (conf ≥ 0.4):

	cv2 vs triton	ref (F.interp) vs triton
matched pairs	25879 / 26464 (97.8 %)	26038 / 26464 (98.4 %)
unmatched A / B	516 / 585	413 / 426
mean Δconf	7.5e-3	5.9e-3
p95 Δconf	3.0e-2	2.3e-2
max Δconf	0.441	0.418
mask-md5 mismatches	94.5 %	91.0 %

COCO mAP (cv2 = baseline):

metric	cv2	ref (F.interp)	triton	Δref	Δtriton
bbox AP 50:95	0.4753	0.4763	0.4757	+0.0010	+0.0004
bbox AP 50	0.6604	0.6609	0.6610	+0.0005	+0.0006
bbox AP 75	0.5132	0.5147	0.5133	+0.0016	+0.0001
bbox AP S	0.2431	0.2399	0.2407	−0.0032	−0.0023
bbox AP M	0.5282	0.5295	0.5283	+0.0013	+0.0001
bbox AP L	0.7049	0.7071	0.7054	+0.0022	+0.0006
segm AP 50:95	0.3955	0.3959	0.3957	+0.0004	+0.0002
segm AP 50	0.6141	0.6141	0.6141	+0.0001	+0.0000
segm AP 75	0.4188	0.4192	0.4187	+0.0004	−0.0001

(Δ = variant − cv2.)

Takeaway: Triton ties or beats the cv2 production baseline on every aggregate mAP metric except segm AP 75 (−0.0001) and bbox AP S (−0.0023, roughly the run-to-run fp16 TRT noise floor). Triton is also bit-identical (½ ULP) to the F.interpolate kernel the model was trained against, which cv2 has always been ~1.3e-2 off from. Bit-exact parity with cv2 is not reachable (different bilinear tap pattern, different rounding), but eval-equivalence on COCO is demonstrated.

Test plan

Run coco_preproc_only_check.py with all three preproc variants on COCO val2017 (5000 images each) — three dumps completed on Tesla T4, ~12 im/s per variant.
Verify preproc_parity_probe.py reports ≤ 1 fp32 ULP between Triton and F.interpolate, and quantifies the cv2 gap.
Verify mAP Δ vs cv2 stays within ±0.002 for every bbox/segm aggregate (50:95, 50, 75) — holds.
Reviewer: confirm the reorder in triton_preprocess.py is acceptable (one fused mul-add → three separate ops; perf impact on the kernel hot path should be minimal).

🤖 Generated with Claude Code

…olate Adds a set of isolated correctness harnesses used to characterize the Triton preproc kernel against the PyTorch F.interpolate reference path. The change also reorders the kernel's normalization to match PyTorch's op order exactly (which tightens per-pixel drift to ½ ULP of fp32). Harnesses under development/stream_interface/: correctness_dump.py / correctness_check.py InferencePipeline-based dumper + diff; toggles RFDETR_USE_TRITON_PREPROC or RFDETR_TRITON_FULLPOSTPROC. Uses greedy IoU-0.5 pairing so det-count shifts don't cascade into false mismatches. correctness_direct_{dump,check}.py Drives RFDetrForInstanceSegmentationTRT directly via AutoModel, choosing numpy input (adapter dispatches to cv2.resize) or torch tensor input (adapter dispatches to F.interpolate). correctness_preproc_only_{dump,check}.py Isolates the preproc change: both runs do forward()+post_process() identically; only the function that fills the (1,3,H,W) fp32 input tensor differs. Uses a fresh allocation each call so neither path inherits the fast-path's in-place-into-TRT-input buffer trick. coco_preproc_only_{dump,check}.py Preproc-only study over the full COCO val2017 (5000 images). Emits both a per-image detection digest (for pairwise diff) and a COCO-formatted detections JSON (for pycocotools bbox+segm mAP). preproc_parity_probe.py Pixel-level fp32 diff of the preprocessed tensor alone. Kernel change (triton_preprocess.py): Compute `(x / 255 - mean) / std` in the same op order as the PyTorch reference instead of the fused `x * (1/(255*std)) + (-mean/std)`. Mathematically equivalent but differs at the LSB; fp16 engines can round those ULPs to different values. Per-pixel max diff drops 9.5e-7 -> 4.8e-7 (single fp32 ULP). Findings on COCO val2017 (rfdetr-seg-nano, 5000 images, 281k dets): Detection diff (conf >= 0.4): 26038/26464 matched pairs (98.4%) mean conf delta 5.9e-3, p95 2.3e-2, max 0.418 mask-md5 mismatches 91% (boundary-pixel flips) COCO mAP (F.interpolate vs Triton preproc): bbox AP 50:95 0.4763 -> 0.4757 (delta -0.0006) segm AP 50:95 0.3959 -> 0.3957 (delta -0.0001) Bit-exact parity with F.interpolate isn't reachable without matching PyTorch's upsample_bilinear2d tap-accumulation order (cuDNN-version dependent) or rebuilding the TRT engine in fp32. Current preproc is eval-equivalent on COCO: aggregate mAP shift is in the 4th decimal, smaller than typical fp16 TRT run-to-run noise.

…resize The PR previously characterized Triton only against the F.interpolate tensor-input path. The numpy-input production path routes through cv2.resize, so this commit adds a cv2 reference mode to the harness so we can quantify Triton vs the actual legacy production preproc as well. preproc_parity_probe.py Adds cv2_reference() (INTER_LINEAR -> BGR->RGB -> /255-mean/std) and prints three diff reports: triton vs F.interpolate, triton vs cv2.resize, F.interpolate vs cv2.resize. coco_preproc_only_dump.py New --preproc cv2 option that mirrors the numpy-input adapter path. coco_preproc_only_check.py --preprocs now takes a comma list (default ref,cv2,triton). Runs one dump per variant, pair-diffs every non-triton variant vs triton, and prints a wide side-by-side mAP table with deltas vs the F.interpolate reference. Label threaded through diff_detection_dumps() so the two diff sections are distinguishable. Findings on COCO val2017 (rfdetr-seg-nano, 5000 images, conf 0.05): Preproc tensor parity (512x512 probe): triton vs F.interpolate max 4.8e-7 mean 4.7e-8 (~1/2 ULP fp32) triton vs cv2.resize max 1.3e-2 mean 3.9e-3 F.interpolate vs cv2.resize max 1.3e-2 mean 3.9e-3 The cv2 gap is essentially the PyTorch-vs-OpenCV bilinear disagreement, not a Triton kernel issue: Triton is an order of magnitude closer to F.interpolate than F.interpolate is to cv2. COCO mAP: metric ref cv2 triton d.cv2 d.triton bbox AP 50:95 0.4763 0.4753 0.4757 -0.0010 -0.0006 bbox AP 50 0.6609 0.6604 0.6610 -0.0005 +0.0001 segm AP 50:95 0.3959 0.3955 0.3957 -0.0004 -0.0001 segm AP 50 0.6141 0.6141 0.6141 -0.0001 0.0000 Detection diff at conf >= 0.4: ref-vs-triton: 98.4% matched, mean d.conf 5.9e-3, mask-md5 91.0% cv2-vs-triton: 97.8% matched, mean d.conf 7.5e-3, mask-md5 94.5% Takeaway: Triton is closer to the F.interpolate reference than cv2.resize is, on every aggregate metric. All three paths agree on mAP in the 3rd-4th decimal, within fp16 TRT run-to-run noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cv2.resize is the production preproc path that Triton is replacing, so the side-by-side table should read "does triton regress cv2?" not "does triton regress F.interpolate?". Default --preprocs order is now cv2,ref,triton (cv2 first so it anchors the leftmost data column), and the baseline for delta columns is cv2 if present (falls back to the first non-triton variant otherwise). No new data collected; this is a presentation change on top of the existing three-variant COCO dumps. With cv2 as baseline: metric cv2 ref triton d.ref d.triton bbox AP 50:95 0.4753 0.4763 0.4757 +0.0010 +0.0004 bbox AP 50 0.6604 0.6609 0.6610 +0.0005 +0.0006 segm AP 50:95 0.3955 0.3959 0.3957 +0.0004 +0.0002 segm AP 50 0.6141 0.6141 0.6141 +0.0001 +0.0000 Triton ties or wins cv2 on every aggregate mAP metric except segm.75 (-0.0001), all within fp16 TRT run-to-run noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude added 3 commits May 4, 2026 00:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

study: correctness harness for Triton preproc vs F.interpolate#34

study: correctness harness for Triton preproc vs F.interpolate#34
aseembits93 wants to merge 3 commits into
perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copyfrom
study/rfdetr-preproc-parity

aseembits93 commented May 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

aseembits93 commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Harnesses added

Findings (COCO val2017, rfdetr-seg-nano, 5000 images, ~281k dets @ conf 0.05)

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aseembits93 commented May 4, 2026 •

edited

Loading