study: correctness harness for Triton preproc vs F.interpolate#34
Open
aseembits93 wants to merge 3 commits into
Open
Conversation
…olate
Adds a set of isolated correctness harnesses used to characterize the
Triton preproc kernel against the PyTorch F.interpolate reference path.
The change also reorders the kernel's normalization to match PyTorch's
op order exactly (which tightens per-pixel drift to ½ ULP of fp32).
Harnesses under development/stream_interface/:
correctness_dump.py / correctness_check.py
InferencePipeline-based dumper + diff; toggles
RFDETR_USE_TRITON_PREPROC or RFDETR_TRITON_FULLPOSTPROC. Uses
greedy IoU-0.5 pairing so det-count shifts don't cascade into
false mismatches.
correctness_direct_{dump,check}.py
Drives RFDetrForInstanceSegmentationTRT directly via AutoModel,
choosing numpy input (adapter dispatches to cv2.resize) or torch
tensor input (adapter dispatches to F.interpolate).
correctness_preproc_only_{dump,check}.py
Isolates the preproc change: both runs do forward()+post_process()
identically; only the function that fills the (1,3,H,W) fp32
input tensor differs. Uses a fresh allocation each call so
neither path inherits the fast-path's in-place-into-TRT-input
buffer trick.
coco_preproc_only_{dump,check}.py
Preproc-only study over the full COCO val2017 (5000 images).
Emits both a per-image detection digest (for pairwise diff) and a
COCO-formatted detections JSON (for pycocotools bbox+segm mAP).
preproc_parity_probe.py
Pixel-level fp32 diff of the preprocessed tensor alone.
Kernel change (triton_preprocess.py):
Compute `(x / 255 - mean) / std` in the same op order as the
PyTorch reference instead of the fused `x * (1/(255*std)) +
(-mean/std)`. Mathematically equivalent but differs at the LSB;
fp16 engines can round those ULPs to different values. Per-pixel
max diff drops 9.5e-7 -> 4.8e-7 (single fp32 ULP).
Findings on COCO val2017 (rfdetr-seg-nano, 5000 images, 281k dets):
Detection diff (conf >= 0.4):
26038/26464 matched pairs (98.4%)
mean conf delta 5.9e-3, p95 2.3e-2, max 0.418
mask-md5 mismatches 91% (boundary-pixel flips)
COCO mAP (F.interpolate vs Triton preproc):
bbox AP 50:95 0.4763 -> 0.4757 (delta -0.0006)
segm AP 50:95 0.3959 -> 0.3957 (delta -0.0001)
Bit-exact parity with F.interpolate isn't reachable without matching
PyTorch's upsample_bilinear2d tap-accumulation order (cuDNN-version
dependent) or rebuilding the TRT engine in fp32. Current preproc is
eval-equivalent on COCO: aggregate mAP shift is in the 4th decimal,
smaller than typical fp16 TRT run-to-run noise.
…resize
The PR previously characterized Triton only against the F.interpolate
tensor-input path. The numpy-input production path routes through
cv2.resize, so this commit adds a cv2 reference mode to the harness so
we can quantify Triton vs the actual legacy production preproc as well.
preproc_parity_probe.py
Adds cv2_reference() (INTER_LINEAR -> BGR->RGB -> /255-mean/std) and
prints three diff reports: triton vs F.interpolate, triton vs
cv2.resize, F.interpolate vs cv2.resize.
coco_preproc_only_dump.py
New --preproc cv2 option that mirrors the numpy-input adapter path.
coco_preproc_only_check.py
--preprocs now takes a comma list (default ref,cv2,triton). Runs one
dump per variant, pair-diffs every non-triton variant vs triton, and
prints a wide side-by-side mAP table with deltas vs the F.interpolate
reference. Label threaded through diff_detection_dumps() so the two
diff sections are distinguishable.
Findings on COCO val2017 (rfdetr-seg-nano, 5000 images, conf 0.05):
Preproc tensor parity (512x512 probe):
triton vs F.interpolate max 4.8e-7 mean 4.7e-8 (~1/2 ULP fp32)
triton vs cv2.resize max 1.3e-2 mean 3.9e-3
F.interpolate vs cv2.resize max 1.3e-2 mean 3.9e-3
The cv2 gap is essentially the PyTorch-vs-OpenCV bilinear
disagreement, not a Triton kernel issue: Triton is an order of
magnitude closer to F.interpolate than F.interpolate is to cv2.
COCO mAP:
metric ref cv2 triton d.cv2 d.triton
bbox AP 50:95 0.4763 0.4753 0.4757 -0.0010 -0.0006
bbox AP 50 0.6609 0.6604 0.6610 -0.0005 +0.0001
segm AP 50:95 0.3959 0.3955 0.3957 -0.0004 -0.0001
segm AP 50 0.6141 0.6141 0.6141 -0.0001 0.0000
Detection diff at conf >= 0.4:
ref-vs-triton: 98.4% matched, mean d.conf 5.9e-3, mask-md5 91.0%
cv2-vs-triton: 97.8% matched, mean d.conf 7.5e-3, mask-md5 94.5%
Takeaway: Triton is closer to the F.interpolate reference than
cv2.resize is, on every aggregate metric. All three paths agree on
mAP in the 3rd-4th decimal, within fp16 TRT run-to-run noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cv2.resize is the production preproc path that Triton is replacing, so the side-by-side table should read "does triton regress cv2?" not "does triton regress F.interpolate?". Default --preprocs order is now cv2,ref,triton (cv2 first so it anchors the leftmost data column), and the baseline for delta columns is cv2 if present (falls back to the first non-triton variant otherwise). No new data collected; this is a presentation change on top of the existing three-variant COCO dumps. With cv2 as baseline: metric cv2 ref triton d.ref d.triton bbox AP 50:95 0.4753 0.4763 0.4757 +0.0010 +0.0004 bbox AP 50 0.6604 0.6609 0.6610 +0.0005 +0.0006 segm AP 50:95 0.3955 0.3959 0.3957 +0.0004 +0.0002 segm AP 50 0.6141 0.6141 0.6141 +0.0001 +0.0000 Triton ties or wins cv2 on every aggregate mAP metric except segm.75 (-0.0001), all within fp16 TRT run-to-run noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
development/stream_interface/for comparing the Triton preproc kernel against cv2.resize (the production numpy-input preproc path) andF.interpolate(the tensor-input path, also used at training time). Ends in a full COCO val2017 study covering all three variants.triton_preprocess.pyto match PyTorch's op order exactly ((x/255 - mean)/std). Mathematically equivalent to the prior fused form but differs at the LSB; halves per-pixel drift from 9.5e-7 → 4.8e-7 (single fp32 ULP) againstF.interpolate.Reference path:
cv2.resizeis the production preproc that Triton is replacing, so it is the baseline below.F.interpolateis reported alongside as a secondary reference because RF-DETR is trained against it — i.e. it's the bilinear kernel the model has actually seen.Harnesses added
correctness_{dump,check}.py—InferencePipeline-based driver that togglesRFDETR_USE_TRITON_PREPROCorRFDETR_TRITON_FULLPOSTPROC. Uses greedy IoU-0.5 pairing so det-count shifts don't cascade into false mismatches.correctness_direct_{dump,check}.py— drivesRFDetrForInstanceSegmentationTRTdirectly viaAutoModel; numpy input routes tocv2.resize, tensor input routes toF.interpolate.correctness_preproc_only_{dump,check}.py— isolates the preproc change: both runs callforward() + post_process()identically; only the function filling the(1,3,H,W)fp32 input tensor differs. Fresh allocation each call so neither inherits the fast-path's in-place-into-TRT-input trick.coco_preproc_only_{dump,check}.py— preproc-only study over all 5000 COCO val2017 images.--preproc {cv2,ref,triton}selects the variant; the driver runs one dump per variant (defaultcv2,ref,triton), pair-diffs each non-triton variant against triton, and prints a side-by-side mAP table with deltas vs the cv2 baseline.preproc_parity_probe.py— pixel-level fp32 diff of the preprocessed tensor against bothcv2.resizeandF.interpolate.Findings (COCO val2017, rfdetr-seg-nano, 5000 images, ~281k dets @ conf 0.05)
Preproc tensor parity (512×512 probe, single frame):
Detection diff (conf ≥ 0.4):
COCO mAP (cv2 = baseline):
(Δ = variant − cv2.)
Takeaway: Triton ties or beats the cv2 production baseline on every aggregate mAP metric except
segm AP 75(−0.0001) andbbox AP S(−0.0023, roughly the run-to-run fp16 TRT noise floor). Triton is also bit-identical (½ ULP) to theF.interpolatekernel the model was trained against, which cv2 has always been ~1.3e-2 off from. Bit-exact parity with cv2 is not reachable (different bilinear tap pattern, different rounding), but eval-equivalence on COCO is demonstrated.Test plan
coco_preproc_only_check.pywith all three preproc variants on COCO val2017 (5000 images each) — three dumps completed on Tesla T4, ~12 im/s per variant.preproc_parity_probe.pyreports ≤ 1 fp32 ULP between Triton andF.interpolate, and quantifies the cv2 gap.triton_preprocess.pyis acceptable (one fused mul-add → three separate ops; perf impact on the kernel hot path should be minimal).🤖 Generated with Claude Code