
test(rfdetr): expand Triton preprocess parity and validation coverage#33

Open
aseembits93 wants to merge 2 commits into perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copy from test/rfdetr-triton-preprocess-parity

Conversation

@aseembits93
Owner

Summary

Expands `test_triton_preprocess.py` from 3 tests (torch-reference parity only) to 13. The 10 new tests fall into four groups:

  • Input validation (5 tests): the kernel raises `ValueError` for a CPU tensor, a non-uint8 dtype, a CHW (instead of HWC) shape, and an `out` buffer with a mismatched shape or dtype.
  • Buffer reuse (1 test): the kernel writes into a caller-provided `out` tensor without reallocating. The real adapter in `rfdetr_instance_segmentation_trt.py` relies on this to avoid per-frame allocations.
  • End-to-end parity against `pre_process_network_input` STRETCH_TO (3 tests): the same BGR frame goes through both paths and the outputs are compared. The reference path is the non-Triton wrapper with `resize_mode=STRETCH_TO`, `color_mode=RGB`, `normalization=(means, stds)`; the other path is the Triton kernel. The tolerance is `2 / 255 / min(std)`, i.e. 2 uint8 LSBs expressed in normalized units, because cv2's INTER_LINEAR uses 5-bit fixed-point coefficients that diverge by up to ~1.5 LSBs from Triton's fp32 pixel-center bilinear on large downscales.
  • Solid-color analytic parity (1 test): every output pixel must equal the exact analytic normalization of the input color (no interpolation ambiguity), with atol `1e-5`.
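The tolerance and the solid-color expectation above can be sketched in a few lines of NumPy. This is an illustration only: the `MEANS` and `STDS` values are assumed ImageNet-style constants, and the helper names `parity_atol` and `solid_color_expected` are hypothetical, not taken from the test file.

```python
import numpy as np

# Assumed ImageNet-style constants in RGB order (hypothetical, not from this PR).
MEANS = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STDS = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def parity_atol(stds, lsb=2):
    """Absolute tolerance, in normalized units, equivalent to `lsb` uint8 steps.

    One uint8 step is 1/255 before normalization; dividing by the smallest
    std gives the largest step any channel can show after normalization.
    """
    return lsb / 255.0 / float(np.min(stds))


def solid_color_expected(bgr, height, width):
    """Analytic CHW output for a uniform BGR frame: (x / 255 - mean) / std."""
    rgb = np.array(bgr[::-1], dtype=np.float32)  # BGR -> RGB
    per_channel = (rgb / 255.0 - MEANS) / STDS
    # Every pixel of a solid-color image resizes to the same value,
    # so the expected tensor is just the per-channel value broadcast.
    return np.broadcast_to(per_channel[:, None, None], (3, height, width)).copy()
```

A test built this way would compare the kernel output against `solid_color_expected(...)` with `atol=1e-5`, and use `parity_atol(STDS)` only for the interpolated cases.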

This brings the file's total to 13 tests; all pass on a T4 GPU.

Test plan

  • `pytest tests/unit_tests/models/rfdetr/test_triton_preprocess.py` — 13/13 pass
  • `pytest tests/unit_tests/models/rfdetr/test_pre_processing.py` — existing 16/16 still pass
  • Module-level skip keeps CPU CI green: `pytest.importorskip("triton")` + `cv2` + `torch.cuda.is_available()` gates
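The gating logic in the last item can be approximated with the standard library alone. The sketch below is not the code in the test module, which uses `pytest.importorskip`; `preprocess_skip_reason` is a hypothetical helper that only shows the order of the checks.

```python
import importlib.util


def preprocess_skip_reason(required=("triton", "cv2", "torch")):
    """Return why the Triton tests should be skipped, or None if they can run."""
    # find_spec locates a top-level module without importing it.
    missing = [name for name in required if importlib.util.find_spec(name) is None]
    if missing:
        return "missing modules: " + ", ".join(missing)
    if "torch" in required:
        import torch  # imported lazily, only once we know it is installed

        if not torch.cuda.is_available():
            return "CUDA device not available"
    return None


# In a test module the result would typically feed a module-level skip:
#   reason = preprocess_skip_reason()
#   if reason:
#       pytest.skip(reason, allow_module_level=True)
```

Checking for the modules before touching CUDA keeps collection cheap on CPU-only CI, where the whole file is skipped rather than failing at import time.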

Base branch

Targets `perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copy` (#31) since the tests import `triton_preprocess_rfdetr_stretch`, which only exists on that branch.

🤖 Generated with Claude Code

aseembits93 and others added 2 commits May 3, 2026 21:38
Adds four groups of tests to test_triton_preprocess.py:

  - Input validation: rejects CPU tensors, wrong dtype, wrong shape,
    mismatched out buffer shape/dtype.
  - Buffer reuse: verifies the kernel writes into a provided out tensor
    without reallocating (the real adapter relies on this).
  - End-to-end parity against pre_process_network_input STRETCH_TO: the
    Triton fast path must match the non-Triton wrapper the model calls
    when RFDETR_USE_TRITON_PREPROC is off. Tolerance (2-LSB / min(std))
    covers cv2 5-bit fixed-point vs Triton fp32 bilinear divergence.
  - Solid-color analytic parity: every output pixel exactly equals the
    per-channel (x/255 - mean)/std on a uniform input.

13 tests total in the file (was 3), all passing on T4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…H_FOR_PREPROCESSING path

Adds tests/inference/unit_tests/models/rfdetr/test_triton_preprocess_vs_pytorch.py
comparing triton_preprocess_rfdetr (the kernel called from _try_triton_preprocess
in inference/models/rfdetr/rfdetr.py) against the USE_PYTORCH_FOR_PREPROCESSING
code path.

Findings (each pinned by a dedicated test):

1. The production pytorch path normalizes BGR-ordered channels using
   RGB-ordered means/stds, then swaps BGR->RGB at the end. Net: output
   channel 0 (R) gets R-pixel data normalized with B's mean/std, and
   vice versa for channel 2. Only G is self-consistent.
   See test_documented_pytorch_path_channel_mixup.

2. The production pytorch path pads the letterbox region with raw 114.0
   in the already-normalized tensor, instead of the per-channel normalized
   grey the kernel emits.
   See test_documented_pytorch_path_pad_region.

3. When (target_dim - scaled_dim) is odd, the pytorch path uses asymmetric
   integer padding (top = floor(pad / 2), bottom = the remainder) while the
   kernel keeps pad_y as a float in its inverse bilinear mapping. This
   introduces a half-pixel vertical drift between the two outputs.
   See test_letterbox_half_pixel_case_diverges_predictably.
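Findings 1 and 2 can be reproduced on a single pixel with NumPy. This is a sketch under assumptions: the mean/std values are assumed ImageNet-style constants, the function names are hypothetical, and the "normalized grey" is taken to be (114 / 255 - mean) / std, which the findings imply but do not spell out.

```python
import numpy as np

# Assumed ImageNet-style constants in RGB order (hypothetical, not from this PR).
MEANS_RGB = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STDS_RGB = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_then_swap(bgr_pixel):
    """Order described in finding 1: RGB stats applied to BGR data, swap last."""
    x = np.asarray(bgr_pixel, dtype=np.float32) / 255.0
    y = (x - MEANS_RGB) / STDS_RGB  # channel 0 is B but gets R's mean/std
    return y[::-1]                  # BGR -> RGB after normalization


def swap_then_normalize(bgr_pixel):
    """Reference order: convert to RGB first, then apply the RGB stats."""
    x = np.asarray(bgr_pixel, dtype=np.float32)[::-1] / 255.0
    return (x - MEANS_RGB) / STDS_RGB


def normalized_pad(value=114.0):
    """Per-channel normalized grey, as opposed to raw 114.0 in normalized space."""
    return (value / 255.0 - MEANS_RGB) / STDS_RGB
```

For any pixel, the two orderings agree on G and disagree on R and B whenever the R and B statistics differ. `normalized_pad()` lands within one unit of zero on every channel, which is far from the raw 114.0 described in finding 2.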

Content-region parity tests use source shapes where (target-scaled) is
even on both axes to sidestep the half-pixel drift; the kernel matches the
"what the pytorch path was trying to express" reference to ~1 LSB.
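Choosing source shapes with an even padding difference comes down to simple arithmetic, sketched below. The rounding of the scaled size and the 640x640 target in the usage note are assumptions for illustration; `letterbox_padding` is a hypothetical helper, not a function from the repository.

```python
def letterbox_padding(src_h, src_w, target_h, target_w):
    """Compare integer padding with the float padding for a letterbox resize."""
    scale = min(target_h / src_h, target_w / src_w)
    # Assumption: the scaled size is rounded to the nearest integer.
    scaled_h, scaled_w = round(src_h * scale), round(src_w * scale)
    pad_h, pad_w = target_h - scaled_h, target_w - scaled_w
    top, left = pad_h // 2, pad_w // 2
    return {
        # (top, bottom, left, right): floor on one side, remainder on the other
        "int_pad": (top, pad_h - top, left, pad_w - left),
        # what a kernel keeping the offset as a float would use
        "float_pad": (pad_h / 2, pad_w / 2),
        # True when both paths agree on where the content starts
        "even": pad_h % 2 == 0 and pad_w % 2 == 0,
    }
```

For example, a 480x640 source letterboxed to 640x640 pads 80 rows on each side, so both paths agree. A 481x640 source leaves 159 rows to split: the integer path uses 79 and 80 while the float offset is 79.5, which is the half-pixel case from finding 3.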

Also pins kernel determinism and the solid-color analytic case.

11 new tests, all passing on T4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>