perf(rfdetr-seg): fused Triton post-processing behind RFDETR_TRITON_FULLPOSTPROC #36

Open
aseembits93 wants to merge 8 commits into main from perf/trt-rfdetr-triton-fullpostproc

@aseembits93 aseembits93 commented May 4, 2026

What does this PR do?

This PR adds an opt-in Triton post-processing fast path for RF-DETR instance
segmentation behind RFDETR_TRITON_FULLPOSTPROC=true. It fuses the filter /
box decode work into a Triton kernel, keeps the existing torch reference path
as the fallback when eligibility checks fail, and updates the adapter to return
only live survivor rows.

It also fixes correctness issues found while validating the fast path:

  • Slice fullpost outputs to n_survivors before constructing detections.
  • Emit one row per (query, class) so the Triton path matches the reference
    top-k-over-(Q x C) semantics.
  • Use F.interpolate(..., antialias=True) for mask upsampling so masks match
    the torch reference bit-for-bit.

Related Issue(s): None

Type of Change

  • Other: Performance improvement

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Test details:

  • Benchmarked rfdetr-seg-nano TRT on vehicles_312px.mp4 over 538 frames
    (1 warmup + 4 measured runs per config): mean FPS improved from 114.93 to
    137.14 with RFDETR_TRITON_FULLPOSTPROC=true (+22.21 FPS, +19.3%).
  • Ran temp/parity_same_logits.py on 300 images using the exact same
    (bboxes, logits, masks) tensors for both post-processors: 0
    count-mismatch images, 0 unmatched detections, mean |Δscore| = 1.348e-08,
    and mean box IoU 0.999843.
  • Ran temp/detection_parity_full.py on 1500 coco/val2017 images with
    independent forward passes for RFDETR_TRITON_FULLPOSTPROC=true and
    false: 8037 / 8037 detections, 100% IoU>0.5 matches, and
    8037 / 8037 pixel-identical masks after switching mask upsampling to
    F.interpolate(..., antialias=True).
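
The box-parity figures above rest on matching detections from the two post-processors by IoU. A minimal sketch of such a check in plain Python follows. `box_iou` and `count_matches` are hypothetical helpers written for illustration; they are not taken from temp/parity_same_logits.py or temp/detection_parity_full.py, and the greedy class-aware matching strategy is an assumption.

```python
def box_iou(a, b):
    """IoU of two xyxy boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(ref, test, iou_thresh=0.5):
    """Greedy one-to-one matching; each detection is (class_id, box).

    A reference detection matches the unused test detection of the same
    class with the highest IoU above iou_thresh (assumed matching rule).
    """
    unused = list(range(len(test)))
    matched = 0
    for ref_cls, ref_box in ref:
        best, best_iou = None, iou_thresh
        for j in unused:
            cls, box = test[j]
            iou = box_iou(ref_box, box)
            if cls == ref_cls and iou > best_iou:
                best, best_iou = j, iou
        if best is not None:
            unused.remove(best)
            matched += 1
    return matched
```

Making the matcher class-aware matters once a single query can emit several classes with the same box, since a class-agnostic matcher could pair those rows with the wrong counterpart.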

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in
    hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Additional Context

  • The new path is opt-in and only activates when
    RFDETR_TRITON_FULLPOSTPROC=true and the request satisfies the fullpost
    eligibility checks (batch=1, no static crop, single inference size, class
    remapping present). Otherwise the existing torch reference path is unchanged.
  • Main implementation:
    inference_models/inference_models/models/rfdetr/triton_fullpostproc.py
  • Adapter integration:
    inference/core/models/inference_models_adapters.py
  • Workflow benchmark script:
    development/stream_interface/rfdetr_nano_seg_trt_workflow.py
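
The gating described in the first bullet can be sketched as follows. This is an illustration only: `fullpost_enabled`, `fullpost_eligible`, `choose_postproc`, and their parameters are hypothetical names, the case-insensitive parsing of the flag is an assumption, and the Triton-availability check is omitted.

```python
import os


def fullpost_enabled(environ=None):
    """True when the opt-in flag is set (accepted spelling is an assumption)."""
    env = os.environ if environ is None else environ
    return env.get("RFDETR_TRITON_FULLPOSTPROC", "false").strip().lower() == "true"


def fullpost_eligible(batch_size, has_static_crop, inference_sizes, class_remap):
    """Hypothetical mirror of the eligibility checks listed above."""
    return (
        batch_size == 1                      # batch=1
        and not has_static_crop              # no static crop
        and len(set(inference_sizes)) == 1   # single inference size
        and class_remap is not None          # class remapping present
    )


def choose_postproc(environ, **request):
    # Fall back to the torch reference path unless every condition holds.
    if fullpost_enabled(environ) and fullpost_eligible(**request):
        return "triton_fullpost"
    return "torch_reference"
```

Keeping the flag check and the per-request checks separate means a request that fails eligibility behaves exactly as if the flag were unset.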

@aseembits93 aseembits93 force-pushed the perf/trt-rfdetr-triton-fullpostproc branch from 9112bf3 to 5ff8c7a Compare May 15, 2026 17:02
…ULLPOSTPROC

Adds two fused Triton kernels that replace the post-TRT chain for the
common rfdetr-seg-nano path (batch=1, no static crop, STRETCH_TO resize,
class remapping present):

  - _rfdetr_fullpost_filter_kernel: sigmoid/argmax/class remap/conf threshold
    + cxcywh->xyxy + letterbox denormalize + clip + banker's rounding;
    atomic_add into a counter to reserve a compact output slot.
  - _rfdetr_fullpost_mask_kernel_compact: bilinear upsample masks to orig
    size + threshold + uint8 emit, with GPU-side early exit against the
    counter so no CPU sync is needed between the two kernels.

Dispatch is gated on ``RFDETR_TRITON_FULLPOSTPROC=true``; callers keep the
torch reference path when Triton is unavailable or the eligibility checks
fail. The adapter reads a 4-byte counter via a pinned DtoH to learn
``n_survivors`` and then async-DtoHs the compact combined/mask slices.
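
For readers unfamiliar with the box-decode steps named in the filter kernel, here is a plain-Python sketch of the per-box arithmetic. It is not the Triton kernel: `decode_box` is a hypothetical helper, it assumes a stretch resize so each axis scales independently, and clipping to `[0, orig_w]` and `[0, orig_h]` is an assumed convention.

```python
def decode_box(cx, cy, w, h, orig_w, orig_h):
    """Normalized cxcywh -> integer xyxy in original-image pixels."""
    # cxcywh -> xyxy in normalized [0, 1] coordinates
    x1, y1 = cx - w / 2, cy - h / 2
    x2, y2 = cx + w / 2, cy + h / 2
    # Denormalize to original-image pixels (stretch resize assumed)
    x1, x2 = x1 * orig_w, x2 * orig_w
    y1, y2 = y1 * orig_h, y2 * orig_h
    # Clip to the image bounds (assumed to be inclusive of orig_w / orig_h)
    x1, x2 = min(max(x1, 0.0), orig_w), min(max(x2, 0.0), orig_w)
    y1, y2 = min(max(y1, 0.0), orig_h), min(max(y2, 0.0), orig_h)
    # Python's built-in round() rounds halves to even (banker's rounding)
    return tuple(round(v) for v in (x1, y1, x2, y2))
```

The half-to-even behaviour is visible on exact halves: 2.5 rounds down to 2 while 7.5 rounds up to 8.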
@aseembits93 aseembits93 force-pushed the perf/trt-rfdetr-triton-fullpostproc branch from 5ff8c7a to 986dbdd Compare May 15, 2026 17:04
The non-RLE fullpost path was returning the full num_queries-row scratch
buffer to InstanceDetections, exposing uninitialized rows past the
survivor counter and leaving the conf column as int32 bits. Wait on
done_event, slice combined and mask_bin to [:n_survivors], and
reinterpret the conf column with .view(torch.float32) — mirroring the
RLE variant.

Adds temp/detection_parity_full.py to study the fused path against the
torch reference across coco/val2017.
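
The fix above has two parts: drop rows past the survivor counter, and reinterpret the confidence column's bits as float32 rather than converting the integer value. A stdlib sketch of the same idea follows; `.view(torch.float32)` does the equivalent reinterpretation on a tensor. The row layout `[x1, y1, x2, y2, class_id, conf_bits]` and the helper names are assumptions made for illustration.

```python
import struct


def bits_to_float32(bits):
    """Reinterpret a signed int32 bit pattern as float32 (no numeric conversion)."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def live_rows(combined, n_survivors):
    """Keep only rows written by the kernel and decode the confidence column.

    `combined` is a hypothetical scratch buffer: a list of int32 rows laid
    out as [x1, y1, x2, y2, class_id, conf_bits].
    """
    rows = combined[:n_survivors]  # rows past the counter are uninitialized
    return [row[:5] + [bits_to_float32(row[5])] for row in rows]
```

For example, the int32 value 1056964608 (0x3F000000) decodes to a confidence of 0.5, whereas reading it as an integer would give a meaningless score.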

The filter kernel took a per-query argmax, but the torch reference takes a
flat top-k over the (Q*C) sigmoid grid (num_select == num_queries), so a
single query can contribute multiple detections, one per class that
survives remap + threshold. Per-query argmax silently dropped the
secondary classes.

Reshape the kernel grid to (num_queries, num_classes_total) so each
(q, c) pair is processed independently: load one logit, remap, sigmoid,
threshold, emit. Cap output at num_queries to mirror the reference's
top-K cap; host clamps n_survivors to combined.shape[0] since the
atomic counter increments before the slot guard.

Validated on coco/val2017 (1500 images): detection counts now match exactly
(8037/8037, up from 7995/8037 before this change), and on the same-logits
parity script all 1663 matched detections agree to fp32 epsilon when the
matcher is class-aware.
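
The difference between the two emission rules, and the reason the host clamps the counter, can be shown in a sequential plain-Python sketch. This stands in for the Triton grid and is not the kernel itself: the function names, the dict-based `class_remap`, and the strict `>` threshold comparison are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def emit_per_pair(logits, class_remap, conf_thresh, capacity):
    """Emit one row per (query, class) pair that survives remap + threshold.

    logits: list of per-query lists of raw class logits.
    class_remap: dict from model class index to output class id; classes
                 missing from the dict are dropped (assumed convention).
    Returns (rows, counter); counter may exceed capacity, as on the GPU.
    """
    rows, counter = [], 0
    for q, per_class in enumerate(logits):
        for c, logit in enumerate(per_class):
            if c not in class_remap:
                continue
            score = sigmoid(logit)
            if score <= conf_thresh:
                continue
            slot = counter      # stands in for atomic_add(counter, 1)
            counter += 1        # the counter increments before the slot guard
            if slot < capacity:
                rows.append((q, class_remap[c], score))
    return rows, counter


def emit_argmax(logits, class_remap, conf_thresh):
    """The earlier per-query argmax behaviour, for comparison."""
    rows = []
    for q, per_class in enumerate(logits):
        c = max(range(len(per_class)), key=per_class.__getitem__)
        score = sigmoid(per_class[c])
        if c in class_remap and score > conf_thresh:
            rows.append((q, class_remap[c], score))
    return rows
```

With logits `[[2.0, 1.0, -3.0]]` and a threshold of 0.4, the per-pair rule emits two rows for query 0 while the argmax rule emits one. Because the counter can overshoot the buffer, the host would take `n_survivors = min(counter, capacity)` before slicing.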

The custom Triton mask kernel implemented bilinear+threshold with
`antialias=False` semantics, but the reference path
(`align_instance_segmentation_results`) calls
`functional.resize(BILINEAR)` which defaults to `antialias=True`. That
produces a different fp32 reduction order on boundary pixels, which flipped
pixels in ~0.3 % of masks (26/8037 across 1500 coco/val2017 images at
conf=0.4).

Drop the mask kernel and replace it with a host-side gather +
`F.interpolate(bilinear, antialias=True, align_corners=False)` + `> 0`
that matches the reference bit-for-bit. The same eligibility window
guarantees no static crop and `size_after_pre_processing == inference_size`,
so the new path can skip the canvas/static-crop branches.

Validated on coco/val2017 (1500 images): 8037/8037 pixel-identical masks
(was 8011/8037). FPS unchanged within noise (filter kernel dominates the
postproc cost; the mask path was always cuDNN-bound).