perf(rfdetr-seg): fused Triton post-processing behind RFDETR_TRITON_FULLPOSTPROC by aseembits93 · Pull Request #36 · aseembits93/inference

aseembits93 · 2026-05-04T19:34:08Z

What does this PR do?

This PR adds an opt-in Triton post-processing fast path for RF-DETR instance
segmentation behind RFDETR_TRITON_FULLPOSTPROC=true. It fuses the filter /
box decode work into a Triton kernel, keeps the existing torch reference path
as the fallback when eligibility checks fail, and updates the adapter to return
only live survivor rows.

It also fixes correctness issues found while validating the fast path:

- Slice fullpost outputs to n_survivors before constructing detections.
- Emit one row per (query, class) so the Triton path matches the reference
  top-k-over-(Q x C) semantics.
- Use F.interpolate(..., antialias=True) for mask upsampling so masks match
  the torch reference bit-for-bit.

Related Issue(s): None

Type of Change

Other: Performance improvement

Testing

I have tested this change locally
I have added/updated tests for this change

Test details:

- Benchmarked rfdetr-seg-nano TRT on vehicles_312px.mp4 over 538 frames
  (1 warmup + 4 measured runs per config): mean FPS improved from 114.93 to
  137.14 with RFDETR_TRITON_FULLPOSTPROC=true (+22.21 FPS, +19.3%).
- Ran temp/parity_same_logits.py on 300 images using the exact same
  (bboxes, logits, masks) tensors for both post-processors: 0
  count-mismatch images, 0 unmatched detections, mean |Δscore| = 1.348e-08,
  and mean box IoU 0.999843.
- Ran temp/detection_parity_full.py on 1500 coco/val2017 images with
  independent forward passes for RFDETR_TRITON_FULLPOSTPROC=true and
  false: 8037 / 8037 detections, 100% IoU>0.5 matches, and
  8037 / 8037 pixel-identical masks after switching mask upsampling to
  F.interpolate(..., antialias=True).
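For readers who want to see how numbers like "unmatched detections", "mean |Δscore|" and "mean box IoU" can be computed, here is a minimal plain-Python sketch. It is not the content of temp/parity_same_logits.py (that script is not shown here); the function names, the (class_id, score, box) tuple layout and the greedy class-aware matching are all assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two xyxy boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def parity_report(ref, test, iou_thresh=0.5):
    """Greedy, class-aware matching of reference vs. fast-path detections.

    ref / test: lists of (class_id, score, box) tuples (hypothetical layout).
    """
    unmatched_test = list(range(len(test)))
    ious, dscores = [], []
    for cls, score, box in ref:
        best, best_iou = None, iou_thresh
        for j in unmatched_test:
            t_cls, _, t_box = test[j]
            if t_cls != cls:  # class-aware: never match across classes
                continue
            iou = box_iou(box, t_box)
            if iou > best_iou:
                best, best_iou = j, iou
        if best is not None:
            unmatched_test.remove(best)
            ious.append(best_iou)
            dscores.append(abs(score - test[best][1]))
    n = len(ious)
    return {
        "count_mismatch": len(ref) != len(test),
        "unmatched": (len(ref) - n) + len(unmatched_test),
        "mean_abs_dscore": sum(dscores) / n if n else 0.0,
        "mean_iou": sum(ious) / n if n else 0.0,
    }
```

Matching per class matters here because one query can emit several classes with the same box; a class-agnostic matcher could pair a detection with the wrong sibling and report a spurious score difference.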

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code where necessary, particularly in
hard-to-understand areas
My changes generate no new warnings or errors
I have updated the documentation accordingly (if applicable)

Additional Context

The new path is opt-in and only activates when
RFDETR_TRITON_FULLPOSTPROC=true and the request satisfies the fullpost
eligibility checks (batch=1, no static crop, single inference size, class
remapping present). Otherwise the existing torch reference path is unchanged.
- Main implementation:
  inference_models/inference_models/models/rfdetr/triton_fullpostproc.py
- Adapter integration:
  inference/core/models/inference_models_adapters.py
- Workflow benchmark script:
  development/stream_interface/rfdetr_nano_seg_trt_workflow.py
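The gating described above can be sketched as a small predicate. This is only an illustration of the opt-in plus eligibility logic: the flag name comes from this PR, but the `Request` dataclass, its field names, and the case-insensitive parsing of the flag are hypothetical and are not taken from the adapter code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass
class Request:
    """Hypothetical summary of what the gate inspects (not the PR's type)."""
    batch_size: int
    has_static_crop: bool
    single_inference_size: bool
    class_remapping: Optional[Sequence[int]]


def fullpost_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: the flag is treated as true only for the string "true"
    # (case-insensitive); unset or anything else keeps the torch path.
    return env.get("RFDETR_TRITON_FULLPOSTPROC", "false").strip().lower() == "true"


def use_fullpost(req: Request, env: Mapping[str, str] = os.environ) -> bool:
    """True only when the flag is set AND every eligibility check passes."""
    return (
        fullpost_enabled(env)
        and req.batch_size == 1
        and not req.has_static_crop
        and req.single_inference_size
        and req.class_remapping is not None
    )
```

A caller would branch on `use_fullpost(...)` and fall through to the existing torch reference path whenever it returns False, which keeps the default behaviour unchanged.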

…ULLPOSTPROC

Adds two fused Triton kernels that replace the post-TRT chain for the common rfdetr-seg-nano path (batch=1, no static crop, STRETCH_TO resize, class remapping present):

- _rfdetr_fullpost_filter_kernel: sigmoid/argmax/class remap/conf threshold + cxcywh->xyxy + letterbox denormalize + clip + banker's rounding; atomic_add into a counter to reserve a compact output slot.
- _rfdetr_fullpost_mask_kernel_compact: bilinear upsample masks to orig size + threshold + uint8 emit, with GPU-side early exit against the counter so no CPU sync is needed between the two kernels.

Dispatch is gated on ``RFDETR_TRITON_FULLPOSTPROC=true``; callers keep the torch reference path when Triton is unavailable or the eligibility checks fail. The adapter reads a 4-byte counter via a pinned DtoH to learn ``n_survivors`` and then async-DtoHs the compact combined/mask slices.
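The box part of the filter step (cxcywh->xyxy, denormalize, clip, banker's rounding) can be illustrated in plain Python. This is a sketch of the arithmetic only, not the Triton kernel: `decode_box` is a hypothetical name, and scaling the normalized coordinates directly by the original image width and height is an assumption that fits a stretch-style resize.

```python
def decode_box(cx, cy, w, h, img_w, img_h):
    """Convert one normalized cxcywh box to integer xyxy pixel coordinates.

    Order follows the commit message: convert, denormalize, clip, round.
    Python's built-in round() rounds halves to the nearest even integer,
    i.e. banker's rounding.
    """
    x1 = (cx - 0.5 * w) * img_w
    y1 = (cy - 0.5 * h) * img_h
    x2 = (cx + 0.5 * w) * img_w
    y2 = (cy + 0.5 * h) * img_h

    def clip(v, hi):
        # Keep the coordinate inside the image bounds [0, hi].
        return min(max(v, 0.0), float(hi))

    return (
        round(clip(x1, img_w)),
        round(clip(y1, img_h)),
        round(clip(x2, img_w)),
        round(clip(y2, img_h)),
    )
```

For example, a centered box covering half of a 10x10 image has corners at 2.5 and 7.5, which round to 2 and 8 under half-to-even; round-half-up would give 3 and 8 instead, which is the kind of one-pixel drift that shows up in a box parity check.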

The non-RLE fullpost path was returning the full num_queries-row scratch buffer to InstanceDetections, exposing uninitialized rows past the survivor counter and leaving the conf column as int32 bits. Wait on done_event, slice combined and mask_bin to [:n_survivors], and reinterpret the conf column with .view(torch.float32), mirroring the RLE variant.

Adds temp/detection_parity_full.py to compare the fused path against the torch reference across coco/val2017.
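To make the two bugs concrete, the sketch below uses the standard `struct` module to show why a float32 score stored in an int32 buffer must be reinterpreted (not cast), and why the scratch buffer has to be sliced to the survivor count. It stands in for `.view(torch.float32)` without needing torch; the row layout (x1, y1, x2, y2, conf_bits, class_id) and the `live_rows` helper are hypothetical.

```python
import struct


def float_to_int32_bits(x: float) -> int:
    """Bit pattern of a float32, read back as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def int32_bits_to_float(bits: int) -> float:
    """Inverse reinterpretation: int32 bit pattern back to float32."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def live_rows(scratch, n_survivors, conf_col=4):
    """Keep only the first n_survivors rows and decode the conf column.

    Rows past n_survivors were never written by the kernel, so they are
    dropped rather than passed on as detections.
    """
    out = []
    for row in scratch[:n_survivors]:
        row = list(row)
        row[conf_col] = int32_bits_to_float(row[conf_col])
        out.append(row)
    return out
```

Without the reinterpretation, a confidence of 0.75 would surface as the integer 1061158912 (0x3F400000), and without the slice every unused scratch row would be reported as a detection.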

The filter kernel was per-query argmax, but the torch reference does top-k-flat over the (Q*C) sigmoid grid (num_select == num_queries), so a single query can contribute multiple detections, one per class that survives remap + threshold. Per-query argmax silently dropped the secondary classes.

Reshape the kernel grid to (num_queries, num_classes_total) so each (q, c) pair is processed independently: load one logit, remap, sigmoid, threshold, emit. Cap output at num_queries to mirror the reference's top-K cap; the host clamps n_survivors to combined.shape[0] because the atomic counter increments before the slot guard.

Validated on coco/val2017 (1500 images): detection counts match exactly, 8037/8037 vs 7995/8037 before. On the same-logits parity script, all 1663 matched detections agree to fp32 epsilon when the matcher is class-aware.
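The difference between the two emission schemes, and the reason the host has to clamp the counter, can be shown with a sequential plain-Python simulation. This is not the Triton kernel: class remapping is omitted, the strict `>` threshold comparison is an assumption, and the loop order stands in for whatever order GPU programs would reach the atomic counter.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def per_query_argmax(logits, threshold):
    """Old behaviour: at most one detection per query (its best class)."""
    dets = []
    for q, row in enumerate(logits):
        c = max(range(len(row)), key=row.__getitem__)
        score = sigmoid(row[c])
        if score > threshold:
            dets.append((q, c, score))
    return dets


def per_pair_emit(logits, threshold, max_out):
    """New behaviour: every (q, c) pair is tested independently.

    Returns (detections, raw_counter). The counter is bumped before the
    slot guard, as in the commit message, so it can exceed max_out.
    """
    slots = [None] * max_out
    counter = 0
    for q, row in enumerate(logits):
        for c, logit in enumerate(row):
            score = sigmoid(logit)
            if score > threshold:
                slot = counter      # value an atomic_add would return
                counter += 1        # increment happens unconditionally
                if slot < max_out:  # slot guard: drop overflow writes
                    slots[slot] = (q, c, score)
    n_survivors = min(counter, max_out)  # host-side clamp
    return slots[:n_survivors], counter
```

With logits `[[2.0, 1.5, -3.0], [-2.0, -1.0, -4.0]]` and a 0.5 threshold, the argmax version reports one detection while the per-pair version reports two for query 0, which is the kind of gap behind the 7995 vs 8037 count.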

The custom Triton mask kernel implemented bilinear+threshold with `antialias=False` semantics, but the reference path (`align_instance_segmentation_results`) calls `functional.resize(BILINEAR)`, which defaults to `antialias=True`. That produces a different fp32 reduction order on boundary pixels, so ~0.3% of masks (26/8037 across 1500 coco/val2017 images at conf=0.4) were not pixel-identical to the reference.

Drop the mask kernel and replace it with a host-side gather + `F.interpolate(bilinear, antialias=True, align_corners=False)` + `> 0` that matches the reference bit-for-bit. The same eligibility window guarantees no static crop and `size_after_pre_processing == inference_size`, so the new path can skip the canvas/static-crop branches.

Validated on coco/val2017 (1500 images): 8037/8037 pixel-identical masks (was 8011/8037). FPS unchanged within noise (filter kernel dominates the postproc cost; the mask path was always cuDNN-bound).

aseembits93 force-pushed the perf/trt-rfdetr-triton-fullpostproc branch from 9112bf3 to 5ff8c7a Compare May 15, 2026 17:02

aseembits93 added 2 commits May 15, 2026 10:03

moving to rle path

986dbdd

aseembits93 force-pushed the perf/trt-rfdetr-triton-fullpostproc branch from 5ff8c7a to 986dbdd Compare May 15, 2026 17:04

aseembits93 added 6 commits May 15, 2026 17:44

workflow script

29c7627

move env var to inference_models/configuration.py

1804cd3

refactoring to change var names

fe46c57

aseembits93 mentioned this pull request May 19, 2026

perf(rfdetr-seg): single-kernel Triton full post-processing #40

Open

11 tasks

aseembits93 wants to merge 8 commits into main from perf/trt-rfdetr-triton-fullpostproc
