perf(rfdetr-seg): fused Triton post-processing behind RFDETR_TRITON_FULLPOSTPROC #36
Open
aseembits93 wants to merge 8 commits into
9112bf3 to
5ff8c7a
Compare
…ULLPOSTPROC
Adds two fused Triton kernels that replace the post-TRT chain for the
common rfdetr-seg-nano path (batch=1, no static crop, STRETCH_TO resize,
class remapping present):
- _rfdetr_fullpost_filter_kernel: sigmoid/argmax/class remap/conf threshold
+ cxcywh->xyxy + letterbox denormalize + clip + banker's rounding;
atomic_add into a counter to reserve a compact output slot.
- _rfdetr_fullpost_mask_kernel_compact: bilinear upsample masks to orig
size + threshold + uint8 emit, with GPU-side early exit against the
counter so no CPU sync is needed between the two kernels.
Dispatch is gated on ``RFDETR_TRITON_FULLPOSTPROC=true``; callers keep the
torch reference path when Triton is unavailable or the eligibility checks
fail. The adapter reads a 4-byte counter via a pinned DtoH to learn
``n_survivors`` and then async-DtoHs the compact combined/mask slices.
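The per-box arithmetic listed for the filter kernel can be sketched on the host in plain Python. This is a minimal sketch, not the kernel itself: `decode_box` is a hypothetical name, the clip bounds `[0, orig_w]` / `[0, orig_h]` and the integer output are assumptions, and a stretch resize is assumed so that denormalizing is a per-axis scale.

```python
def decode_box(cx, cy, w, h, orig_w, orig_h):
    """Normalized cxcywh -> clipped, rounded xyxy in original-image pixels.

    Hypothetical host-side reference; clip bounds are an assumption.
    """
    # Center/size form to corner form, still normalized to [0, 1].
    x1, y1 = cx - w / 2.0, cy - h / 2.0
    x2, y2 = cx + w / 2.0, cy + h / 2.0

    # With a stretch resize, denormalizing is a per-axis scale.
    x1, x2 = x1 * orig_w, x2 * orig_w
    y1, y2 = y1 * orig_h, y2 * orig_h

    # Clip to the image bounds.
    def clip(v, hi):
        return min(max(v, 0.0), hi)

    x1, x2 = clip(x1, orig_w), clip(x2, orig_w)
    y1, y2 = clip(y1, orig_h), clip(y2, orig_h)

    # Python's built-in round() rounds halves to the nearest even integer
    # (banker's rounding): 2.5 -> 2, 7.5 -> 8.
    return tuple(round(v) for v in (x1, y1, x2, y2))


print(decode_box(0.5, 0.5, 0.5, 0.5, 10, 10))    # (2, 2, 8, 8)
print(decode_box(0.875, 0.5, 0.5, 0.25, 8, 8))   # (5, 3, 8, 5)
```

The first call shows the half-to-even behaviour (2.5 rounds down, 7.5 rounds up); the second shows a box that overshoots the right edge being clipped to the image width.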
5ff8c7a to
986dbdd
Compare
The non-RLE fullpost path was returning the full num_queries-row scratch buffer to InstanceDetections, exposing uninitialized rows past the survivor counter and leaving the conf column as int32 bits. Wait on done_event, slice combined and mask_bin to [:n_survivors], and reinterpret the conf column with .view(torch.float32) — mirroring the RLE variant. Adds temp/detection_parity_full.py to study the fused path against the torch reference across coco/val2017.
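The two symptoms described above (stale rows past the counter, and a confidence column holding raw int32 bits) can be reproduced without a GPU. The sketch below uses the standard `struct` module as a stand-in for `.view(torch.float32)`; the row layout `[x1, y1, x2, y2, conf_bits, class_id]` and the helper names are assumptions made for illustration only.

```python
import struct


def float_to_int32_bits(x: float) -> int:
    """Bit pattern of a float32, read back as a signed int32."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def int32_bits_to_float(bits: int) -> float:
    """Inverse of the above: reinterpret int32 bits as a float32."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


# Hypothetical int32 scratch buffer with one row per query.
scratch = [
    [10, 20, 110, 220, float_to_int32_bits(0.75), 3],
    [5, 5, 50, 60, float_to_int32_bits(0.5), 1],
    [-1, -1, -1, -1, 123456789, -1],  # stale row past the survivor counter
]
n_survivors = 2  # value read back from the device-side counter

# Slice to the live rows first, then reinterpret the conf column.
live = scratch[:n_survivors]
confs = [int32_bits_to_float(row[4]) for row in live]

print(scratch[0][4])  # 1061158912, the int32 view of 0.75
print(confs)          # [0.75, 0.5]
```

Returning `scratch` unsliced would hand the third, uninitialized row to the caller, and reading column 4 without the reinterpretation would report `1061158912` instead of `0.75`.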
The filter kernel was per-query argmax, but the torch reference does top-k-flat over the (Q*C) sigmoid grid (num_select == num_queries) — a single query can contribute multiple detections, one per class that survives remap + threshold. Per-query argmax silently dropped the secondary classes. Reshape the kernel grid to (num_queries, num_classes_total) so each (q, c) pair is processed independently: load one logit, remap, sigmoid, threshold, emit. Cap output at num_queries to mirror the reference's top-K cap; host clamps n_survivors to combined.shape[0] since the atomic counter increments before the slot guard. Validated on coco/val2017 (1500 images): det counts match exactly 8037/8037 vs 7995/8037 before, and on the same-logits parity script all 1663 matched detections agree to fp32 epsilon when the matcher is class-aware.
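The difference between the two strategies, and the reason the host has to clamp the counter, can be shown with a small pure-Python model. It is a sketch under stated assumptions: class remapping is omitted, the threshold comparison is taken to be strict (`>`), the function names are hypothetical, and a sequential loop stands in for the parallel grid, so slot order here is deterministic whereas on the GPU it would not be.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def per_query_argmax(logits, threshold):
    """Old behaviour: at most one detection per query."""
    out = []
    for q, row in enumerate(logits):
        c = max(range(len(row)), key=row.__getitem__)
        if sigmoid(row[c]) > threshold:
            out.append((q, c))
    return out


def per_pair(logits, threshold, max_out):
    """New behaviour: every (query, class) pair is tested independently."""
    slots = [None] * max_out
    counter = 0  # stands in for the device-side atomic counter
    for q, row in enumerate(logits):
        for c, logit in enumerate(row):
            if sigmoid(logit) > threshold:
                slot = counter   # reserve a slot (atomic_add in the kernel)
                counter += 1     # the counter grows even when the slot is out of range
                if slot < max_out:
                    slots[slot] = (q, c)
    n_survivors = min(counter, max_out)  # host-side clamp
    return slots[:n_survivors], counter


logits = [[2.0, 1.5, -3.0], [-2.0, -1.0, 0.2], [-4.0, -4.0, -4.0]]

print(per_query_argmax(logits, 0.4))       # [(0, 0), (1, 2)]
print(per_pair(logits, 0.4, max_out=3))    # ([(0, 0), (0, 1), (1, 2)], 3)
print(per_pair(logits, 0.4, max_out=2))    # ([(0, 0), (0, 1)], 3)
```

Query 0 clears the threshold for two classes, so argmax drops `(0, 1)` while the per-pair version keeps it. In the last call the raw counter (3) exceeds the buffer size (2), which is why the survivor count is clamped before slicing.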
The custom Triton mask kernel implemented bilinear + threshold with `antialias=False` semantics, but the reference path (`align_instance_segmentation_results`) calls `functional.resize(BILINEAR)`, which defaults to `antialias=True`. The two differ in fp32 reduction order on boundary pixels, which left ~0.3% of masks (26/8037 across 1500 coco/val2017 images at conf=0.4) not pixel-identical to the reference. Drop the mask kernel and replace it with a host-side gather + `F.interpolate(bilinear, antialias=True, align_corners=False)` + `> 0` that matches the reference bit-for-bit. The same eligibility window guarantees no static crop and `size_after_pre_processing == inference_size`, so the new path can skip the canvas/static-crop branches. Validated on coco/val2017 (1500 images): 8037/8037 pixel-identical masks (was 8011/8037). FPS unchanged within noise (the filter kernel dominates the postproc cost; the mask path was always cuDNN-bound).
What does this PR do?
This PR adds an opt-in Triton post-processing fast path for RF-DETR instance
segmentation behind `RFDETR_TRITON_FULLPOSTPROC=true`. It fuses the filter and
box-decode work into a Triton kernel, keeps the existing torch reference path
as the fallback when eligibility checks fail, and updates the adapter to return
only live survivor rows.
It also fixes correctness issues found while validating the fast path:
- Slice the outputs to `n_survivors` before constructing detections.
- Process each `(query, class)` pair independently so the Triton path matches the reference top-k-over-`(Q x C)` semantics.
- Use `F.interpolate(..., antialias=True)` for mask upsampling so masks match the torch reference bit-for-bit.
Related Issue(s): None
Type of Change
Testing
Test details:
- `rfdetr-seg-nano` TRT on `vehicles_312px.mp4` over 538 frames (1 warmup + 4 measured runs per config): mean FPS improved from `114.93` to `137.14` with `RFDETR_TRITON_FULLPOSTPROC=true` (+22.21 FPS, +19.3%).
- `temp/parity_same_logits.py` on 300 images using the exact same `(bboxes, logits, masks)` tensors for both post-processors: `0` count-mismatch images, `0` unmatched detections, mean `|Δscore| = 1.348e-08`, and mean box IoU `0.999843`.
- `temp/detection_parity_full.py` on 1500 `coco/val2017` images with independent forward passes for `RFDETR_TRITON_FULLPOSTPROC=true` and `false`: `8037 / 8037` detections, `100%` IoU>0.5 matches, and `8037 / 8037` pixel-identical masks after switching mask upsampling to `F.interpolate(..., antialias=True)`.
Checklist
hard-to-understand areas
Additional Context
- The fast path runs only when `RFDETR_TRITON_FULLPOSTPROC=true` and the request satisfies the fullpost eligibility checks (`batch=1`, no static crop, single inference size, class remapping present). Otherwise the existing torch reference path is unchanged.
- `inference_models/inference_models/models/rfdetr/triton_fullpostproc.py`
- `inference/core/models/inference_models_adapters.py`
- `development/stream_interface/rfdetr_nano_seg_trt_workflow.py`
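The gating described under Additional Context might look roughly like the following. This is an illustrative sketch only: `Request`, its fields, and `use_fullpost` are hypothetical names, not the PR's actual API, and accepting the flag case-insensitively is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass
class Request:
    """Hypothetical summary of the properties the eligibility checks look at."""
    batch_size: int
    has_static_crop: bool
    size_after_pre_processing: Tuple[int, int]
    inference_size: Tuple[int, int]
    has_class_remapping: bool


def use_fullpost(req: Request, triton_available: bool,
                 env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the opt-in flag is set and every check passes."""
    env = os.environ if env is None else env
    # Opt-in: anything other than "true" keeps the torch reference path.
    if env.get("RFDETR_TRITON_FULLPOSTPROC", "").lower() != "true":
        return False
    if not triton_available:
        return False
    return (
        req.batch_size == 1
        and not req.has_static_crop
        and req.size_after_pre_processing == req.inference_size
        and req.has_class_remapping
    )


req = Request(1, False, (312, 312), (312, 312), True)
flag_on = {"RFDETR_TRITON_FULLPOSTPROC": "true"}

print(use_fullpost(req, True, env=flag_on))   # True
print(use_fullpost(req, True, env={}))        # False: flag not set
print(use_fullpost(req, False, env=flag_on))  # False: Triton unavailable
```

Keeping the decision in one pure function makes the fallback explicit: any failed check returns `False`, and the caller continues on the unchanged torch reference path.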