rfdetr-seg: single-kernel Triton full post-processing #2378

Closed
aseembits93 wants to merge 27 commits into roboflow:main from aseembits93:postproc-fullkernel
@aseembits93
Contributor

What does this PR do?

This PR opens the full RF-DETR segmentation Triton work behind RFDETR_TRITON_POSTPROC=true:

  • Fuse the steady-state RF-DETR TRT post-processing path into one Triton
    kernel: flat top-k over the (Q x C) sigmoid grid, class remap /
    thresholding, box decode / scaling / banker's rounding, full mask resize /
    threshold, packed workflow masks, and optional in-kernel COCO RLE
    serialization.
  • Preserve the torch reference path as the fallback when any Triton fast-path
    precondition fails.
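
The gating described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: `triton_postproc_enabled`, `is_eligible`, `post_process`, and the request keys are hypothetical names, and the precondition list (batch=1, no static crop, STRETCH_TO resize, class remapping present) is taken from the commit notes further down.

```python
import os


def triton_postproc_enabled(env=os.environ):
    # Assumption: only the literal string "true" (case-insensitive) enables the fast path.
    return env.get("RFDETR_TRITON_POSTPROC", "false").strip().lower() == "true"


def is_eligible(batch_size, has_static_crop, resize_mode, has_class_remap, triton_available):
    # Hypothetical eligibility check mirroring the preconditions named in the commit notes.
    return (
        triton_available
        and batch_size == 1
        and not has_static_crop
        and resize_mode == "STRETCH_TO"
        and has_class_remap
    )


def post_process(request, fast_path, reference_path, env=os.environ):
    # Use the fused path only when the flag is on AND every precondition holds;
    # otherwise fall back to the reference path.
    if triton_postproc_enabled(env) and is_eligible(**request):
        return fast_path()
    return reference_path()
```

With the flag set and an eligible request, `post_process` calls `fast_path`; a batch size of 2 or an unset flag routes to `reference_path` instead.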

Type of Change

  • Other: Performance improvement

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Test details:

  • Performance gains on TensorRT video input. Run with:
RFDETR_TRITON_POSTPROC="false" python development/stream_interface/rfdetr_nano_seg_trt_workflow.py \
  --video_reference vehicles_312px.mp4 --backend trt

RFDETR_TRITON_POSTPROC="true" python development/stream_interface/rfdetr_nano_seg_trt_workflow.py \
  --video_reference vehicles_312px.mp4 --backend trt

vehicles_312px.mp4 (538 frames, src 312x176):

| | fps | ms/frame |
|---|---|---|
| Torch reference (env=false) | 34.84 | 28.70 |
| Triton fast path (env=true) | 40.48 | 24.70 |
| Delta | +16.2% | -4.00 ms |

vehicles_720p.mp4 (538 frames, src 1280x720):

| | fps | elapsed | ms/frame |
|---|---|---|---|
| Torch reference (env=false) | 15.57 | 34.56 s | 64.24 |
| Triton fast path (env=true) | 20.78 | 25.89 s | 48.12 |
| Delta | +33.5% | -8.67 s | -16.12 ms |

vehicles_1080p.mp4 (538 frames, src 1920x1080):

| | fps | elapsed | ms/frame |
|---|---|---|---|
| Torch reference (env=false) | 10.87 | 49.49 s | 92.00 |
| Triton fast path (env=true) | 14.62 | 36.80 s | 68.40 |
| Delta | +34.5% | -12.69 s | -23.60 ms |

  • Correctness parity on the full COCO val2017 set. Make sure PYTHONPATH picks up the local ./inference_models folder, then run:
python temp/detection_parity_full.py
| | Triton fast path (env=true) | Torch reference (env=false) |
|---|---|---|
| Triton kernel calls | 5000 / 5000 | 0 |
| Detections | 26,721 | 26,721 |
| Matched at IoU>0.5 | 26,721 (100%) | |
| Mean box IoU | 0.999999 | |
| Mean \|Δscore\| | 1.450e-08 | |
| Max \|Δscore\| | 1.192e-07 | |
| Class-id disagreements | 0 | |
| Mean / min mask IoU | 0.999289 / 0.000000 | |
| Pixel-identical masks | 26,721 / 26,721 | |

  • Explicit RLE subprocess parity on 50 coco/val2017 images with
    mask_format='rle', response_mask_format='rle'
    -> 278 / 278 pixel-identical decoded masks, 0 count-mismatch images,
    max |Δscore| = 1.192e-07.

  • Unit tests

pytest inference_models/tests/unit_tests/models/common/test_rle_utils.py -q
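
For readers unfamiliar with the mask format being compared, here is a minimal sketch of uncompressed COCO-style RLE: the mask is flattened in column-major order and stored as alternating run lengths that always start with a run of zeros. The helper names `rle_encode` and `rle_decode` are hypothetical, and this is not the PR's in-kernel serializer or the compressed string form that pycocotools emits.

```python
def rle_encode(mask):
    """Encode a binary mask (list of rows of 0/1) as uncompressed COCO-style RLE."""
    h, w = len(mask), len(mask[0])
    # COCO RLE walks the mask column by column (Fortran order).
    flat = [mask[r][c] for c in range(w) for r in range(h)]
    counts, prev, run = [], 0, 0
    for v in flat:
        if v != prev:
            # Close the current run; a leading 1 yields an initial zero-length run of 0s.
            counts.append(run)
            run, prev = 0, v
        run += 1
    counts.append(run)
    return {"size": [h, w], "counts": counts}


def rle_decode(rle):
    """Invert rle_encode back into a list of rows."""
    h, w = rle["size"]
    flat, v = [], 0
    for n in rle["counts"]:
        flat.extend([v] * n)
        v = 1 - v  # runs alternate between 0s and 1s
    return [[flat[c * h + r] for c in range(w)] for r in range(h)]
```

A pixel-parity check like the one reported above amounts to decoding both sides and comparing the resulting masks element by element.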

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in
    hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Additional Context

…ULLPOSTPROC

Adds two fused Triton kernels that replace the post-TRT chain for the
common rfdetr-seg-nano path (batch=1, no static crop, STRETCH_TO resize,
class remapping present):

  - _rfdetr_fullpost_filter_kernel: sigmoid/argmax/class remap/conf threshold
    + cxcywh->xyxy + letterbox denormalize + clip + banker's rounding;
    atomic_add into a counter to reserve a compact output slot.
  - _rfdetr_fullpost_mask_kernel_compact: bilinear upsample masks to orig
    size + threshold + uint8 emit, with GPU-side early exit against the
    counter so no CPU sync is needed between the two kernels.
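
As a rough CPU-side illustration of the box half of the filter kernel, the sketch below decodes one normalized cxcywh box to xyxy, scales it to the original image, clips, and rounds half-to-even. `decode_box` is a hypothetical helper; scaling directly by the original width and height is an assumption that fits the STRETCH_TO case, and Python's built-in `round` is used because it already implements banker's rounding.

```python
def decode_box(cx, cy, w, h, orig_w, orig_h):
    """Decode a normalized (cx, cy, w, h) box into integer pixel (x1, y1, x2, y2)."""
    # cxcywh -> xyxy, then scale to the original image size (assumed stretch resize).
    x1 = (cx - w / 2) * orig_w
    y1 = (cy - h / 2) * orig_h
    x2 = (cx + w / 2) * orig_w
    y2 = (cy + h / 2) * orig_h

    # Clip to the image bounds before rounding.
    x1, x2 = (min(max(v, 0.0), float(orig_w)) for v in (x1, x2))
    y1, y2 = (min(max(v, 0.0), float(orig_h)) for v in (y1, y2))

    # round() rounds exact halves to the nearest even integer (banker's rounding).
    return round(x1), round(y1), round(x2), round(y2)
```

For example, `decode_box(0.5, 0.5, 0.5, 0.5, 10, 10)` yields corner coordinates 2.5 and 7.5 before rounding, which become 2 and 8 rather than 3 and 8.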

Dispatch is gated on ``RFDETR_TRITON_FULLPOSTPROC=true``; callers keep the
torch reference path when Triton is unavailable or the eligibility checks
fail. The adapter reads a 4-byte counter via a pinned DtoH to learn
``n_survivors`` and then async-DtoHs the compact combined/mask slices.

The non-RLE fullpost path was returning the full num_queries-row scratch
buffer to InstanceDetections, exposing uninitialized rows past the
survivor counter and leaving the conf column as int32 bits. Wait on
done_event, slice combined and mask_bin to [:n_survivors], and
reinterpret the conf column with .view(torch.float32) — mirroring the
RLE variant.
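
The bug and its fix can be illustrated without a GPU. In the sketch below, a scratch buffer stores each confidence as the int32 bit pattern of a float32, and only the first `n_survivors` rows are valid. The row layout and the helper names (`float_to_int32_bits`, `int32_bits_to_float`, `finalize`) are assumptions for illustration; in torch the reinterpretation is what `.view(torch.float32)` does.

```python
import struct


def float_to_int32_bits(x):
    # Reinterpret the 4 bytes of a float32 as a signed 32-bit integer.
    return struct.unpack("<i", struct.pack("<f", x))[0]


def int32_bits_to_float(bits):
    # Inverse reinterpretation: the same 4 bytes read back as a float32.
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def finalize(scratch, n_survivors):
    """Keep only valid rows and turn the conf column back into a float.

    Assumed row layout: [x1, y1, x2, y2, class_id, conf_bits].
    """
    out = []
    for row in scratch[:n_survivors]:  # rows past the counter are uninitialized
        *rest, conf_bits = row
        out.append(rest + [int32_bits_to_float(conf_bits)])
    return out
```

Returning `scratch` unsliced would expose garbage rows, and reading `conf_bits` as an integer would report a confidence of 0.75 as 1061158912.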

Adds temp/detection_parity_full.py to study the fused path against the
torch reference across coco/val2017.

The filter kernel was per-query argmax, but the torch reference does
top-k-flat over the (Q*C) sigmoid grid (num_select == num_queries) — a
single query can contribute multiple detections, one per class that
survives remap + threshold. Per-query argmax silently dropped the
secondary classes.
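
A small pure-Python sketch makes the difference concrete. The function names and the toy logits are hypothetical, and class remapping is left out to keep the example short.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def per_query_argmax(logits, threshold):
    """One candidate per query: its best class, kept if above threshold."""
    out = []
    for q, row in enumerate(logits):
        c = max(range(len(row)), key=row.__getitem__)
        if sigmoid(row[c]) > threshold:
            out.append((q, c))
    return out


def flat_topk(logits, threshold, k):
    """Top-k over the flattened (query, class) grid, then threshold."""
    scored = [
        (sigmoid(v), q, c)
        for q, row in enumerate(logits)
        for c, v in enumerate(row)
    ]
    scored.sort(key=lambda t: -t[0])  # highest score first
    return [(q, c) for s, q, c in scored[:k] if s > threshold]


# Two queries, three classes; query 0 is confident about two classes.
logits = [[2.2, 1.4, -2.0], [-1.4, -0.8, -2.2]]
argmax_dets = per_query_argmax(logits, threshold=0.5)       # [(0, 0)]
topk_dets = flat_topk(logits, threshold=0.5, k=len(logits))  # [(0, 0), (0, 1)]
```

With `k` equal to the number of queries, the flat top-k keeps both classes of query 0, while the per-query argmax keeps only one of them.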

Reshape the kernel grid to (num_queries, num_classes_total) so each
(q, c) pair is processed independently: load one logit, remap, sigmoid,
threshold, emit. Cap output at num_queries to mirror the reference's
top-K cap; host clamps n_survivors to combined.shape[0] since the
atomic counter increments before the slot guard.
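
The reason for the host-side clamp can be shown with a sequential simulation of the slot reservation. This is a sketch under assumptions: `run_filter` is a hypothetical stand-in for the kernel, and a real GPU launch would assign slots in a nondeterministic order rather than in input order.

```python
def run_filter(scores, threshold, capacity):
    """Simulate 'atomic_add to reserve a slot, then guard the write'."""
    counter = 0
    slots = [None] * capacity
    for s in scores:
        if s > threshold:
            slot = counter      # atomic_add returns the old value...
            counter += 1        # ...and the counter is bumped unconditionally
            if slot < capacity:  # slot guard: only in-range writes land
                slots[slot] = s
    return counter, slots


counter, slots = run_filter([0.9, 0.1, 0.8, 0.7, 0.6], threshold=0.5, capacity=3)
# The counter can overshoot the buffer, so the host clamps before slicing.
n_survivors = min(counter, len(slots))
survivors = slots[:n_survivors]
```

Here four candidates pass the threshold but only three slots exist, so the raw counter reads 4 while the clamped `n_survivors` is 3.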

Validated on coco/val2017 (1500 images): det counts match exactly
8037/8037 vs 7995/8037 before, and on the same-logits parity script
all 1663 matched detections agree to fp32 epsilon when the matcher is
class-aware.

The custom Triton mask kernel implemented bilinear+threshold with
`antialias=False` semantics, but the reference path
(`align_instance_segmentation_results`) calls
`functional.resize(BILINEAR)`, which defaults to `antialias=True`. The two
produce a different fp32 reduction order on boundary pixels, which flipped
pixels in ~0.3 % of masks (26/8037 across 1500 coco/val2017 images at
conf=0.4).
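
To see why reduction order alone can flip a thresholded pixel, the sketch below emulates float32 accumulation in pure Python and sums the same three terms in two orders. The values are deliberately exaggerated for illustration and are not taken from the kernel; `f32` and `sum_f32` are hypothetical helpers.

```python
import struct


def f32(x):
    # Round a Python float (float64) to the nearest float32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_f32(terms):
    """Accumulate left to right, rounding to float32 after every add."""
    acc = 0.0
    for t in terms:
        acc = f32(acc + f32(t))
    return acc


# Same three terms, two accumulation orders.
order_a = sum_f32([1e8, -1e8, 1e-3])  # the large terms cancel first, 1e-3 survives
order_b = sum_f32([1e-3, -1e8, 1e8])  # 1e-3 is absorbed by -1e8, the result is 0.0

pixel_a = order_a > 0  # True
pixel_b = order_b > 0  # False
```

Real bilinear weights are far better conditioned than this, which fits with only a small fraction of boundary pixels being affected.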

Drop the mask kernel and replace it with a host-side gather +
`F.interpolate(bilinear, antialias=True, align_corners=False)` + `> 0`
that matches the reference bit-for-bit. Same eligibility window guarantees
no static crop and `size_after_pre_processing == inference_size`, so the
new path can skip the canvas/static-crop branches.

Validated on coco/val2017 (1500 images): 8037/8037 pixel-identical masks
(was 8011/8037). FPS unchanged within noise (filter kernel dominates the
postproc cost; the mask path was always cuDNN-bound).
@aseembits93
Contributor Author

Opening another PR with an alternative kernel which is even faster for larger images.

@aseembits93 aseembits93 closed this Jun 3, 2026