rfdetr-seg: single-kernel Triton full post-processing by aseembits93 · Pull Request #2378 · roboflow/inference

aseembits93 · 2026-05-28T04:21:01Z

What does this PR do?

This PR opens the full RF-DETR segmentation Triton work behind RFDETR_TRITON_POSTPROC=true:

- Fuse the steady-state RF-DETR TRT post-processing path into one Triton kernel: flat top-k over the (Q x C) sigmoid grid, class remap / thresholding, box decode / scaling / banker's rounding, full mask resize / threshold, packed workflow masks, and optional in-kernel COCO RLE serialization.
- Preserve the torch reference path as the fallback when any Triton fast-path precondition fails.
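The reference semantics that the fused kernel has to reproduce can be sketched in plain Python. This is an illustrative sketch only, not the PR's kernel or the repository's reference code: the function name `flat_topk_decode`, the strict `>` threshold comparison, and the normalized `(cx, cy, w, h)` box layout are assumptions, and class remapping, clipping, and mask handling are left out.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def flat_topk_decode(logits, boxes_cxcywh, num_select, threshold, img_w, img_h):
    """Hypothetical helper: flat top-k over a (Q x C) grid plus box decode.

    logits:       Q rows of C raw class logits.
    boxes_cxcywh: Q rows of normalized (cx, cy, w, h).
    """
    num_classes = len(logits[0])
    # Flatten the grid so every (query, class) pair competes in one top-k.
    flat = [
        (sigmoid(value), q * num_classes + c)
        for q, row in enumerate(logits)
        for c, value in enumerate(row)
    ]
    flat.sort(key=lambda item: -item[0])  # highest score first

    detections = []
    for score, flat_index in flat[:num_select]:
        if score <= threshold:  # assumption: strict ">" keeps a detection
            continue
        q, c = divmod(flat_index, num_classes)
        cx, cy, w, h = boxes_cxcywh[q]
        # cxcywh -> xyxy, then scale to the original image size.
        xyxy = (
            (cx - w / 2) * img_w,
            (cy - h / 2) * img_h,
            (cx + w / 2) * img_w,
            (cy + h / 2) * img_h,
        )
        # Python's round() rounds halves to even (banker's rounding).
        detections.append((tuple(round(v) for v in xyxy), c, score))
    return detections
```

Because the top-k runs over the flattened grid, one query can yield several detections (one per surviving class), all sharing the same decoded box.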

Type of Change

Other: Performance improvement

Testing

I have tested this change locally
I have added/updated tests for this change

Test details:

Performance gains on TensorRT video input. Run with:

RFDETR_TRITON_POSTPROC="false" python development/stream_interface/rfdetr_nano_seg_trt_workflow.py \
  --video_reference vehicles_312px.mp4 --backend trt

RFDETR_TRITON_POSTPROC="true" python development/stream_interface/rfdetr_nano_seg_trt_workflow.py \
  --video_reference vehicles_312px.mp4 --backend trt

vehicles_312px.mp4 (538 frames, src 312x176):

	fps	ms/frame
Torch reference (env=false)	34.84	28.70
Triton fast path (env=true)	40.48	24.70
Delta	+16.2%	-4.00 ms

vehicles_720p.mp4 (538 frames, src 1280x720):

	fps	elapsed	ms/frame
Torch reference (env=false)	15.57	34.56 s	64.24
Triton fast path (env=true)	20.78	25.89 s	48.12
Delta	+33.5%	-8.67 s	-16.12 ms

vehicles_1080p.mp4 (538 frames, src 1920x1080):

	fps	elapsed	ms/frame
Torch reference (env=false)	10.87	49.49 s	92.00
Triton fast path (env=true)	14.62	36.80 s	68.40
Delta	+34.5%	-12.69 s	-23.60 ms
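The columns in the tables above are related by simple arithmetic, which the following sketch spells out. The helper names are hypothetical; because the reported fps values are rounded to two decimals, recomputed figures agree with the tables only to within rounding.

```python
def ms_per_frame(fps: float) -> float:
    # Average latency per frame in milliseconds.
    return 1000.0 / fps


def elapsed_seconds(num_frames: int, fps: float) -> float:
    # Wall-clock time to process the whole clip.
    return num_frames / fps


def speedup_percent(baseline_fps: float, new_fps: float) -> float:
    # Relative throughput gain over the baseline.
    return (new_fps / baseline_fps - 1.0) * 100.0


# 1080p row: torch reference at 10.87 fps vs Triton fast path at 14.62 fps.
gain = speedup_percent(10.87, 14.62)
saved_ms = ms_per_frame(10.87) - ms_per_frame(14.62)
```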

Correctness parity against the torch reference on the full COCO val2017 set (make sure PYTHONPATH picks up the local ./inference_models folder). Run:

python temp/detection_parity_full.py

	Triton fast path (env=true)	Torch reference (env=false)
Triton kernel calls	5000 / 5000	0
Detections	26,721	26,721
Matched at IoU>0.5	26,721 (100%)	—
Mean box IoU	0.999999	—
Mean |Δscore|	1.450e-08	—
Max |Δscore|	1.192e-07	—
Class-id disagreements	0	—
Mean / min mask IoU	0.999289 / 0.000000	—
Pixel-identical masks	26,721 / 26,721	—
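The metrics in the table rest on pairing each reference detection with a fast-path detection. A minimal sketch of a class-aware matcher at IoU > 0.5 follows; it is not the code in temp/detection_parity_full.py. The greedy strategy, the `(box, class_id, score)` tuple layout, and the function names are assumptions.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes; returns 0.0 for an empty union."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(reference, candidate, iou_threshold=0.5):
    """Greedy one-to-one matching that only pairs detections of the same class.

    Each detection is a (box, class_id, score) tuple.
    Returns a list of (reference_index, candidate_index, iou).
    """
    unused = set(range(len(candidate)))
    pairs = []
    for i, (ref_box, ref_cls, _) in enumerate(reference):
        best_j, best_iou = None, iou_threshold
        for j in sorted(unused):
            cand_box, cand_cls, _ = candidate[j]
            if cand_cls != ref_cls:
                continue  # class-aware: never pair across classes
            iou = box_iou(ref_box, cand_box)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            unused.discard(best_j)
            pairs.append((i, best_j, best_iou))
    return pairs
```

Matching by class matters when one query emits several classes on the same box: a class-agnostic matcher could pair the boxes in the wrong order and report spurious score or class-id differences.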

Explicit RLE subprocess parity on 50 coco/val2017 images with mask_format='rle', response_mask_format='rle': 278 / 278 pixel-identical decoded masks, 0 count-mismatch images, max |Δscore| = 1.192e-07.
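For readers unfamiliar with the format, the check "pixel-identical decoded masks" amounts to a round trip through COCO-style run-length encoding. The sketch below shows the uncompressed form (a list of counts over the column-major flattened mask, starting with a run of zeros); it is a plain-Python illustration, not the PR's in-kernel serializer, and it omits the compressed string encoding that COCO tools also support.

```python
def rle_encode(mask):
    """Encode a binary mask (list of rows of 0/1) as uncompressed COCO-style RLE."""
    height, width = len(mask), len(mask[0])
    # COCO RLE walks the mask in column-major order.
    flat = [mask[r][c] for c in range(width) for r in range(height)]
    counts, previous, run = [], 0, 0
    for value in flat:
        if value != previous:
            counts.append(run)  # close the current run (may be 0 at the start)
            previous, run = value, 0
        run += 1
    counts.append(run)
    return {"size": [height, width], "counts": counts}


def rle_decode(rle):
    """Invert rle_encode back to a list of rows of 0/1."""
    height, width = rle["size"]
    flat, value = [], 0
    for count in rle["counts"]:
        flat.extend([value] * count)
        value = 1 - value  # runs alternate between 0 and 1
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]
```

A parity test in this style encodes on one path, decodes, and compares pixel by pixel against the mask produced by the other path.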

Unit tests

pytest inference_models/tests/unit_tests/models/common/test_rle_utils.py -q

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code where necessary, particularly in
hard-to-understand areas
My changes generate no new warnings or errors
I have updated the documentation accordingly (if applicable)

Additional Context

…ULLPOSTPROC

Adds two fused Triton kernels that replace the post-TRT chain for the common rfdetr-seg-nano path (batch=1, no static crop, STRETCH_TO resize, class remapping present):

- _rfdetr_fullpost_filter_kernel: sigmoid/argmax/class remap/conf threshold + cxcywh->xyxy + letterbox denormalize + clip + banker's rounding; atomic_add into a counter to reserve a compact output slot.
- _rfdetr_fullpost_mask_kernel_compact: bilinear upsample masks to orig size + threshold + uint8 emit, with GPU-side early exit against the counter so no CPU sync is needed between the two kernels.

Dispatch is gated on ``RFDETR_TRITON_FULLPOSTPROC=true``; callers keep the torch reference path when Triton is unavailable or the eligibility checks fail. The adapter reads a 4-byte counter via a pinned DtoH to learn ``n_survivors`` and then async-DtoHs the compact combined/mask slices.

The non-RLE fullpost path was returning the full num_queries-row scratch buffer to InstanceDetections, exposing uninitialized rows past the survivor counter and leaving the conf column as int32 bits. Wait on done_event, slice combined and mask_bin to [:n_survivors], and reinterpret the conf column with .view(torch.float32) — mirroring the RLE variant. Adds temp/detection_parity_full.py to study the fused path against the torch reference across coco/val2017.
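The two fixes described above (slice to the survivor count, reinterpret the confidence bits) can be illustrated without a GPU. This sketch uses `struct` to mimic what `.view(torch.float32)` does on an int32 buffer; the row layout `[x1, y1, x2, y2, class_id, conf_bits]` and the function names are hypothetical, chosen only for illustration.

```python
import struct


def float_to_int32_bits(value: float) -> int:
    # Store a float32 in an int32 slot without changing its bit pattern.
    return struct.unpack("<i", struct.pack("<f", value))[0]


def int32_bits_to_float(bits: int) -> float:
    # Reinterpret the same 32 bits as float32 (no numeric conversion).
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def read_survivors(scratch_rows, n_survivors):
    """Return only the valid rows, with the confidence column decoded.

    scratch_rows: int32 rows laid out as [x1, y1, x2, y2, class_id, conf_bits]
                  (hypothetical layout); rows past n_survivors are garbage.
    """
    valid = scratch_rows[:n_survivors]
    return [(row[:4], row[4], int32_bits_to_float(row[5])) for row in valid]


scratch = [
    [2, 2, 8, 8, 3, float_to_int32_bits(0.5)],
    [-17, 99, 12345, 0, -1, 123456789],  # stands in for an uninitialized row
]
detections = read_survivors(scratch, n_survivors=1)
```

Reading the confidence column as an integer would yield 1056964608 instead of 0.5, which is the kind of value the unfixed path exposed.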

The filter kernel used a per-query argmax, but the torch reference does a flat top-k over the (Q*C) sigmoid grid (num_select == num_queries), so a single query can contribute multiple detections, one per class that survives remap + threshold. Per-query argmax silently dropped the secondary classes. Reshape the kernel grid to (num_queries, num_classes_total) so each (q, c) pair is processed independently: load one logit, remap, sigmoid, threshold, emit. Cap output at num_queries to mirror the reference's top-K cap; the host clamps n_survivors to combined.shape[0] because the atomic counter increments before the slot guard. Validated on coco/val2017 (1500 images): detection counts now match exactly (8037/8037, up from 7995/8037), and on the same-logits parity script all 1663 matched detections agree to fp32 epsilon when the matcher is class-aware.
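The difference between the two strategies, and the reason the host has to clamp the counter, can be simulated sequentially in Python. This is a sketch under stated assumptions, not the Triton kernel: the loop stands in for independent (q, c) programs, `counter` models the value returned by an atomic add, class remapping is omitted, and the strict `>` threshold is assumed.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def emit_per_pair(logits, threshold, capacity):
    """Each (query, class) pair decides on its own whether to emit."""
    counter = 0
    slots = [None] * capacity
    for q, row in enumerate(logits):
        for c, logit in enumerate(row):
            score = sigmoid(logit)
            if score <= threshold:
                continue
            slot = counter  # models the old value returned by atomic_add
            counter += 1    # the counter advances before the slot guard
            if slot < capacity:
                slots[slot] = (q, c, score)
    # Host side: the raw counter may exceed capacity, so clamp it.
    n_survivors = min(counter, capacity)
    return slots[:n_survivors], counter


def emit_per_query_argmax(logits, threshold):
    """Earlier behaviour: at most one detection per query."""
    out = []
    for q, row in enumerate(logits):
        c = max(range(len(row)), key=row.__getitem__)
        score = sigmoid(row[c])
        if score > threshold:
            out.append((q, c, score))
    return out
```

With a query whose two top classes both clear the threshold, `emit_per_pair` returns two detections while `emit_per_query_argmax` returns one. In a real parallel launch the order in which pairs claim slots is not fixed, so this sequential loop only models the counting, not the ordering.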

The custom Triton mask kernel implemented bilinear+threshold with `antialias=False` semantics, but the reference path (`align_instance_segmentation_results`) calls `functional.resize(BILINEAR)`, which defaults to `antialias=True`. That produces a different fp32 reduction order on boundary pixels and left ~0.3% of masks not pixel-identical (26/8037 across 1500 coco/val2017 images at conf=0.4). Drop the mask kernel and replace it with a host-side gather + `F.interpolate(bilinear, antialias=True, align_corners=False)` + `> 0` that matches the reference bit-for-bit. The same eligibility window guarantees no static crop and `size_after_pre_processing == inference_size`, so the new path can skip the canvas/static-crop branches. Validated on coco/val2017 (1500 images): 8037/8037 pixel-identical masks (was 8011/8037). FPS unchanged within noise (filter kernel dominates the postproc cost; the mask path was always cuDNN-bound).

fastpath

aseembits93 · 2026-06-03T01:45:02Z

Opening another PR with an alternative kernel which is even faster for larger images.

aseembits93 added 27 commits May 15, 2026 10:03

moving to rle path

986dbdd

workflow script

29c7627

move env var to inference_models/configuration.py

1804cd3

refactoring to change var names

fe46c57

Fuse full RF-DETR Triton post-processing

c9db910

Stop forwarding response mask format in adapter

d3030bf

Pack fused detections into one host copy

2ffb62b

Drop redundant Triton postproc counter reset

3330429

Defer RF-DETR survivor count to adapter

9bfb5b1

Bit-pack RF-DETR workflow mask transfer

90cde16

restore parity

0af52c8

Add RF-DETR postprocess microbench

bc84b9d

Use lru_cache for pure Triton postproc caches

fcd23f2

reduce preconditions on kernel eligibility

79bd5bd

fastpath

cd078a2

Merge pull request #41 from aseembits93/perf-rfdetr-fullpost-fastpath

0a7a2b4

fastpath

add correctness and integration test

5ff1297

Merge branch 'main' into postproc-fullkernel

6365b45

bound pinned memory with unit tests

6008c61

revert import changes

d420ab5

make style make check_code_quality

3c2fa90

patch benchmark scripts to work on jetson

9cb4033

bugfix

308c1d8

aseembits93 requested review from PawelPeczek-Roboflow, grzegorz-roboflow and hansent as code owners May 28, 2026 04:21

aseembits93 requested review from dkosowski87, probicheaux, rafel-roboflow and yeldarby as code owners May 28, 2026 04:21

aseembits93 closed this Jun 3, 2026

aseembits93 wants to merge 27 commits into roboflow:main from aseembits93:postproc-fullkernel
