perf(rfdetr-seg): Triton fusion + CUDA graphs + depth-2 frame pipeline (109 → 241 FPS on vehicles_312px) by aseembits93 · Pull Request #24 · aseembits93/inference

aseembits93 · 2026-04-29T22:53:54Z

Summary

Standalone PR: RF-DETR nano seg TensorRT e2e latency optimizations through to a depth-2 frame pipeline, benchmarked on vehicles_312px.mp4 (538 frames, 312×176) with the minimal InferencePipeline benchmark script.

Metric	Baseline	depth=1 (default after this PR)	depth=2 (opt-in via `RFDETR_PIPELINE_DEPTH=2`)
Avg FPS (Tesla T4, FP16 TRT)	109	151 (+38%)	241 (+121%)
Parity vs v16 best	—	0 diff across 538 frames	0 diff on 537 shifted pairs

All new paths are opt-in behind env vars; defaults preserve current behavior (except for the two always-on CPU-side fixes noted below).

Benchmark:

python development/stream_interface/rfdetr_nano_seg_trt_workflow.py \
    --video_reference /home/ubuntu/inference/vehicles_312px.mp4

Flags to enable (depth=1, matches 151 FPS row):

RFDETR_USE_TRITON_PREPROC=true
RFDETR_TRITON_FULLPOSTPROC=true
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true

Flags to enable (depth=2, matches 241 FPS row): above three plus RFDETR_PIPELINE_DEPTH=2.
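
For repeatable runs, the two flag sets above can be applied from a small launcher. This is a minimal sketch, not part of the PR: `benchmark_env` and `run_benchmark` are hypothetical helper names, and the script path is assumed to be relative to the repository root.

```python
import os
import subprocess
import sys

# Flags listed above for the depth=1 (151 FPS) configuration.
BASE_FLAGS = {
    "RFDETR_USE_TRITON_PREPROC": "true",
    "RFDETR_TRITON_FULLPOSTPROC": "true",
    "ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND": "true",
}


def benchmark_env(depth=1, base_env=None):
    """Return a copy of the environment with the optimization flags set."""
    if depth not in (1, 2):
        raise ValueError("depth must be 1 or 2")
    env = dict(os.environ if base_env is None else base_env)
    env.update(BASE_FLAGS)
    # depth=1 is the documented default; it is set explicitly for clarity.
    env["RFDETR_PIPELINE_DEPTH"] = str(depth)
    return env


def run_benchmark(video_path, depth=1):
    """Launch the benchmark script (hypothetical helper, run from repo root)."""
    cmd = [
        sys.executable,
        "development/stream_interface/rfdetr_nano_seg_trt_workflow.py",
        "--video_reference",
        video_path,
    ]
    return subprocess.run(cmd, env=benchmark_env(depth), check=True)
```

Calling `run_benchmark("/home/ubuntu/inference/vehicles_312px.mp4", depth=2)` would correspond to the 241 FPS row, assuming the same hardware and engine.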

Commit-by-commit

#	Commit	Summary
1	`419c5a7` initial point	Branch starting point.
2	`9ec6a50` fused Triton pre/postproc kernels, GPU mask decode, stream sync reduction	First landing: Triton preprocess kernel, Triton conf-filter postprocess, CUDA-graph enablement, drop pre/post stream syncs.
3	`807ea1a` replace benchmark script with minimal InferencePipeline-based version	Switch benchmark harness to a 116-line InferencePipeline driver so measurements match production code paths.
4	`5ff27d1` full Triton postproc fusion + deferred counter sync (W2)	Collapses the post-TRT chain into two Triton kernels (`_rfdetr_fullpost_filter_kernel` + `_rfdetr_fullpost_mask_kernel_compact`). Removes the CPU-blocking `counter.item()` in the Triton wrapper; counter is now pinned-DtoH'd from the adapter under a `torch.cuda.Event` guard. Includes atomic-counter compaction, `_trt_reuse_as_input_buffer` marker (eliminates per-frame DtoD from preproc output → graph input buffer), pinned async DtoH, cross-stream event plumbing (no more `stream.synchronize()` in the graph-replay branch).
5	`c1406a8` cache per-call scratch + class_mapping int32 view (W7)	Eliminates two stray per-frame kernel launches that nsys flagged: `direct_copy_kernel_cuda` (from `class_mapping.to(dtype=torch.int32)` every frame — now cached by `id()`) and the three `torch.empty` allocator calls per frame (now a persistent scratch cache keyed on `(num_queries, device)`).
6	`1b169b7` depth-2 frame pipeline (151 → 241 FPS)	Opt-in ping-pong of Triton postproc scratch slots so the adapter drains frame N-1's results while frame N's kernels are launching. Gated behind `RFDETR_PIPELINE_DEPTH=2`; default 1 is bit-exact-unchanged.

What's in

New Triton kernels

inference_models/models/rfdetr/triton_preprocess.py — fused stretch-to resize + BGR→RGB + /255 + ImageNet normalize. Replaces ~8 torch CUDA kernels with 1.
inference_models/models/rfdetr/triton_postprocess.py — fused sigmoid + argmax-over-classes + class-remap + confidence-threshold filter. Used when RFDETR_TRITON_POSTPROC=true alone.
inference_models/models/rfdetr/triton_fullpostproc.py — full post-process fusion:
- _rfdetr_fullpost_filter_kernel — per-query: sigmoid argmax + class remap + conf threshold + cxcywh→xyxy + letterbox-denormalize + clip + banker's rounding; tl.atomic_add(counter, 1) reserves a compact output slot.
- _rfdetr_fullpost_mask_kernel_compact — GPU-side bilinear upsample 78×78 → orig_h × orig_w + threshold > 0 + uint8 emit. Reads counter on GPU for early-exit so no CPU sync between the two launches.
- Returns unsliced buffers + counter + done_event; adapter does the host-side slice after a single pinned counter DtoH.
- depth-2 extension: _get_scratch_buffers / _get_mask_bin_buffer keyed by (num_queries, device, slot_idx); triton_rfdetr_fullpost accepts num_slots (default 1 = legacy single-slot).
inference/models/rfdetr/triton_preprocess.py — same preprocess kernel wired into the legacy inference/models/rfdetr path for parity.
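
The per-query work of the filter kernel can be illustrated with a sequential CPU reference. This is a sketch under stated assumptions, not the Triton kernel itself: the function name, the row layout, the `<` threshold comparison, the clip-then-round order, and denormalizing by multiplying with the original width and height are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def filter_queries(logits, boxes_cxcywh, class_mapping, threshold, orig_w, orig_h):
    """Sequential reference for the per-query filter step (hypothetical).

    logits:        one list of per-class logits per query
    boxes_cxcywh:  one (cx, cy, w, h) tuple per query, normalized to [0, 1]
    Returns (rows, counter); rows[:counter] holds the compacted survivors.
    """
    rows = [None] * len(logits)  # unsliced output buffer, one row per query
    counter = 0                  # stands in for the GPU atomic counter
    for q, (query_logits, (cx, cy, w, h)) in enumerate(zip(logits, boxes_cxcywh)):
        # sigmoid is monotonic, so argmax over logits equals argmax over scores
        best = max(range(len(query_logits)), key=query_logits.__getitem__)
        conf = sigmoid(query_logits[best])
        if conf < threshold:  # assumed comparison
            continue
        # cxcywh -> xyxy, scaled to the original image size
        corners = (
            (cx - w / 2) * orig_w,
            (cy - h / 2) * orig_h,
            (cx + w / 2) * orig_w,
            (cy + h / 2) * orig_h,
        )
        limits = (orig_w, orig_h, orig_w, orig_h)
        # clip to the image, then round; Python's round() rounds half to even
        xyxy = tuple(round(min(max(v, 0.0), lim)) for v, lim in zip(corners, limits))
        slot = counter  # an atomic add returns the old value as the slot index
        counter += 1
        rows[slot] = (*xyxy, conf, class_mapping[best], q)
    return rows, counter
```

In a parallel kernel the slot a query receives generally depends on which program reaches the atomic first, so this loop models only the compaction, not the ordering.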

Wiring / dispatch

inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
- Triton preprocess fast-path in pre_process() with strict guard (STRETCH_TO, numpy HWC BGR uint8, no static crop).
- Pre-allocated input buffer with _trt_reuse_as_input_buffer marker — graph capture bakes this tensor's address as the engine's input buffer, eliminating the per-frame 1.1 MB DtoD copy.
- Pinned host staging buffer (_fast_src_host_pinned) so copy_(non_blocking=True) actually runs async.
- Cross-stream event recording (_fast_preproc_event) instead of pre_process_stream.synchronize().
inference_models/models/rfdetr/common.py — dispatches RFDETR_TRITON_FULLPOSTPROC first (batch=1, no static crop, no nonsquare-intermediate resize, class remapping active), falls back to Triton conf-filter, then to torch. Reads RFDETR_PIPELINE_DEPTH and passes num_slots=RFDETR_PIPELINE_DEPTH to triton_rfdetr_fullpost.
inference_models/models/common/trt.py:
- TRTCudaGraphState.consumer_done_event lets the next graph replay wait on the consumer's last use of the output buffers (avoids DtoD clones of output buffers).
- _capture_cuda_graph(use_pre_processed_images_as_input_buffer=...) supports capturing with an externally-owned input tensor.
- Removed stream.synchronize() in infer_from_trt_engine's graph-replay branch; replaced with a produce_event recorded on the graph's own stream.
inference/core/models/inference_models_adapters.py
- GPU fast-path for the _combined_gpu / _counter_gpu / _postproc_done_event side channels: waits the done_event, pinned-DtoH's the 4-byte counter, syncs once to read n_survivors, then slices combined[:n] + mask[:n] and pinned-DtoH's both async with a second sync. Bitcasts column 4 of the combined int32 buffer to fp32 via numpy.view(np.float32) for confidence.
- Factored per-det response-build into _build_response_from_det (used by both depth=1 and depth=2 code paths for bit-exact responses).
- depth=2 branch: saves (det, H, W, class_filter) as _pipeline_pending so the scratch slot stays alive; drains previous pending (waits its done_event, pinned-DtoH counter+combined+mask, builds response, returns frame N-1's response). First call returns an empty response.
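
The host-side slice and the confidence bitcast can be sketched without a GPU. The example below uses the standard library's `struct` module as a stand-in for `numpy.view(np.float32)`. The description only states that column 4 carries the confidence bits, so the rest of the row layout (x1, y1, x2, y2, confidence bits, class id) is an assumption, as are the function names.

```python
import struct


def float_to_int32_bits(value):
    """Pack an fp32 value into the int32 with the same bit pattern."""
    return struct.unpack("<i", struct.pack("<f", value))[0]


def int32_bits_to_float(bits):
    """Reinterpret an int32 bit pattern as fp32 (inverse of the above)."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def decode_rows(combined, n_survivors):
    """Slice the unsliced buffer to the survivors and decode each row.

    Assumed row layout (hypothetical): x1, y1, x2, y2, conf_bits, class_id.
    """
    detections = []
    for x1, y1, x2, y2, conf_bits, class_id in combined[:n_survivors]:
        detections.append(
            {
                "xyxy": (x1, y1, x2, y2),
                "confidence": int32_bits_to_float(conf_bits),
                "class_id": class_id,
            }
        )
    return detections
```

Storing the confidence as raw bits lets one int32 buffer carry both integer box coordinates and a float score, so a single DtoH copy moves the whole row.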

Scratch caching (W7)

_SCRATCH_CACHE — combined/survivor_idx/mask_any/counter tensors reused across frames, keyed by (num_queries, device) (+ slot_idx for depth=2).
_CLASS_MAPPING_INT32_CACHE — int32 view of the class_mapping tensor cached by id(source_tensor); previously re-converted every frame because upstream stores it as int64.
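
The two caches follow a common keyed-reuse pattern, sketched below with plain Python lists standing in for torch tensors. The function bodies, the six-column row width, and the choice to store the source object next to the converted copy are assumptions for illustration; only the cache names and key shapes come from the description above.

```python
_SCRATCH_CACHE = {}
_CLASS_MAPPING_INT32_CACHE = {}


def get_scratch_buffers(num_queries, device, slot_idx=0):
    """Return persistent scratch for this (num_queries, device, slot) key."""
    key = (num_queries, device, slot_idx)
    buffers = _SCRATCH_CACHE.get(key)
    if buffers is None:
        # Allocated once; later calls reuse the same objects (stable addresses).
        buffers = {
            "combined": [[0] * 6 for _ in range(num_queries)],
            "survivor_idx": [0] * num_queries,
            "mask_any": [False] * num_queries,
            "counter": [0],
        }
        _SCRATCH_CACHE[key] = buffers
    buffers["counter"][0] = 0  # reset on every call, like counter.zero_()
    return buffers


def get_class_mapping_int32(class_mapping):
    """Return a converted copy of class_mapping, cached by id() of the source."""
    key = id(class_mapping)
    cached = _CLASS_MAPPING_INT32_CACHE.get(key)
    if cached is None:
        # Keep a reference to the source so its id() cannot be recycled
        # by a different object while the cache entry is alive.
        cached = (class_mapping, [int(v) for v in class_mapping])
        _CLASS_MAPPING_INT32_CACHE[key] = cached
    return cached[1]
```

One design point worth noting: an `id()`-keyed cache is only safe while the source object stays alive, which is why this sketch holds a reference to it.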

Benchmark harness

development/stream_interface/rfdetr_nano_seg_trt_workflow.py — replaced with a minimal 116-line InferencePipeline-based driver. Uses the roboflow_core/roboflow_instance_segmentation_model@v3 workflow block; single video source, single sink, no annotators/buffering/rate-limiting. --backend flag pre-import sets DISABLED_INFERENCE_MODELS_BACKENDS for unambiguous backend attribution.

Env vars (all opt-in)

Var	Default	Effect
`RFDETR_USE_TRITON_PREPROC`	off	Enable fused Triton preprocess kernel
`RFDETR_TRITON_POSTPROC`	off	Enable Triton conf-filter (partial postproc fusion)
`RFDETR_TRITON_FULLPOSTPROC`	off	Enable full postproc fusion (W2)
`RFDETR_GPU_POSTPROCESS`	on	GPU mask-emptiness skip
`RFDETR_DISABLE_GPU_PREPROC`	off	Opt-out for torch GPU preprocess branch
`ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND`	off	Enables `TRTCudaGraphCache`
`RFDETR_PIPELINE_DEPTH`	`1`	Set to `2` to enable depth-2 frame pipeline (see caveats below)

How depth-2 works

Before depth-2, the inference thread was serial per frame:

HtoD -> TRT -> Triton postproc -> counter DtoH (sync) -> mask DtoH (sync) -> build response -> return

cudaStreamSynchronize was 80.9% of CUDA API time per nsys — the CPU sat idle waiting for GPU work to complete, then blocked on DtoH.

Depth-2 ping-pongs the Triton postproc scratch buffers between two slots so the adapter can drain frame N-1's results while frame N's kernels are launching into the other slot. The cudaStreamSynchronize cost is now overlapped with the next frame's GPU work instead of blocking the workflow thread.

Per-frame flow (depth=2):

Frame N's Triton kernels write to scratch slot N % 2.
Adapter's postprocess() receives frame N's detections.
Instead of draining frame N's scratch, it:
- Saves (det, H, W, class_filter) as the new _pipeline_pending — the refs keep slot N % 2 alive.
- Drains the previous pending (slot (N-1) % 2): waits its done_event, pinned-DtoH counter+combined+mask, builds response.
- Returns frame N-1's response.
First call (frame 0): no previous pending → returns an empty response.

The scratch slot that frame N writes is different from the slot frame N-1 is reading, so the two frames' work genuinely runs in parallel on the GPU.
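
The hand-off described above can be modeled in a few lines of plain Python. This is a toy sketch of the control flow only, with no CUDA events or copies: `Depth2Adapter`, `slot_for_frame`, and the `build_response` callback are hypothetical names, and `det` stands for whatever the adapter receives per frame.

```python
class Depth2Adapter:
    """Toy model of the depth-2 hand-off: return frame N-1 while holding frame N."""

    def __init__(self, build_response):
        self._build_response = build_response
        self._pipeline_pending = None  # frame N-1's undrained detections

    def postprocess(self, det):
        previous = self._pipeline_pending
        self._pipeline_pending = det           # keeps frame N's slot alive
        if previous is None:
            return []                          # first call: nothing to drain
        return self._build_response(previous)  # drain frame N-1


def slot_for_frame(frame_idx, num_slots=2):
    """Scratch slot written by a given frame (ping-pong for num_slots=2)."""
    return frame_idx % num_slots
```

Feeding frames 0 to 3 through `postprocess` yields an empty response followed by the responses for frames 0, 1, and 2, while frame 3 remains pending.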

Caveats (depth=2 only, clearly gated)

Aspect	depth=1	depth=2
Response-to-frame binding	exact	shifted by 1 (response_N-1 paired with video_frame_N in `on_prediction`)
Frame 0	normal response	empty response
Final frame	emitted	dropped (no subsequent call to drain its pending slot)
Per-frame wall time on workflow thread	~6.6 ms	~4.1 ms

At 241 FPS the 1-frame skew is ~4 ms between a detection's underlying pixels and the video_frame it's paired with in the sink callback. Imperceptible for annotation rendering, dashboards, downstream analytics. For strict frame-detection pairing (e.g. training label generation), keep depth=1.

The final-frame drop is a real loss (one Prediction per stream). If that matters, an explicit flush() method could be added on adapter teardown — not plumbed in this PR because InferencePipeline doesn't currently call into the model on shutdown.
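
As a sketch of what such a flush could look like, assuming the pending slot is a single stored item: the class and method names below are hypothetical and nothing here exists in this PR.

```python
class PendingSlot:
    """Holds at most one undrained frame (hypothetical stand-in for _pipeline_pending)."""

    def __init__(self):
        self._pending = None

    def push(self, det):
        """Store frame N and hand back frame N-1 (None on the first call)."""
        previous, self._pending = self._pending, det
        return previous

    def flush(self):
        """Hand back the last stored frame and clear the slot."""
        last, self._pending = self._pending, None
        return last
```

A teardown hook that calls `flush()` once would recover the final frame's detections; a second call returns `None`, so it is safe to call more than once.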

Parity verification

depth=1 (default): 0 diff across all 538 frames vs v16 best (bit-exact xyxy, confidence, class_id, per-detection mask MD5).
depth=2:
- Frame 0: empty response — matches "warmup slot" design.
- Frames 1..537: bit-exact match to frames 0..536 of the depth=1 capture. Zero mismatches across 537 shifted pairs.
- Frame 537 of depth=1 (1 detection, c2:0.9024:68,153,105,176): dropped in depth=2 (stream ended before next call could drain its pending slot).
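
The shifted comparison can be expressed as a small checker. This is a sketch of the idea rather than the script used for the numbers above: the detection record layout (xyxy, confidence, class_id, raw mask bytes) and both function names are assumptions.

```python
import hashlib


def fingerprint(detection):
    """Reduce a detection to the compared fields (hypothetical record layout)."""
    xyxy, confidence, class_id, mask_bytes = detection
    return (tuple(xyxy), confidence, class_id, hashlib.md5(mask_bytes).hexdigest())


def shifted_mismatches(depth1_frames, depth2_frames):
    """Compare depth=2 frame i+1 against depth=1 frame i.

    Returns the depth=1 frame indices whose detections do not match.
    """
    mismatches = []
    for i in range(len(depth1_frames) - 1):
        expected = [fingerprint(d) for d in depth1_frames[i]]
        actual = [fingerprint(d) for d in depth2_frames[i + 1]]
        if expected != actual:
            mismatches.append(i)
    return mismatches
```

With 538 captured frames per run the loop covers 537 pairs, which matches the pair count reported above; frame 0 of the depth=2 capture and the last frame of the depth=1 capture fall outside the comparison by construction.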

Test plan

Relationship to #22

This PR now subsumes #22 — it contains all the commits from #22 plus the depth-2 pipeline commit on top. Merging this PR makes #22 redundant; #22 can be closed once this lands. If you prefer to land the 151-FPS work first without depth-2, #22 is still the smaller PR and this one can be rebased on main afterward.

… stream sync reduction

Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine, example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes below + enabling the existing CUDA-graph cache:

Configuration	Avg FPS
Baseline (no changes)	93.07 FPS
+ Triton preprocess (fused resize+BGR2RGB+norm)	~93 FPS (U6)
+ U7 mask-decode skip for empty masks	~94 FPS (flag-gated)
+ Triton postprocess conf-filter	98.6 FPS (+5.9%)
+ ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True	102.1 FPS (+9.7%)
+ Drop pre/post stream syncs	102.2 FPS (+9.8%)

Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs cv2.resize rounding at mask boundaries). Unit tests pass (11/11).

Changes (all flag-gated, opt-in):

inference_models/models/rfdetr/triton_preprocess.py (new)
- One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA kernels with 1.
- Enabled via RFDETR_USE_TRITON_PREPROC=true.

inference_models/models/rfdetr/triton_postprocess.py (new)
- One Triton kernel fusing sigmoid + argmax-over-classes + class-remap + confidence-threshold filter. Replaces ~14k small cub/torch kernels with 431 (1 per frame).
- Supports both per-class threshold vector and scalar, with optional class remapping table.
- Enabled via RFDETR_TRITON_POSTPROC=true.

inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
- Wire the Triton preprocess fast-path in pre_process() with a guarded dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
- Cache pre-allocated input buffer and normalization constants on model instance on first call.
- Replace pre_process_stream.synchronize() with a CUDA event ev.wait() on the inference stream so the CPU doesn't stall waiting for the preprocessing Triton kernel to finish.
- Drop the post_process_stream.synchronize() (the adapter's subsequent .cpu() calls provide the implicit sync).

inference_models/models/rfdetr/common.py
- Wire the Triton postprocess conf-filter into post_process_instance_segmentation_results. Falls back to torch path when the model has no remapping table, is CPU-bound, or Triton is unavailable.

inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
- Same Triton preprocess kernel + dispatch for the legacy inference package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS default routes to inference_models adapters) but kept for parity so the legacy path benefits if exercised.

inference/core/models/inference_models_adapters.py
- GPU mask-decode fast-path (U7): reduce mask emptiness with .any(dim=(1,2)) on GPU, only DtoH + cv2.findContours non-empty masks. Gated via RFDETR_GPU_POSTPROCESS=true (default on). Produces identical output to the reference path.

Env vars introduced:
- RFDETR_USE_TRITON_PREPROC=true: opt-in; fused preproc kernel
- RFDETR_TRITON_POSTPROC=true: opt-in; fused postproc conf filter
- RFDETR_GPU_POSTPROCESS=true: default on; GPU mask emptiness skip
- RFDETR_DISABLE_GPU_PREPROC=true: opt-out; disable torch GPU preproc
- ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True: enables existing TRT CUDA graph cache

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… (W2)

Adds triton_fullpostproc.py with two fused Triton kernels that replace the entire post-TRT chain for the common rfdetr-seg-nano path (batch=1, no static crop, stretch-to resize, class remapping active):

- _rfdetr_fullpost_filter_kernel (grid = num_queries): sigmoid argmax + class remap + conf threshold + cxcywh->xyxy + letterbox-denormalize + clip + round; atomic_add into counter to reserve a compact output slot.
- _rfdetr_fullpost_mask_kernel_compact (grid = num_queries * tile_y * tile_x): GPU-side bilinear upsample 78x78 -> orig_h x orig_w + threshold > 0 + uint8 emit. Early-exits on s >= counter[0] without an intermediate sync.

Adapter (inference_models_adapters.py):
- New fast path keyed on _combined_gpu/_counter_gpu/_postproc_done_event side-channels. Adapter stream waits the done_event, pinned-DtoH's the 4-byte counter, syncs once to read n_survivors, then slices combined and mask to n_survivors and pinned-DtoH's both async, syncing again.
- Replaces the prior in-Triton int(counter.item()) that CPU-blocked the postproc stream. Same number of host-visible syncs (2), but the first is a 4-byte DtoH instead of a stream drain, and both are on a dedicated pinned path so the copy engine overlaps with the compute engine.

TRT graph plumbing (common/trt.py, rfdetr_instance_segmentation_trt.py):
- Records a produce_event on the graph's own stream so consumers can wait_event instead of stream.synchronize(). Removes the unconditional stream.synchronize() in infer_from_trt_engine's graph-replay branch.
- consumer_done_event field on TRTCudaGraphState lets the next graph replay chain on the consumer's last use of the output buffers.
- _trt_reuse_as_input_buffer marker so fast preproc can write directly into the graph's captured input buffer, eliminating the per-frame DtoD.

Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 engine):

Configuration	Avg FPS
v16 baseline (Triton preproc + postproc + CUDA graph)	150 FPS
+ triton_fullpost + deferred counter sync (this commit)	151 FPS

Parity vs v16 baseline: 0-diff across all 538 frames (bit-exact xyxy, conf, class_id, and mask MD5 per detection).

Env flags:
- RFDETR_TRITON_FULLPOSTPROC=true: opt-in; enables the full-fusion path

Two per-frame CUDA kernel launches visible in nsys on the v16 full-postproc path that shouldn't be there:
- direct_copy_kernel_cuda (538 per 538-frame run on vehicles_312px)
- vectorized_elementwise_kernel<FillFunctor<int>> (538 / 538)

direct_copy was class_mapping.to(dtype=torch.int32) firing every frame: upstream stores the mapping as int64, our Triton kernel needs int32, and the wrapper re-converts on every call since the dtype check always fails. Cache the converted view keyed by id(source_tensor).

FillFunctor was torch.zeros((1,), ...) for the atomic counter + torch.empty for the three output scratch buffers. Moving to a persistent scratch cache keyed on (num_queries, device) drops 3 torch.empty allocator calls per frame and replaces torch.zeros with an explicit counter.zero_(). This still launches FillFunctor, since there is no safe way to inline the reset into the filter kernel (concurrent blocks would race with the zero), but it eliminates allocator pressure and stabilizes pointer values for the Triton JIT cache.

After W7 the per-frame kernel launch count drops from 2 incidental-torch kernels to 1, the 3 allocator calls are eliminated, and the adapter sees stable-address scratch across frames (latent prerequisite for CUDA-graph capture of the postproc path).

Impact:
- direct_copy: 538 -> 0 (-100%)
- FillFunctor: 538 -> 538 (unchanged; counter.zero_ still required)
- torch.empty calls: 3/frame -> 0
- Parity: 0-diff vs v16 best across 538 frames of vehicles_312px.
- End-to-end FPS: 150 -> 151 (noise-level; serial CPU dispatch is the binding constraint, not mask kernel GPU time).

…312px)

Opt-in depth-2 ping-pong of Triton post-processing scratch buffers so the adapter can drain frame N-1's results while frame N's kernels are in flight. Gated behind RFDETR_PIPELINE_DEPTH=2; default 1 preserves the current single-slot behavior bit-exactly.

Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 TRT):
- depth=1 (default): 151 FPS (0-diff parity vs baseline)
- depth=2 (opt-in): 241 FPS (+59.5% vs depth=1, +121% vs 109 FPS raw baseline)

Parity on depth=2:
- Frame 0: empty response (warmup; no prior frame to drain)
- Frames 1..N: bit-exact match to frames 0..N-1 of the depth=1 capture (0 mismatches across 537 shifted pairs on vehicles_312px)
- Frame N+1: dropped (the last frame's detections live in the pending slot; no subsequent call drains them)

Semantic changes (opt-in only):

The InferencePipeline sink receives `on_prediction(predictions_N-1, video_frame_N)`, a 1-frame skew between detections and their paired image. At 241 FPS the skew is ~4 ms, imperceptible for annotation / dashboards / analytics. For strict frame-detection pairing, keep depth=1.

Files:

triton_fullpostproc.py
- _get_scratch_buffers + _get_mask_bin_buffer now keyed by (num_queries, device, slot_idx). slot_idx=0 is the legacy single slot; slot_idx in {0, 1} for depth=2.
- triton_rfdetr_fullpost accepts `num_slots` (default 1). When > 1 a module-level counter rotates slot assignment per call so the Triton kernels for frame N write to a different slot than the one the adapter is still reading for frame N-1.

common.py
- Reads RFDETR_PIPELINE_DEPTH and passes it as num_slots to triton_rfdetr_fullpost.

inference_models_adapters.py
- Factored the per-det response-build into _build_response_from_det.
- Depth-2 branch: push current det+meta as pending, drain and return previous pending. First call returns an empty response.
- Depth-1 path unchanged; passes the same _build_response_from_det helper so both paths produce bit-exact output for a given det.

The pipelining works because the Triton postproc kernels and their DtoH now run on their own ping-pong slot while the TRT engine + preproc for the next frame kick off on the default / preproc streams. The binding constraint before W1 was `cudaStreamSynchronize` = 80.9% of CUDA API time: CPU was idle waiting for GPU. depth=2 overlaps that wait with the next frame's GPU work.

claude and others added 6 commits April 28, 2026 22:33

initial point

419c5a7

replace benchmark script with minimal InferencePipeline-based version

807ea1a

aseembits93 changed the title ~~perf(rfdetr-seg): depth-2 frame pipeline (151 -> 241 FPS on vehicles_312px, +59%)~~ perf(rfdetr-seg): Triton fusion + CUDA graphs + depth-2 frame pipeline (109 → 241 FPS on vehicles_312px) Apr 30, 2026

fix for jetson

53a8ddf

aseembits93 wants to merge 7 commits into main from perf/rfdetr-seg-pipeline-depth2
