perf(rfdetr-seg): Triton fusion + CUDA graphs + scratch caching (109 → 151 FPS on vehicles_312px) by aseembits93 · Pull Request #22 · aseembits93/inference

aseembits93 · 2026-04-29T01:26:10Z

Summary

RF-DETR nano seg TensorRT e2e latency optimizations, benchmarked on vehicles_312px.mp4 (538 frames, 312×176) with the minimal InferencePipeline benchmark script.

Metric	Baseline	Current	Δ
Avg FPS (Tesla T4, FP16 TRT, vehicles_312px.mp4)	109	~151	+38%
Parity vs v16 best	—	0 diff across 538 frames	bit-exact
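
For reference, the headline numbers in the table translate into per-frame latency as follows. This is a small arithmetic sketch that uses only the two FPS figures reported above; the variable names are illustrative.

```python
# FPS figures taken from the summary table.
baseline_fps = 109.0
current_fps = 151.0

# Relative throughput gain.
speedup_pct = (current_fps - baseline_fps) / baseline_fps * 100.0

# Equivalent end-to-end time per frame, in milliseconds.
baseline_ms = 1000.0 / baseline_fps
current_ms = 1000.0 / current_fps
saved_ms = baseline_ms - current_ms

print(f"speedup: +{speedup_pct:.1f}%")
print(f"per-frame: {baseline_ms:.2f} ms -> {current_ms:.2f} ms (saves {saved_ms:.2f} ms)")
```

In other words, the +38% throughput gain corresponds to roughly 2.5 ms less wall-clock time per frame.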

All new paths are opt-in behind env vars; defaults preserve current behavior.

Benchmark:

python development/stream_interface/rfdetr_nano_seg_trt_workflow.py \
    --video_reference /home/ubuntu/inference/vehicles_312px.mp4

Flags to enable:

RFDETR_USE_TRITON_PREPROC=true
RFDETR_TRITON_FULLPOSTPROC=true
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true

Commit-by-commit

#	Commit	Summary
1	`419c5a7` initial point	Branch starting point.
2	`9ec6a50` fused Triton pre/postproc kernels, GPU mask decode, stream sync reduction	First landing: Triton preprocess kernel, Triton conf-filter postprocess, CUDA-graph enablement, drop pre/post stream syncs.
3	`807ea1a` replace benchmark script with minimal InferencePipeline-based version	Switch benchmark harness to a 116-line InferencePipeline driver so measurements match production code paths.
4	`5ff27d1` full Triton postproc fusion + deferred counter sync (W2)	Collapses the post-TRT chain into two Triton kernels (`_rfdetr_fullpost_filter_kernel` + `_rfdetr_fullpost_mask_kernel_compact`). Removes the CPU-blocking `counter.item()` in the Triton wrapper; counter is now pinned-DtoH'd from the adapter under a `torch.cuda.Event` guard. Includes atomic-counter compaction, `_trt_reuse_as_input_buffer` marker (eliminates per-frame DtoD from preproc output → graph input buffer), pinned async DtoH, cross-stream event plumbing (no more `stream.synchronize()` in the graph-replay branch).
5	`c1406a8` cache per-call scratch + class_mapping int32 view (W7)	Targets the two stray per-frame kernel launches that nsys flagged. `direct_copy_kernel_cuda` came from `class_mapping.to(dtype=torch.int32)` running every frame; the converted view is now cached by `id()`. The `FillFunctor` launch came from `torch.zeros` for the atomic counter; it is replaced by `counter.zero_()` on a persistent scratch cache keyed on `(num_queries, device)`, which also removes the three `torch.empty` allocator calls per frame. `counter.zero_()` still launches FillFunctor once per frame: the zeroing cannot safely be inlined into the filter kernel because concurrent blocks would race with it.

What's in

New Triton kernels

inference_models/models/rfdetr/triton_preprocess.py — fused stretch-to resize + BGR→RGB + /255 + ImageNet normalize. Replaces ~8 torch CUDA kernels with 1.
inference_models/models/rfdetr/triton_postprocess.py — fused sigmoid + argmax-over-classes + class-remap + confidence-threshold filter. Used when RFDETR_TRITON_POSTPROC=true alone.
inference_models/models/rfdetr/triton_fullpostproc.py (new in W2) — full post-process fusion:
- _rfdetr_fullpost_filter_kernel — per-query: sigmoid argmax + class remap + conf threshold + cxcywh→xyxy + letterbox-denormalize + clip + banker's rounding; tl.atomic_add(counter, 1) reserves a compact output slot.
- _rfdetr_fullpost_mask_kernel_compact — GPU-side bilinear upsample 78×78 → orig_h × orig_w + threshold > 0 + uint8 emit. Reads counter on GPU for early-exit so no CPU sync between the two launches.
- Returns unsliced buffers + counter + done_event; adapter does the host-side slice after a single pinned counter DtoH.
inference/models/rfdetr/triton_preprocess.py — same preprocess kernel wired into the legacy inference/models/rfdetr path for parity.
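
The per-query steps listed for `_rfdetr_fullpost_filter_kernel` can be sketched as a plain-Python CPU reference. This is not the Triton kernel, only an illustration of the same sequence. Several details are assumptions: boxes are taken to be normalized to [0, 1], denormalization is a plain scale by the original width and height, clipping is to `[0, orig_w]` and `[0, orig_h]`, and the threshold test is `conf >= threshold`. The function name and the output tuple layout are hypothetical.

```python
import math

def filter_queries(logits, boxes_cxcywh, class_mapping, conf_threshold, orig_w, orig_h):
    """CPU reference for the per-query filter steps (not the Triton kernel).

    logits:        per-query lists of class logits
    boxes_cxcywh:  per-query (cx, cy, w, h), assumed normalized to [0, 1]
    class_mapping: model class index -> output class id
    """
    survivors = []  # len(survivors) plays the role of the atomic counter
    for q_logits, (cx, cy, w, h) in zip(logits, boxes_cxcywh):
        # Sigmoid is monotonic, so argmax over logits equals argmax over scores.
        best = max(range(len(q_logits)), key=lambda c: q_logits[c])
        conf = 1.0 / (1.0 + math.exp(-q_logits[best]))
        if conf < conf_threshold:
            continue  # query does not reserve an output slot
        # cxcywh -> xyxy, then scale to original image size (assumed stretch-to).
        corners = (
            (cx - w / 2) * orig_w,
            (cy - h / 2) * orig_h,
            (cx + w / 2) * orig_w,
            (cy + h / 2) * orig_h,
        )
        limits = (orig_w, orig_h, orig_w, orig_h)
        # Clip, then round; Python's round() is round-half-to-even (banker's rounding).
        xyxy = tuple(round(min(max(v, 0.0), lim)) for v, lim in zip(corners, limits))
        survivors.append(xyxy + (conf, class_mapping[best]))
    return survivors

dets = filter_queries(
    logits=[[-5.0, 2.0], [-3.0, -4.0], [0.0, 3.0]],
    boxes_cxcywh=[(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1), (0.9, 0.9, 0.4, 0.4)],
    class_mapping=[7, 3],
    conf_threshold=0.5,
    orig_w=312,
    orig_h=176,
)
print(len(dets), dets[0][:4], dets[1][:4])
```

One difference worth keeping in mind: this loop emits survivors in query order, whereas slots reserved with `tl.atomic_add` on the GPU are handed out in whatever order the blocks reach the counter.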

Wiring / dispatch

inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
- Triton preprocess fast-path in pre_process() with strict guard (STRETCH_TO, numpy HWC BGR uint8, no static crop).
- Pre-allocated input buffer with _trt_reuse_as_input_buffer marker — graph capture bakes this tensor's address as the engine's input buffer, eliminating the per-frame 1.1 MB DtoD copy.
- Pinned host staging buffer (_fast_src_host_pinned) so copy_(non_blocking=True) actually runs async.
- Cross-stream event recording (_fast_preproc_event) instead of pre_process_stream.synchronize().
inference_models/models/rfdetr/common.py — dispatches RFDETR_TRITON_FULLPOSTPROC first (batch=1, no static crop, no nonsquare-intermediate resize, class remapping active), falls back to Triton conf-filter, then to torch.
inference_models/models/common/trt.py:
- TRTCudaGraphState.consumer_done_event lets the next graph replay wait on the consumer's last use of the output buffers (avoids DtoD clones of output buffers).
- _capture_cuda_graph(use_pre_processed_images_as_input_buffer=...) supports capturing with an externally-owned input tensor.
- Removed stream.synchronize() in infer_from_trt_engine's graph-replay branch; replaced with a produce_event recorded on the graph's own stream.
inference/core/models/inference_models_adapters.py — GPU fast-path for the _combined_gpu / _counter_gpu / _postproc_done_event side channels: waits the done_event, pinned-DtoH's the 4-byte counter, syncs once to read n_survivors, then slices combined[:n] + mask[:n] and pinned-DtoH's both async with a second sync. Bitcasts column 4 of the combined int32 buffer to fp32 via numpy.view(np.float32) for confidence.
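
The confidence bitcast mentioned for the adapter can be illustrated without NumPy. The sketch below reinterprets the 32 bits of an int32 as a float32 using the standard `struct` module, which is the same idea as `numpy.view(np.float32)` on a single element. Only "column 4 holds the confidence bits" comes from the description above; the rest of the row layout (xyxy in columns 0 to 3, class id in column 5) and both helper names are assumptions for illustration.

```python
import struct

def float32_to_int32_bits(value):
    """Pack a float as IEEE-754 float32 and read the same 4 bytes back as int32."""
    return struct.unpack("<i", struct.pack("<f", value))[0]

def int32_bits_to_float32(bits):
    """Reinterpret the 4 bytes of an int32 as an IEEE-754 float32 (no numeric conversion)."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]

# Hypothetical combined row: x1, y1, x2, y2, confidence bits, class id.
row = [125, 53, 187, 123, float32_to_int32_bits(0.875), 3]

confidence = int32_bits_to_float32(row[4])
print(hex(row[4]), confidence)
```

Storing the confidence this way keeps every field in one int32 buffer, so a single device-to-host copy carries boxes, scores and class ids together.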

Scratch caching (W7)

_SCRATCH_CACHE — combined/survivor_idx/mask_any/counter tensors reused across frames, keyed by (num_queries, device).
_CLASS_MAPPING_INT32_CACHE — int32 view of the class_mapping tensor cached by id(source_tensor); previously re-converted every frame because upstream stores it as int64.
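
A minimal sketch of the two caching patterns, using plain Python objects as stand-ins for CUDA tensors. The function names, the `alloc` and `convert` callbacks, and the buffer shapes are hypothetical. The sketch also keeps a reference to the source object next to the converted value, because `id()` values can be reused once an object is garbage-collected; whether the PR does the same is not stated here.

```python
from array import array

_scratch_cache = {}        # (num_queries, device) -> dict of reusable buffers
_mapping_int32_cache = {}  # id(source) -> (source, converted)

def get_scratch(num_queries, device, alloc):
    """Return the persistent scratch buffers for this key, allocating once."""
    key = (num_queries, device)
    buffers = _scratch_cache.get(key)
    if buffers is None:
        buffers = alloc(num_queries)
        _scratch_cache[key] = buffers
    return buffers

def get_mapping_int32(source, convert):
    """Return a cached converted copy of `source`, converting once per object."""
    key = id(source)
    entry = _mapping_int32_cache.get(key)
    # Holding `source` in the entry keeps it alive, so its id cannot be recycled.
    if entry is None or entry[0] is not source:
        entry = (source, convert(source))
        _mapping_int32_cache[key] = entry
    return entry[1]

def demo_alloc(num_queries):
    # Stand-in for the combined / survivor_idx / mask_any / counter tensors.
    return {
        "combined": array("i", [0] * (num_queries * 6)),
        "survivor_idx": array("i", [0] * num_queries),
        "mask_any": bytearray(num_queries),
        "counter": array("i", [0]),
    }

first = get_scratch(300, "cuda:0", demo_alloc)
second = get_scratch(300, "cuda:0", demo_alloc)
first["counter"][0] = 0  # per-frame reset, analogous to counter.zero_()

class_mapping = [7, 3, 11]
mapping_a = get_mapping_int32(class_mapping, lambda src: array("i", src))
mapping_b = get_mapping_int32(class_mapping, lambda src: array("i", src))
print(first is second, mapping_a is mapping_b)
```

Because the same buffers are returned on every call, their addresses stay stable across frames, which matches the stable-pointer property described for the scratch cache.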

Benchmark harness

development/stream_interface/rfdetr_nano_seg_trt_workflow.py — replaced with a minimal 116-line InferencePipeline-based driver. Uses the roboflow_core/roboflow_instance_segmentation_model@v3 workflow block; single video source, single sink, no annotators/buffering/rate-limiting. --backend flag pre-import sets DISABLED_INFERENCE_MODELS_BACKENDS for unambiguous backend attribution.

Env vars (all opt-in)

Var	Default	Effect
`RFDETR_USE_TRITON_PREPROC`	off	Enable fused Triton preprocess kernel
`RFDETR_TRITON_POSTPROC`	off	Enable Triton conf-filter (partial postproc fusion)
`RFDETR_TRITON_FULLPOSTPROC`	off	Enable full postproc fusion (W2)
`RFDETR_GPU_POSTPROCESS`	on	GPU mask-emptiness skip
`RFDETR_DISABLE_GPU_PREPROC`	off	Opt-out for torch GPU preprocess branch
`ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND`	off	Enables `TRTCudaGraphCache`
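
As an illustration of how these flags could drive the post-process dispatch, here is a small sketch. `env_flag` and `select_postproc_path` are hypothetical helpers, not functions from the repository, and the accepted spellings for "true" are an assumption. The fallback order (full fusion, then Triton conf-filter, then torch) follows the dispatch description for `common.py`; the sketch additionally assumes that the conf-filter fallback requires `RFDETR_TRITON_POSTPROC` to be set.

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings, matched case-insensitively

def env_flag(name, default=False):
    """Hypothetical helper: read a boolean environment variable."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY

def select_postproc_path(eligible_for_full_fusion, triton_available=True):
    """Pick a post-process path in the documented fallback order."""
    if not triton_available:
        return "torch"
    if env_flag("RFDETR_TRITON_FULLPOSTPROC") and eligible_for_full_fusion:
        return "full_fusion"
    if env_flag("RFDETR_TRITON_POSTPROC"):
        return "conf_filter"
    return "torch"

os.environ["RFDETR_TRITON_FULLPOSTPROC"] = "true"
os.environ["RFDETR_TRITON_POSTPROC"] = "True"
print(select_postproc_path(eligible_for_full_fusion=True))
print(select_postproc_path(eligible_for_full_fusion=False))
```

Here `eligible_for_full_fusion` stands for the guard conditions listed earlier (batch=1, no static crop, no nonsquare-intermediate resize, class remapping active).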

Test plan

pytest tests/inference/unit_tests/models/test_rfdetr.py -x -q — 11/11 pass
Benchmark on vehicles_312px.mp4 (538 frames, Tesla T4, FP16): ~151 FPS (baseline ~109)
Parity check on vehicles_312px — 0 diff across all 538 frames vs v16 best (bit-exact xyxy, confidence, class_id, per-detection mask MD5)
nsys: kernel count 170,530 → 2,931 (-98.3%) with CUDA graphs + full postproc fusion enabled; DtoD per-run 439 → 8 with _trt_reuse_as_input_buffer
Tracer confirms zero class_mapping.to() / direct_copy_kernel_cuda launches per frame after W7
Higher-resolution video (mask kernel compute scales with orig_h × orig_w)
Jetson Orin NX (pinned-memory + async-transfer wins should scale better)
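
The parity check in the test plan (xyxy, confidence, class_id, per-detection mask MD5) can be sketched as below. The detection dictionary layout and both function names are hypothetical; detections within a frame are compared as sorted fingerprints, on the assumption that output order need not match between the two runs.

```python
import hashlib

def detection_fingerprint(det):
    """Hashable summary of one detection: box, confidence, class id, mask MD5."""
    return (
        tuple(det["xyxy"]),
        det["confidence"],
        det["class_id"],
        hashlib.md5(det["mask"]).hexdigest(),
    )

def count_frame_diffs(reference_frames, candidate_frames):
    """Count frames whose detection sets are not bit-identical."""
    if len(reference_frames) != len(candidate_frames):
        raise ValueError("runs cover a different number of frames")
    diffs = 0
    for ref, cand in zip(reference_frames, candidate_frames):
        ref_fps = sorted(detection_fingerprint(d) for d in ref)
        cand_fps = sorted(detection_fingerprint(d) for d in cand)
        if ref_fps != cand_fps:
            diffs += 1
    return diffs

# Tiny synthetic example: two frames, one detection in the first, none in the second.
det = {"xyxy": (125, 53, 187, 123), "confidence": 0.875, "class_id": 3, "mask": bytes([0, 1, 1, 0])}
reference = [[det], []]
candidate = [[dict(det)], []]
print(count_frame_diffs(reference, candidate))
```

A result of 0 over all frames corresponds to the "0 diff" outcome reported above; any single flipped mask byte changes the MD5 and counts the frame as different.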

What closed out

Supersedes #23 (same work, earlier snapshot).

Known remaining bubbles

From nsys on vehicles_best:

cudaStreamSynchronize is 80.9% of CUDA API time — CPU waits for GPU, GPU is not saturated. The serial per-frame dispatch is the next bottleneck.
_rfdetr_fullpost_mask_kernel_compact is 46.5% of GPU time (30 µs/frame). Microbenchmark suggests this kernel can drop to ~35 µs with BLOCK_H=8, BLOCK_W=128, num_warps=2, but end-to-end FPS doesn't change while the CPU dispatch is serial — GPU time already fits inside the sync wait.
Two follow-ups expected to meaningfully move FPS: (a) a 2-deep frame pipeline that overlaps frame N+1's HtoD with frame N's TRT graph, and (b) moving cv2.findContours off the critical path.

… stream sync reduction

Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine, example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes below + enabling the existing CUDA-graph cache:

Baseline (no changes)	93.07 FPS
+ Triton preprocess (fused resize+BGR2RGB+norm)	~93 FPS (U6)
+ U7 mask-decode skip for empty masks	~94 FPS (flag-gated)
+ Triton postprocess conf-filter	98.6 FPS (+5.9%)
+ ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True	102.1 FPS (+9.7%)
+ Drop pre/post stream syncs	102.2 FPS (+9.8%)

Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs cv2.resize rounding at mask boundaries). Unit tests pass (11/11).

Changes (all flag-gated, opt-in):

inference_models/models/rfdetr/triton_preprocess.py (new)
One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.

inference_models/models/rfdetr/triton_postprocess.py (new)
One Triton kernel fusing sigmoid + argmax-over-classes + class-remap + confidence-threshold filter. Replaces ~14k small cub/torch kernels with 431 (1 per frame). Supports both per-class threshold vector and scalar, with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.

inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
- Wire the Triton preprocess fast-path in pre_process() with a guarded dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
- Cache pre-allocated input buffer and normalization constants on model instance on first call.
- Replace pre_process_stream.synchronize() with a CUDA event ev.wait() on the inference stream so the CPU doesn't stall waiting for the preprocessing Triton kernel to finish.
- Drop the post_process_stream.synchronize() (the adapter's subsequent .cpu() calls provide the implicit sync).

inference_models/models/rfdetr/common.py
Wire the Triton postprocess conf-filter into post_process_instance_segmentation_results. Falls back to torch path when the model has no remapping table, is CPU-bound, or Triton is unavailable.

inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
Same Triton preprocess kernel + dispatch for the legacy inference package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS default routes to inference_models adapters) but kept for parity so the legacy path benefits if exercised.

inference/core/models/inference_models_adapters.py
GPU mask-decode fast-path (U7): reduce mask emptiness with .any(dim=(1,2)) on GPU, only DtoH + cv2.findContours non-empty masks. Gated via RFDETR_GPU_POSTPROCESS=true (default on). Produces identical output to the reference path.

Env vars introduced:

RFDETR_USE_TRITON_PREPROC=true	opt-in; fused preproc kernel
RFDETR_TRITON_POSTPROC=true	opt-in; fused postproc conf filter
RFDETR_GPU_POSTPROCESS=true	default on; GPU mask emptiness skip
RFDETR_DISABLE_GPU_PREPROC=true	opt-out; disable torch GPU preproc
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True	enables existing TRT CUDA graph cache

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…on (+1.8% FPS)

Follow-up optimizations on the full-postproc fusion to eliminate the downstream plumbing kernels still visible in nsys after PR #23.

Results (5-run avg on Tesla T4, example_video, 431 frames):

Baseline	93.07 FPS
Partial fusion (PR #22)	102.74 FPS (+10.3%)
Full fusion v2 (PR #23 initial)	106.05 FPS (+13.9%)
Full fusion v3 (this commit)	107.94 FPS (+15.9%)

Changes:

1. `triton_fullpostproc.py`:
- Filter kernel: rounds + casts xyxy to int32 inside the kernel (replaces a downstream `.round().int()` elementwise kernel).
- Mask kernel: accumulates per-survivor `mask_any` via tile-level `tl.atomic_max`, eliminating a downstream `det.mask.any(dim=(1,2))` reduce. Writes to compact `(n_survivors, H, W)` output instead of padded `(num_queries, H, W)`; this removes the 20 ms/run `torch.zeros` fill kernel.
- Host wrapper: returns compact tensors directly (xyxy_int32, conf, cls_id, mask_bin uint8, mask_any bool), all sized to n_survivors. The single `keep.nonzero()` remains as the only DtoH sync in the fused path.

2. `common.py`:
- Build `InstanceDetections` with compact tensors directly; skip the `xyxy[keep].round().int()` / `.bool()` / `[keep]` indexing chain that added 5 kernels per frame.
- Attach precomputed `mask_any` onto the detections object via `__dict__` so the adapter can use it without recomputation.

3. `inference_models_adapters.py`:
- Use `getattr(det, "mask_any", None)` from the full-fusion path when available; fall back to `.any(dim=(1,2))` otherwise.

nsys kernel count shrinkage (per 431-frame run):

	total_k	legacy_pp	pp_ms
FULL v2	12,315	10,900	60ms
FULL v3	7,026	4,669	17ms

(-43% total, -57% pp, -72% pp_ms)

Specifically eliminated:
- DeviceSelectSweep/Reduce/CompactInit triple: 2155 -> 431 each (5 nonzero() calls per frame collapsed to 1).
- FillFunctor<uint8>: 431 -> 0 (no more padded-output torch.zeros).
- or_kernel_cuda reduce: 314 -> 0 (mask_any precomputed by Triton).
- round_kernel_cuda: 314 -> 0 (xyxy rounded inside Triton kernel).
- direct_copy_kernel: 1804 -> ~2746 total (slight increase due to index_select on scalar fields, but they're tiny).

Parity: 4/431 frames differ by +/-1 detection vs baseline; these are the same 4 frames as PR #22 and PR #23 v2, with no new divergence. Unit tests pass (11/11).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


… (W2)

Adds triton_fullpostproc.py with two fused Triton kernels that replace the entire post-TRT chain for the common rfdetr-seg-nano path (batch=1, no static crop, stretch-to resize, class remapping active):

_rfdetr_fullpost_filter_kernel (grid = num_queries)
sigmoid argmax + class remap + conf threshold + cxcywh->xyxy + letterbox-denormalize + clip + round; atomic_add into counter to reserve a compact output slot.

_rfdetr_fullpost_mask_kernel_compact (grid = num_queries * tile_y * tile_x)
GPU-side bilinear upsample 78x78 -> orig_h x orig_w + threshold > 0 + uint8 emit. Early-exits on s >= counter[0] without an intermediate sync.

Adapter (inference_models_adapters.py):
- New fast path keyed on _combined_gpu/_counter_gpu/_postproc_done_event side-channels. Adapter stream waits the done_event, pinned-DtoH's the 4-byte counter, syncs once to read n_survivors, then slices combined and mask to n_survivors and pinned-DtoH's both async, syncing again.
- Replaces the prior in-Triton int(counter.item()) that CPU-blocked the postproc stream. Same number of host-visible syncs (2), but the first is a 4-byte DtoH instead of a stream drain, and both are on a dedicated pinned path so the copy engine overlaps with the compute engine.

TRT graph plumbing (common/trt.py, rfdetr_instance_segmentation_trt.py):
- Records a produce_event on the graph's own stream so consumers can wait_event instead of stream.synchronize(). Removes the unconditional stream.synchronize() in infer_from_trt_engine's graph-replay branch.
- consumer_done_event field on TRTCudaGraphState lets the next graph replay chain on the consumer's last use of the output buffers.
- _trt_reuse_as_input_buffer marker so fast preproc can write directly into the graph's captured input buffer, eliminating the per-frame DtoD.

Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 engine):

v16 baseline (Triton preproc + postproc + CUDA graph)	150 FPS
+ triton_fullpost + deferred counter sync (this commit)	151 FPS

Parity vs v16 baseline: 0-diff across all 538 frames (bit-exact xyxy, conf, class_id, and mask MD5 per detection).

Env flags:

RFDETR_TRITON_FULLPOSTPROC=true	opt-in; enables the full-fusion path

Two per-frame CUDA kernel launches visible in nsys on the v16 full-postproc path that shouldn't be there:
- direct_copy_kernel_cuda (538 per 538-frame run on vehicles_312px)
- vectorized_elementwise_kernel<FillFunctor<int>> (538 / 538)

direct_copy was class_mapping.to(dtype=torch.int32) firing every frame: upstream stores the mapping as int64, our Triton kernel needs int32, and the wrapper re-converts on every call since the dtype check always fails. Cache the converted view keyed by id(source_tensor).

FillFunctor was torch.zeros((1,), ...) for the atomic counter + torch.empty for the three output scratch buffers. Moving to a persistent scratch cache keyed on (num_queries, device) drops 3 torch.empty allocator calls per frame and replaces torch.zeros with an explicit counter.zero_(). This still launches FillFunctor, since there is no safe way to inline the zeroing into the filter kernel (concurrent blocks would race with the zero), but it eliminates allocator pressure and stabilizes pointer values for the Triton JIT cache.

After W7 the per-frame kernel launch count drops from 2 incidental-torch kernels to 1, the 3 allocator calls are eliminated, and the adapter sees stable-address scratch across frames (latent prerequisite for CUDA-graph capture of the postproc path).

Impact:
- direct_copy: 538 -> 0 (-100%)
- FillFunctor: 538 -> 538 (unchanged; counter.zero_ still required)
- torch.empty calls: 3/frame -> 0
- Parity: 0-diff vs v16 best across 538 frames of vehicles_312px.
- End-to-end FPS: 150 -> 151 (noise-level; serial CPU dispatch is the binding constraint, not mask kernel GPU time).

claude added 2 commits April 28, 2026 22:33

initial point

419c5a7

aseembits93 mentioned this pull request Apr 29, 2026

perf(rfdetr-seg): Triton fusion + pipelining cuts post/pre-proc bubbles (93 → 122 FPS, +31.5%) #23

Closed

6 tasks

aseembits93 added 3 commits April 29, 2026 18:26

replace benchmark script with minimal InferencePipeline-based version

807ea1a

aseembits93 changed the title ~~perf(rfdetr-seg): fused Triton pre/postproc kernels, GPU mask decode, sync reduction~~ perf(rfdetr-seg): Triton fusion + CUDA graphs + scratch caching (109 → 151 FPS on vehicles_312px) Apr 29, 2026

aseembits93 wants to merge 5 commits into main from optimize-rfdetr-seg

2 participants
