perf(rfdetr-seg): Triton fusion + CUDA graphs + depth-2 frame pipeline (109 → 241 FPS on vehicles_312px) #24

Open
aseembits93 wants to merge 7 commits into main from perf/rfdetr-seg-pipeline-depth2
@aseembits93 aseembits93 commented Apr 29, 2026

Owner

Summary

Standalone PR: RF-DETR nano seg TensorRT e2e latency optimizations through to a depth-2 frame pipeline, benchmarked on vehicles_312px.mp4 (538 frames, 312×176) with the minimal InferencePipeline benchmark script.

| Metric | Baseline | depth=1 (default after this PR) | depth=2 (opt-in via RFDETR_PIPELINE_DEPTH=2) |
| --- | --- | --- | --- |
| Avg FPS (Tesla T4, FP16 TRT) | 109 | 151 (+38%) | 241 (+121%) |
| Parity vs v16 best | | 0 diff across 538 frames | 0 diff on 537 shifted pairs |

All new paths are opt-in behind env vars; defaults preserve current behavior (except for the two always-on CPU-side fixes noted below).

Benchmark:

python development/stream_interface/rfdetr_nano_seg_trt_workflow.py \
    --video_reference /home/ubuntu/inference/vehicles_312px.mp4

Flags to enable (depth=1, matches 151 FPS row):

  • RFDETR_USE_TRITON_PREPROC=true
  • RFDETR_TRITON_FULLPOSTPROC=true
  • ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true

Flags to enable (depth=2, matches 241 FPS row): above three plus RFDETR_PIPELINE_DEPTH=2.
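A minimal sketch of how the two flag sets could be assembled for a benchmark run. The helper name `benchmark_env` is hypothetical and not part of this PR; the flag names are the ones listed above.

```python
import os

# Flags for the depth=1 (151 FPS) row, as listed above.
DEPTH1_FLAGS = {
    "RFDETR_USE_TRITON_PREPROC": "true",
    "RFDETR_TRITON_FULLPOSTPROC": "true",
    "ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND": "true",
}


def benchmark_env(depth: int = 1) -> dict:
    """Return a copy of the current environment with the benchmark flags set.

    Hypothetical helper for illustration only.
    """
    if depth not in (1, 2):
        raise ValueError("depth must be 1 or 2")
    env = dict(os.environ)
    env.update(DEPTH1_FLAGS)
    if depth == 2:
        env["RFDETR_PIPELINE_DEPTH"] = "2"
    else:
        # Leave the variable unset so the default (depth 1) applies.
        env.pop("RFDETR_PIPELINE_DEPTH", None)
    return env
```

The returned dict can be passed as the `env` argument of `subprocess.run` when launching the benchmark script shown above.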


Commit-by-commit

| # | Commit | Summary |
| --- | --- | --- |
| 1 | 419c5a7 initial point | Branch starting point. |
| 2 | 9ec6a50 fused Triton pre/postproc kernels, GPU mask decode, stream sync reduction | First landing: Triton preprocess kernel, Triton conf-filter postprocess, CUDA-graph enablement, drop pre/post stream syncs. |
| 3 | 807ea1a replace benchmark script with minimal InferencePipeline-based version | Switch benchmark harness to a 116-line InferencePipeline driver so measurements match production code paths. |
| 4 | 5ff27d1 full Triton postproc fusion + deferred counter sync (W2) | Collapses the post-TRT chain into two Triton kernels (_rfdetr_fullpost_filter_kernel + _rfdetr_fullpost_mask_kernel_compact). Removes the CPU-blocking counter.item() in the Triton wrapper; counter is now pinned-DtoH'd from the adapter under a torch.cuda.Event guard. Includes atomic-counter compaction, _trt_reuse_as_input_buffer marker (eliminates per-frame DtoD from preproc output → graph input buffer), pinned async DtoH, cross-stream event plumbing (no more stream.synchronize() in the graph-replay branch). |
| 5 | c1406a8 cache per-call scratch + class_mapping int32 view (W7) | Eliminates two stray per-frame kernel launches that nsys flagged: direct_copy_kernel_cuda (from class_mapping.to(dtype=torch.int32) every frame; now cached by id()) and the three torch.empty allocator calls per frame (now a persistent scratch cache keyed on (num_queries, device)). |
| 6 | 1b169b7 depth-2 frame pipeline (151 → 241 FPS) | Opt-in ping-pong of Triton postproc scratch slots so the adapter drains frame N-1's results while frame N's kernels are launching. Gated behind RFDETR_PIPELINE_DEPTH=2; default 1 is bit-exact-unchanged. |

What's in

New Triton kernels

  • inference_models/models/rfdetr/triton_preprocess.py — fused stretch-to resize + BGR→RGB + /255 + ImageNet normalize. Replaces ~8 torch CUDA kernels with 1.
  • inference_models/models/rfdetr/triton_postprocess.py — fused sigmoid + argmax-over-classes + class-remap + confidence-threshold filter. Used when RFDETR_TRITON_POSTPROC=true alone.
  • inference_models/models/rfdetr/triton_fullpostproc.py — full post-process fusion:
    • _rfdetr_fullpost_filter_kernel — per-query: sigmoid argmax + class remap + conf threshold + cxcywh→xyxy + letterbox-denormalize + clip + banker's rounding; tl.atomic_add(counter, 1) reserves a compact output slot.
    • _rfdetr_fullpost_mask_kernel_compact — GPU-side bilinear upsample 78×78 → orig_h × orig_w + threshold > 0 + uint8 emit. Reads counter on GPU for early-exit so no CPU sync between the two launches.
    • Returns unsliced buffers + counter + done_event; adapter does the host-side slice after a single pinned counter DtoH.
    • depth-2 extension: _get_scratch_buffers / _get_mask_bin_buffer keyed by (num_queries, device, slot_idx); triton_rfdetr_fullpost accepts num_slots (default 1 = legacy single-slot).
  • inference/models/rfdetr/triton_preprocess.py — same preprocess kernel wired into the legacy inference/models/rfdetr path for parity.
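For readers who want the per-query math of the filter kernel in one place, here is a serial pure-Python reference sketch. It is not the Triton kernel. The function names, the strict `>` threshold comparison, the row layout (x1, y1, x2, y2, conf, class_id), and scaling by the original frame size are assumptions made for illustration. On the GPU the compact slot comes from an atomic add, so slot order is not guaranteed; this serial version simply preserves query order.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def clip(value: float, upper: float) -> float:
    """Clamp value to the range [0, upper]."""
    return min(max(value, 0.0), float(upper))


def filter_queries(logits, boxes_cxcywh, class_mapping, threshold, orig_w, orig_h):
    """Serial reference for the per-query filter step (illustrative only).

    logits:        one list of class logits per query
    boxes_cxcywh:  one normalized (cx, cy, w, h) tuple per query
    class_mapping: list mapping model class index -> output class id
    Returns compacted rows (x1, y1, x2, y2, conf, class_id).
    """
    rows = []  # len(rows) stands in for the atomic counter
    for q_logits, (cx, cy, w, h) in zip(logits, boxes_cxcywh):
        # sigmoid is monotonic, so argmax over logits equals argmax over scores
        best = max(range(len(q_logits)), key=lambda c: q_logits[c])
        conf = sigmoid(q_logits[best])
        if not conf > threshold:  # assumption: strict comparison
            continue
        # cxcywh -> xyxy, scaled to the original frame (assumes stretch-to resize)
        x1 = clip((cx - w / 2) * orig_w, orig_w)
        y1 = clip((cy - h / 2) * orig_h, orig_h)
        x2 = clip((cx + w / 2) * orig_w, orig_w)
        y2 = clip((cy + h / 2) * orig_h, orig_h)
        # Python's round() rounds half to even (banker's rounding)
        rows.append((round(x1), round(y1), round(x2), round(y2),
                     conf, class_mapping[best]))
    return rows
```

A reference like this is mainly useful as a parity oracle in unit tests: run it on a handful of queries and compare against the fused kernel's output after sorting both by box coordinates.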

Wiring / dispatch

  • inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
    • Triton preprocess fast-path in pre_process() with strict guard (STRETCH_TO, numpy HWC BGR uint8, no static crop).
    • Pre-allocated input buffer with _trt_reuse_as_input_buffer marker — graph capture bakes this tensor's address as the engine's input buffer, eliminating the per-frame 1.1 MB DtoD copy.
    • Pinned host staging buffer (_fast_src_host_pinned) so copy_(non_blocking=True) actually runs async.
    • Cross-stream event recording (_fast_preproc_event) instead of pre_process_stream.synchronize().
  • inference_models/models/rfdetr/common.py — dispatches RFDETR_TRITON_FULLPOSTPROC first (batch=1, no static crop, no nonsquare-intermediate resize, class remapping active), falls back to Triton conf-filter, then to torch. Reads RFDETR_PIPELINE_DEPTH and passes num_slots=RFDETR_PIPELINE_DEPTH to triton_rfdetr_fullpost.
  • inference_models/models/common/trt.py:
    • TRTCudaGraphState.consumer_done_event lets the next graph replay wait on the consumer's last use of the output buffers (avoids DtoD clones of output buffers).
    • _capture_cuda_graph(use_pre_processed_images_as_input_buffer=...) supports capturing with an externally-owned input tensor.
    • Removed stream.synchronize() in infer_from_trt_engine's graph-replay branch; replaced with a produce_event recorded on the graph's own stream.
  • inference/core/models/inference_models_adapters.py
    • GPU fast-path for the _combined_gpu / _counter_gpu / _postproc_done_event side channels: waits on the done_event, pinned-DtoH's the 4-byte counter, syncs once to read n_survivors, then slices combined[:n] + mask[:n] and pinned-DtoH's both async with a second sync. Bitcasts column 4 of the combined int32 buffer to fp32 via numpy.view(np.float32) for confidence.
    • Factored per-det response-build into _build_response_from_det (used by both depth=1 and depth=2 code paths for bit-exact responses).
    • depth=2 branch: saves (det, H, W, class_filter) as _pipeline_pending so the scratch slot stays alive; drains the previous pending (waits on its done_event, pinned-DtoH's counter+combined+mask, builds response, returns frame N-1's response). First call returns an empty response.
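The column-4 bitcast mentioned above reinterprets the same 4 bytes as a float instead of an integer; no numeric conversion takes place. The adapter does this with numpy.view(np.float32). The sketch below shows the same reinterpretation with only the standard-library `struct` module. The six-column row layout (class id in column 5) and the helper names are assumptions for illustration.

```python
import struct

# Assumed row layout: x1, y1, x2, y2 as int32; conf as float32; class_id as int32.
ROW_FORMAT = "<4ifi"


def pack_row(x1, y1, x2, y2, conf, class_id) -> bytes:
    """Write one row the way a kernel storing fp32 bits into an int32 buffer would."""
    return struct.pack(ROW_FORMAT, x1, y1, x2, y2, conf, class_id)


def read_row_as_int32(row: bytes):
    """Read all six 4-byte fields as int32; column 4 now holds raw float bits."""
    return struct.unpack("<6i", row)


def bits_to_float32(bits: int) -> float:
    """Reinterpret an int32 bit pattern as a float32 (a bitcast, not a cast)."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


row = pack_row(68, 153, 105, 176, 0.5, 2)
ints = read_row_as_int32(row)
confidence = bits_to_float32(ints[4])
```

Storing everything in one int32 buffer means a single device-to-host copy carries boxes, confidence, and class id together, at the cost of this one reinterpretation on the host.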

Scratch caching (W7)

  • _SCRATCH_CACHE — combined/survivor_idx/mask_any/counter tensors reused across frames, keyed by (num_queries, device) (+ slot_idx for depth=2).
  • _CLASS_MAPPING_INT32_CACHE — int32 view of the class_mapping tensor cached by id(source_tensor); previously re-converted every frame because upstream stores it as int64.
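The two caches can be pictured with plain Python containers standing in for GPU tensors. This is a sketch of the caching pattern only, not the PR's code: the function names and buffer shapes are assumptions, and lists and arrays replace torch tensors. One design point is worth noting. In CPython, `id()` values can be reused once the original object is freed, so this sketch stores a reference to the source alongside the converted copy and checks identity on lookup.

```python
from array import array

_SCRATCH_CACHE = {}
_CLASS_MAPPING_INT32_CACHE = {}


def get_scratch(num_queries: int, device: str, slot_idx: int = 0) -> dict:
    """Return persistent scratch buffers for this key, allocating on first use."""
    key = (num_queries, device, slot_idx)
    bufs = _SCRATCH_CACHE.get(key)
    if bufs is None:
        bufs = {
            "combined": [0] * (num_queries * 6),   # stand-in for the int32 output rows
            "survivor_idx": [0] * num_queries,
            "mask_any": bytearray(num_queries),
            "counter": [0],
        }
        _SCRATCH_CACHE[key] = bufs
    bufs["counter"][0] = 0  # stand-in for counter.zero_() before each launch
    return bufs


def get_int32_mapping(source):
    """Return a cached int32 copy of `source`, converting only on first sight."""
    hit = _CLASS_MAPPING_INT32_CACHE.get(id(source))
    if hit is not None and hit[0] is source:  # guard against id() reuse
        return hit[1]
    converted = array("i", source)
    _CLASS_MAPPING_INT32_CACHE[id(source)] = (source, converted)
    return converted
```

Because the same buffer objects come back on every call, their addresses stay stable across frames, which is the property the description relies on for pointer-stable launches.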

Benchmark harness

  • development/stream_interface/rfdetr_nano_seg_trt_workflow.py — replaced with a minimal 116-line InferencePipeline-based driver. Uses the roboflow_core/roboflow_instance_segmentation_model@v3 workflow block; single video source, single sink, no annotators/buffering/rate-limiting. --backend flag pre-import sets DISABLED_INFERENCE_MODELS_BACKENDS for unambiguous backend attribution.

Env vars (opt-in except where noted)

| Var | Default | Effect |
| --- | --- | --- |
| RFDETR_USE_TRITON_PREPROC | off | Enable fused Triton preprocess kernel |
| RFDETR_TRITON_POSTPROC | off | Enable Triton conf-filter (partial postproc fusion) |
| RFDETR_TRITON_FULLPOSTPROC | off | Enable full postproc fusion (W2) |
| RFDETR_GPU_POSTPROCESS | on | GPU mask-emptiness skip |
| RFDETR_DISABLE_GPU_PREPROC | off | Opt-out for torch GPU preprocess branch |
| ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND | off | Enables TRTCudaGraphCache |
| RFDETR_PIPELINE_DEPTH | 1 | Set to 2 to enable depth-2 frame pipeline (see caveats below) |

How depth-2 works

Before depth-2, the inference thread was serial per frame:

HtoD -> TRT -> Triton postproc -> counter DtoH (sync) -> mask DtoH (sync) -> build response -> return

cudaStreamSynchronize was 80.9% of CUDA API time per nsys — the CPU sat idle waiting for GPU work to complete, then blocked on DtoH.

Depth-2 ping-pongs the Triton postproc scratch buffers between two slots so the adapter can drain frame N-1's results while frame N's kernels are launching into the other slot. The cudaStreamSynchronize cost is now overlapped with the next frame's GPU work instead of blocking the workflow thread.

Per-frame flow (depth=2):

  1. Frame N's Triton kernels write to scratch slot N % 2.
  2. Adapter's postprocess() receives frame N's detections.
  3. Instead of draining frame N's scratch, it:
    • Saves (det, H, W, class_filter) as the new _pipeline_pending — the refs keep slot N % 2 alive.
    • Drains the previous pending (slot (N-1) % 2): waits on its done_event, pinned-DtoH's counter+combined+mask, builds response.
    • Returns frame N-1's response.
  4. First call (frame 0): no previous pending → returns an empty response.

The scratch slot that frame N writes is different from the slot frame N-1 is reading, so the two frames' work genuinely runs in parallel on the GPU.
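The per-frame flow above can be simulated without a GPU to make the one-frame shift concrete. This is a toy model under stated assumptions: the class name `Depth2Adapter` and the dict-shaped `det` are hypothetical, and "draining" is reduced to returning the stored payload instead of waiting on an event and copying from device memory.

```python
class Depth2Adapter:
    """Toy model of the depth-2 hand-off: store frame N, return frame N-1."""

    def __init__(self):
        self._pipeline_pending = None
        self._frame_index = 0

    def postprocess(self, payload):
        # Frame N writes to scratch slot N % 2.
        det = {
            "frame": self._frame_index,
            "slot": self._frame_index % 2,
            "payload": payload,
        }
        self._frame_index += 1

        # Save the new pending entry and take the previous one to drain.
        previous, self._pipeline_pending = self._pipeline_pending, det
        if previous is None:
            return []  # first call: nothing to drain yet

        # The slot being read always differs from the slot being written.
        assert previous["slot"] != det["slot"]
        return [(previous["frame"], previous["payload"])]


adapter = Depth2Adapter()
outputs = [adapter.postprocess(p) for p in ["a", "b", "c", "d"]]
```

After four calls, `outputs` holds an empty response followed by the results for frames 0 to 2, while frame 3 is still sitting in `_pipeline_pending`. That matches the empty-first-frame and dropped-last-frame behavior described in the caveats.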

Caveats (depth=2 only, clearly gated)

| Aspect | depth=1 | depth=2 |
| --- | --- | --- |
| Response-to-frame binding | exact | shifted by 1 (response_N-1 paired with video_frame_N in on_prediction) |
| Frame 0 | normal response | empty response |
| Final frame | emitted | dropped (no subsequent call to drain its pending slot) |
| Per-frame wall time on workflow thread | ~6.6 ms | ~4.1 ms |

At 241 FPS the 1-frame skew is ~4 ms between a detection's underlying pixels and the video_frame it's paired with in the sink callback. Imperceptible for annotation rendering, dashboards, downstream analytics. For strict frame-detection pairing (e.g. training label generation), keep depth=1.

The final-frame drop is a real loss (one Prediction per stream). If that matters, an explicit flush() method could be added on adapter teardown — not plumbed in this PR because InferencePipeline doesn't currently call into the model on shutdown.
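A rough sketch of what such a flush could look like, assuming the adapter keeps its pending entry in a single attribute. Everything here is hypothetical: the class name, the `build_response` callback, and the `flush()` method itself are not part of this PR, and a real version would also have to wait on the pending slot's done_event before reading it.

```python
class PipelinedAdapter:
    """Toy depth-2 adapter with a hypothetical flush() for teardown."""

    def __init__(self, build_response):
        self._build_response = build_response
        self._pipeline_pending = None

    def postprocess(self, det):
        # Store the current detections and return the previous frame's response.
        previous, self._pipeline_pending = self._pipeline_pending, det
        return [] if previous is None else self._build_response(previous)

    def flush(self):
        # Drain whatever is still pending; safe to call more than once.
        previous, self._pipeline_pending = self._pipeline_pending, None
        return [] if previous is None else self._build_response(previous)


adapter = PipelinedAdapter(build_response=lambda det: [det.upper()])
responses = [adapter.postprocess(d) for d in ["a", "b"]]
last = adapter.flush()
```

Making `flush()` idempotent keeps it safe to call from more than one shutdown path.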


Parity verification

  • depth=1 (default): 0 diff across all 538 frames vs v16 best (bit-exact xyxy, confidence, class_id, per-detection mask MD5).
  • depth=2:
    • Frame 0: empty response — matches "warmup slot" design.
    • Frames 1..537: bit-exact match to frames 0..536 of the depth=1 capture. Zero mismatches across 537 shifted pairs.
    • Frame 537 of depth=1 (1 detection, c2:0.9024:68,153,105,176): dropped in depth=2 (stream ended before next call could drain its pending slot).
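The shifted-pair comparison can be expressed as a small checker. This is a sketch, not the script used for the numbers above: the function name is hypothetical, and each per-frame response is assumed to be any value that supports equality comparison (for example a list of detection strings).

```python
def check_shifted_parity(depth1, depth2):
    """Compare depth=2 frame n against depth=1 frame n-1.

    depth1, depth2: equal-length lists of per-frame responses.
    Returns a summary dict; `mismatches` lists the depth=2 frame indices that differ.
    """
    if len(depth1) != len(depth2):
        raise ValueError("captures must cover the same number of frames")
    mismatches = [
        n for n in range(1, len(depth2)) if depth2[n] != depth1[n - 1]
    ]
    return {
        "frame0_empty": len(depth2) > 0 and not depth2[0],
        "shifted_pairs": max(len(depth2) - 1, 0),
        "mismatches": mismatches,
        "dropped": depth1[-1] if depth1 else None,
    }


report = check_shifted_parity(
    depth1=[["a"], ["b"], ["c"]],
    depth2=[[], ["a"], ["b"]],
)
```

For a 538-frame capture this yields 537 shifted pairs, with the final depth=1 response reported under `dropped`.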

Test plan

  • pytest tests/inference/unit_tests/models/test_rfdetr.py -x -q — 11/11 pass
  • depth=1 benchmark: 151 FPS (baseline 109). No regression.
  • depth=1 parity: 0 diff across 538 frames vs baseline.
  • depth=2 benchmark: 241 / 240 / 242 FPS across 3 runs (stable).
  • depth=2 parity: 0 mismatches on 537 shifted pairs. Empty frame 0 + dropped final frame confirmed as designed.
  • nsys: kernel count 170,530 → 2,931 (-98.3%) with CUDA graphs + full postproc fusion enabled; DtoD per-run 439 → 8 with _trt_reuse_as_input_buffer.
  • Tracer confirms zero class_mapping.to() / direct_copy_kernel_cuda launches per frame after W7.
  • Higher-resolution video (mask kernel compute scales with orig_h × orig_w; pipelining win should scale with DtoH size).
  • Multi-batch workflow (depth=2 currently gated to len(detections_list) == 1 — larger batches fall through to depth=1).
  • Jetson Orin NX (pinned-memory + async-transfer + pipelining wins should scale better).

Relationship to #22

This PR now subsumes #22 — it contains all the commits from #22 plus the depth-2 pipeline commit on top. Merging this PR makes #22 redundant; #22 can be closed once this lands. If you prefer to land the 151-FPS work first without depth-2, #22 is still the smaller PR and this one can be rebased on main afterward.

claude and others added 6 commits April 28, 2026 22:33
… stream sync reduction

Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine,
example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes
below + enabling the existing CUDA-graph cache:

  Baseline (no changes)                            93.07 FPS
  + Triton preprocess (fused resize+BGR2RGB+norm)  ~93 FPS   (U6)
  + U7 mask-decode skip for empty masks            ~94 FPS   (flag-gated)
  + Triton postprocess conf-filter                  98.6 FPS (+5.9%)
  + ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True  102.1 FPS  (+9.7%)
  + Drop pre/post stream syncs                    102.2 FPS  (+9.8%)

Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs
cv2.resize rounding at mask boundaries). Unit tests pass (11/11).

Changes (all flag-gated, opt-in):

inference_models/models/rfdetr/triton_preprocess.py (new)
  One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet
  normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA
  kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.

inference_models/models/rfdetr/triton_postprocess.py (new)
  One Triton kernel fusing sigmoid + argmax-over-classes + class-remap +
  confidence-threshold filter. Replaces ~14k small cub/torch kernels with
  431 (1 per frame). Supports both per-class threshold vector and scalar,
  with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.

inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
  - Wire the Triton preprocess fast-path in pre_process() with a guarded
    dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
  - Cache pre-allocated input buffer and normalization constants on model
    instance on first call.
  - Replace pre_process_stream.synchronize() with a CUDA event ev.wait()
    on the inference stream so the CPU doesn't stall waiting for the
    preprocessing Triton kernel to finish.
  - Drop the post_process_stream.synchronize() (the adapter's subsequent
    .cpu() calls provide the implicit sync).

inference_models/models/rfdetr/common.py
  Wire the Triton postprocess conf-filter into
  post_process_instance_segmentation_results. Falls back to torch path
  when the model has no remapping table, is CPU-bound, or Triton is
  unavailable.

inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
  Same Triton preprocess kernel + dispatch for the legacy inference
  package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS
  default routes to inference_models adapters) but kept for parity so the
  legacy path benefits if exercised.

inference/core/models/inference_models_adapters.py
  GPU mask-decode fast-path (U7): reduce mask emptiness with .any(dim=(1,2))
  on GPU, only DtoH + cv2.findContours non-empty masks. Gated via
  RFDETR_GPU_POSTPROCESS=true (default on). Produces identical output to
  the reference path.

Env vars introduced:
  RFDETR_USE_TRITON_PREPROC=true         opt-in; fused preproc kernel
  RFDETR_TRITON_POSTPROC=true            opt-in; fused postproc conf filter
  RFDETR_GPU_POSTPROCESS=true            default on; GPU mask emptiness skip
  RFDETR_DISABLE_GPU_PREPROC=true        opt-out; disable torch GPU preproc
  ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True   enables existing TRT CUDA graph cache

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (W2)

Adds triton_fullpostproc.py with two fused Triton kernels that replace the
entire post-TRT chain for the common rfdetr-seg-nano path (batch=1, no
static crop, stretch-to resize, class remapping active):

  _rfdetr_fullpost_filter_kernel  (grid = num_queries)
    sigmoid argmax + class remap + conf threshold + cxcywh->xyxy +
    letterbox-denormalize + clip + round; atomic_add into counter to reserve
    a compact output slot.

  _rfdetr_fullpost_mask_kernel_compact  (grid = num_queries * tile_y * tile_x)
    GPU-side bilinear upsample 78x78 -> orig_h x orig_w + threshold > 0 +
    uint8 emit. Early-exits on s >= counter[0] without an intermediate sync.

Adapter (inference_models_adapters.py):

  - New fast path keyed on _combined_gpu/_counter_gpu/_postproc_done_event
    side-channels. Adapter stream waits the done_event, pinned-DtoH's the
    4-byte counter, syncs once to read n_survivors, then slices combined
    and mask to n_survivors and pinned-DtoH's both async, syncing again.

  - Replaces the prior in-Triton int(counter.item()) that CPU-blocked the
    postproc stream. Same number of host-visible syncs (2), but the first
    is a 4-byte DtoH instead of a stream drain, and both are on a dedicated
    pinned path so the copy engine overlaps with the compute engine.

TRT graph plumbing (common/trt.py, rfdetr_instance_segmentation_trt.py):

  - Records a produce_event on the graph's own stream so consumers can
    wait_event instead of stream.synchronize(). Removes the unconditional
    stream.synchronize() in infer_from_trt_engine's graph-replay branch.

  - consumer_done_event field on TRTCudaGraphState lets the next graph
    replay chain on the consumer's last use of the output buffers.

  - _trt_reuse_as_input_buffer marker so fast preproc can write directly
    into the graph's captured input buffer, eliminating the per-frame DtoD.

Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 engine):

  v16 baseline (Triton preproc + postproc + CUDA graph)   150 FPS
  + triton_fullpost + deferred counter sync (this commit)  151 FPS

Parity vs v16 baseline: 0-diff across all 538 frames (bit-exact xyxy,
conf, class_id, and mask MD5 per detection).

Env flags:

  RFDETR_TRITON_FULLPOSTPROC=true   opt-in; enables the full-fusion path

Two per-frame CUDA kernel launches visible in nsys on the v16 full-postproc
path that shouldn't be there:

  - direct_copy_kernel_cuda  (538 per 538-frame run on vehicles_312px)
  - vectorized_elementwise_kernel<FillFunctor<int>>  (538 / 538)

direct_copy was class_mapping.to(dtype=torch.int32) firing every frame —
upstream stores the mapping as int64, our Triton kernel needs int32, and
the wrapper re-converts on every call since the dtype check always fails.
Cache the converted view keyed by id(source_tensor).

FillFunctor was torch.zeros((1,), ...) for the atomic counter + torch.empty
for the three output scratch buffers. Moving to a persistent scratch cache
keyed on (num_queries, device) drops 3 torch.empty allocator calls per
frame and replaces torch.zeros with an explicit counter.zero_() (still
launches FillFunctor — no safe way to inline into the filter kernel since
concurrent blocks would race with the zero — but eliminates allocator
pressure and stabilizes pointer values for the Triton JIT cache).

After W7 the per-frame kernel launch count drops from 2 incidental-torch
kernels to 1, the 3 allocator calls are eliminated, and the adapter sees
stable-address scratch across frames (latent prerequisite for CUDA-graph
capture of the postproc path).

Impact:
  - direct_copy: 538 -> 0 (-100%)
  - FillFunctor: 538 -> 538 (unchanged; counter.zero_ still required)
  - torch.empty calls: 3/frame -> 0
  - Parity: 0-diff vs v16 best across 538 frames of vehicles_312px.
  - End-to-end FPS: 150 -> 151 (noise-level; serial CPU dispatch is the
    binding constraint, not mask kernel GPU time).
…312px)

Opt-in depth-2 ping-pong of Triton post-processing scratch buffers so the
adapter can drain frame N-1's results while frame N's kernels are in
flight. Gated behind RFDETR_PIPELINE_DEPTH=2; default 1 preserves the
current single-slot behavior bit-exactly.

Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 TRT):

  depth=1 (default):  151 FPS  (0-diff parity vs baseline)
  depth=2 (opt-in):   241 FPS  (+59.5% vs depth=1, +121% vs 109 FPS raw baseline)

Parity on depth=2:
  - Frame 0:      empty response (warmup; no prior frame to drain)
  - Frames 1..N:  bit-exact match to frames 0..N-1 of the depth=1 capture
                  (0 mismatches across 537 shifted pairs on vehicles_312px)
  - Last frame:   dropped (frame N of the depth=1 capture; its detections
                  live in the pending slot and no subsequent call drains them)

Semantic changes (opt-in only):

  The InferencePipeline sink receives `on_prediction(predictions_N-1,
  video_frame_N)` — a 1-frame skew between detections and their paired
  image. At 241 FPS the skew is ~4 ms, imperceptible for annotation /
  dashboards / analytics. For strict frame-detection pairing, keep
  depth=1.

Files:

  triton_fullpostproc.py
    - _get_scratch_buffers + _get_mask_bin_buffer now keyed by
      (num_queries, device, slot_idx). slot_idx=0 is the legacy single
      slot; slot_idx in {0, 1} for depth=2.
    - triton_rfdetr_fullpost accepts `num_slots` (default 1). When > 1 a
      module-level counter rotates slot assignment per call so the Triton
      kernels for frame N write to a different slot than the one the
      adapter is still reading for frame N-1.

  common.py
    - Reads RFDETR_PIPELINE_DEPTH and passes it as num_slots to
      triton_rfdetr_fullpost.

  inference_models_adapters.py
    - Factored the per-det response-build into _build_response_from_det.
    - Depth-2 branch: push current det+meta as pending, drain and return
      previous pending. First call returns an empty response.
    - Depth-1 path unchanged; it uses the same _build_response_from_det
      helper so both paths produce bit-exact output for a given det.

The pipelining works because the Triton postproc kernels and their DtoH
now run on their own ping-pong slot while the TRT engine + preproc for
the next frame kick off on the default / preproc streams. The binding
constraint before W1 was `cudaStreamSynchronize` = 80.9 pct of CUDA API
time: CPU was idle waiting for GPU. depth=2 overlaps that wait with the
next frame's GPU work.
@aseembits93 aseembits93 changed the title perf(rfdetr-seg): depth-2 frame pipeline (151 -> 241 FPS on vehicles_312px, +59%) perf(rfdetr-seg): Triton fusion + CUDA graphs + depth-2 frame pipeline (109 → 241 FPS on vehicles_312px) Apr 30, 2026