perf(rfdetr-seg): Triton fusion + CUDA graphs + scratch caching (109 → 151 FPS on vehicles_312px) #22

Open
aseembits93 wants to merge 5 commits into main from optimize-rfdetr-seg

aseembits93 commented Apr 29, 2026

Summary

RF-DETR nano seg TensorRT e2e latency optimizations, benchmarked on vehicles_312px.mp4 (538 frames, 312×176) with the minimal InferencePipeline benchmark script.

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Avg FPS (Tesla T4, FP16 TRT, vehicles_312px.mp4) | 109 | ~151 | +38% |
| Parity vs v16 best | | 0 diff across 538 frames | bit-exact |

All new paths are opt-in behind env vars; defaults preserve current behavior.

Benchmark:

python development/stream_interface/rfdetr_nano_seg_trt_workflow.py \
    --video_reference /home/ubuntu/inference/vehicles_312px.mp4

Flags to enable:

  • RFDETR_USE_TRITON_PREPROC=true
  • RFDETR_TRITON_FULLPOSTPROC=true
  • ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true

Commit-by-commit

| # | Commit | Summary |
|---|---|---|
| 1 | 419c5a7 initial point | Branch starting point. |
| 2 | 9ec6a50 fused Triton pre/postproc kernels, GPU mask decode, stream sync reduction | First landing: Triton preprocess kernel, Triton conf-filter postprocess, CUDA-graph enablement, drop pre/post stream syncs. |
| 3 | 807ea1a replace benchmark script with minimal InferencePipeline-based version | Switch benchmark harness to a 116-line InferencePipeline driver so measurements match production code paths. |
| 4 | 5ff27d1 full Triton postproc fusion + deferred counter sync (W2) | Collapses the post-TRT chain into two Triton kernels (_rfdetr_fullpost_filter_kernel + _rfdetr_fullpost_mask_kernel_compact). Removes the CPU-blocking counter.item() in the Triton wrapper; counter is now pinned-DtoH'd from the adapter under a torch.cuda.Event guard. Includes atomic-counter compaction, _trt_reuse_as_input_buffer marker (eliminates per-frame DtoD from preproc output → graph input buffer), pinned async DtoH, cross-stream event plumbing (no more stream.synchronize() in the graph-replay branch). |
| 5 | c1406a8 cache per-call scratch + class_mapping int32 view (W7) | Eliminates stray per-frame overhead that nsys flagged: the direct_copy_kernel_cuda launch (from class_mapping.to(dtype=torch.int32) every frame; now cached by id()) and the three torch.empty allocator calls per frame (now a persistent scratch cache keyed on (num_queries, device)). counter.zero_() still fires FillFunctor per frame (it cannot safely be inlined into the filter kernel, because concurrent blocks would race with the zero). |

What's in

New Triton kernels

  • inference_models/models/rfdetr/triton_preprocess.py — fused stretch-to resize + BGR→RGB + /255 + ImageNet normalize. Replaces ~8 torch CUDA kernels with 1.
  • inference_models/models/rfdetr/triton_postprocess.py — fused sigmoid + argmax-over-classes + class-remap + confidence-threshold filter. Used when RFDETR_TRITON_POSTPROC=true alone.
  • inference_models/models/rfdetr/triton_fullpostproc.py (new in W2) — full post-process fusion:
    • _rfdetr_fullpost_filter_kernel — per-query: sigmoid argmax + class remap + conf threshold + cxcywh→xyxy + letterbox-denormalize + clip + banker's rounding; tl.atomic_add(counter, 1) reserves a compact output slot.
    • _rfdetr_fullpost_mask_kernel_compact — GPU-side bilinear upsample 78×78 → orig_h × orig_w + threshold > 0 + uint8 emit. Reads counter on GPU for early-exit so no CPU sync between the two launches.
    • Returns unsliced buffers + counter + done_event; adapter does the host-side slice after a single pinned counter DtoH.
  • inference/models/rfdetr/triton_preprocess.py — same preprocess kernel wired into the legacy inference/models/rfdetr path for parity.
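As a reading aid for the filter kernel described above, here is a minimal sequential sketch of the same per-query steps in plain Python. It is not the Triton code: `filter_queries`, its argument layout, the `conf < threshold` comparison, the clip range, and the simple scale-by-frame-size denormalization are all assumptions made for illustration. The list length stands in for the atomic counter that reserves compact output slots.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def filter_queries(logits, boxes_cxcywh, class_mapping, threshold, orig_w, orig_h):
    """Sequential stand-in for the per-query filter step (hypothetical helper).

    logits: per query, one raw score per class.
    boxes_cxcywh: per query, (cx, cy, w, h) normalized to [0, 1].
    Returns compact rows (x1, y1, x2, y2, confidence, class_id).
    """
    out = []  # len(out) plays the role of the atomic counter
    for scores, (cx, cy, w, h) in zip(logits, boxes_cxcywh):
        # sigmoid is monotonic, so argmax over logits equals argmax over probabilities
        best = max(range(len(scores)), key=scores.__getitem__)
        conf = sigmoid(scores[best])
        if conf < threshold:  # comparison direction is an assumption
            continue
        # cxcywh -> xyxy, scaled back to the original frame (assumed denormalization)
        corners = (
            (cx - w / 2) * orig_w, (cy - h / 2) * orig_h,
            (cx + w / 2) * orig_w, (cy + h / 2) * orig_h,
        )
        limits = (orig_w, orig_h, orig_w, orig_h)
        # clip to the frame, then round; Python's round() rounds half to even
        xyxy = [round(min(max(v, 0.0), lim)) for v, lim in zip(corners, limits)]
        out.append((*xyxy, conf, class_mapping[best]))
    return out

rows = filter_queries(
    logits=[[-2.0, 3.0], [-4.0, -5.0], [0.5, 0.0]],
    boxes_cxcywh=[(0.5, 0.5, 0.5, 0.5), (0.1, 0.1, 0.1, 0.1), (0.9, 0.9, 0.4, 0.4)],
    class_mapping=[7, 3],
    threshold=0.5,
    orig_w=312,
    orig_h=176,
)
```

In the real kernel the surviving rows land in whatever order the atomic reservations complete, whereas this loop always emits them in query order.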

Wiring / dispatch

  • inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
    • Triton preprocess fast-path in pre_process() with strict guard (STRETCH_TO, numpy HWC BGR uint8, no static crop).
    • Pre-allocated input buffer with _trt_reuse_as_input_buffer marker — graph capture bakes this tensor's address as the engine's input buffer, eliminating the per-frame 1.1 MB DtoD copy.
    • Pinned host staging buffer (_fast_src_host_pinned) so copy_(non_blocking=True) actually runs async.
    • Cross-stream event recording (_fast_preproc_event) instead of pre_process_stream.synchronize().
  • inference_models/models/rfdetr/common.py — dispatches RFDETR_TRITON_FULLPOSTPROC first (batch=1, no static crop, no nonsquare-intermediate resize, class remapping active), falls back to Triton conf-filter, then to torch.
  • inference_models/models/common/trt.py:
    • TRTCudaGraphState.consumer_done_event lets the next graph replay wait on the consumer's last use of the output buffers (avoids DtoD clones of output buffers).
    • _capture_cuda_graph(use_pre_processed_images_as_input_buffer=...) supports capturing with an externally-owned input tensor.
    • Removed stream.synchronize() in infer_from_trt_engine's graph-replay branch; replaced with a produce_event recorded on the graph's own stream.
  • inference/core/models/inference_models_adapters.py — GPU fast-path for the _combined_gpu / _counter_gpu / _postproc_done_event side channels: waits the done_event, pinned-DtoH's the 4-byte counter, syncs once to read n_survivors, then slices combined[:n] + mask[:n] and pinned-DtoH's both async with a second sync. Bitcasts column 4 of the combined int32 buffer to fp32 via numpy.view(np.float32) for confidence.
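The confidence bitcast in the adapter can be illustrated without NumPy. The sketch below reinterprets the same four bytes as either int32 or fp32 using `struct`; the helper names and the row layout beyond column 4 (class id in column 5) are assumptions for illustration only.

```python
import struct

def float_to_int32_bits(value):
    """Reinterpret an fp32 value as a signed 32-bit integer (no numeric conversion)."""
    return struct.unpack("<i", struct.pack("<f", value))[0]

def int32_bits_to_float(bits):
    """Inverse reinterpretation: the same 4 bytes read back as fp32."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]

# One row of a hypothetical combined buffer: x1, y1, x2, y2, confidence bits, class id.
row = [78, 44, 234, 132, float_to_int32_bits(0.75), 3]
confidence = int32_bits_to_float(row[4])
```

Storing the confidence as raw bits lets boxes, class id, and confidence travel in one int32 buffer, so a single device-to-host copy carries all per-detection scalars.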

Scratch caching (W7)

  • _SCRATCH_CACHE — combined/survivor_idx/mask_any/counter tensors reused across frames, keyed by (num_queries, device).
  • _CLASS_MAPPING_INT32_CACHE — int32 view of the class_mapping tensor cached by id(source_tensor); previously re-converted every frame because upstream stores it as int64.
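A minimal sketch of the two caching patterns, using plain Python lists as stand-ins for GPU tensors. The function names and buffer shapes are hypothetical. One design point worth noting: CPython can reuse an `id()` once the original object is freed, so this sketch keeps a reference to the source object alongside the converted copy.

```python
_SCRATCH = {}      # (num_queries, device) -> reusable buffers
_INT32_VIEWS = {}  # id(source) -> (source, converted copy)

def get_scratch(num_queries, device):
    """Allocate the per-shape buffers once and hand back the same objects afterwards."""
    key = (num_queries, device)
    if key not in _SCRATCH:
        _SCRATCH[key] = {
            "combined": [[0] * 6 for _ in range(num_queries)],
            "survivor_idx": [0] * num_queries,
            "counter": [0],
        }
    scratch = _SCRATCH[key]
    scratch["counter"][0] = 0  # the counter still has to be reset on every call
    return scratch

def get_int32_view(class_mapping):
    """Convert once per source object; holding the source keeps its id() unique."""
    key = id(class_mapping)
    if key not in _INT32_VIEWS:
        _INT32_VIEWS[key] = (class_mapping, [int(v) for v in class_mapping])
    return _INT32_VIEWS[key][1]

first = get_scratch(300, "cuda:0")
first["counter"][0] = 5
second = get_scratch(300, "cuda:0")
mapping = [1, 2, 3]
```

Both caches trade a small amount of persistent memory for stable buffer identity across frames.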

Benchmark harness

  • development/stream_interface/rfdetr_nano_seg_trt_workflow.py: replaced with a minimal 116-line InferencePipeline-based driver. Uses the roboflow_core/roboflow_instance_segmentation_model@v3 workflow block; single video source, single sink, no annotators/buffering/rate-limiting. The --backend flag sets DISABLED_INFERENCE_MODELS_BACKENDS before any inference import, so measured results are attributable to a single backend.

Env vars (all opt-in)

| Var | Default | Effect |
|---|---|---|
| RFDETR_USE_TRITON_PREPROC | off | Enable fused Triton preprocess kernel |
| RFDETR_TRITON_POSTPROC | off | Enable Triton conf-filter (partial postproc fusion) |
| RFDETR_TRITON_FULLPOSTPROC | off | Enable full postproc fusion (W2) |
| RFDETR_GPU_POSTPROCESS | on | GPU mask-emptiness skip |
| RFDETR_DISABLE_GPU_PREPROC | off | Opt-out for torch GPU preprocess branch |
| ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND | off | Enables TRTCudaGraphCache |
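For readers wiring these flags into their own scripts, here is one way boolean env vars with the defaults from the table above could be read. `env_flag` and the set of accepted spellings are assumptions for illustration; the PR does not show how the flags are actually parsed.

```python
import os

_TRUE_STRINGS = {"1", "true", "yes", "on"}  # accepted spellings are an assumption

# Defaults copied from the table above.
DEFAULTS = {
    "RFDETR_USE_TRITON_PREPROC": False,
    "RFDETR_TRITON_POSTPROC": False,
    "RFDETR_TRITON_FULLPOSTPROC": False,
    "RFDETR_GPU_POSTPROCESS": True,
    "RFDETR_DISABLE_GPU_PREPROC": False,
    "ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND": False,
}

def env_flag(name, environ=None):
    """Read a boolean flag; an unset variable yields the documented default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return DEFAULTS[name]
    return raw.strip().lower() in _TRUE_STRINGS

example_env = {"RFDETR_TRITON_FULLPOSTPROC": "true", "RFDETR_GPU_POSTPROCESS": "false"}
```

Lower-casing the value means both `true` and `True` are accepted under this sketch.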

Test plan

  • pytest tests/inference/unit_tests/models/test_rfdetr.py -x -q — 11/11 pass
  • Benchmark on vehicles_312px.mp4 (538 frames, Tesla T4, FP16): ~151 FPS (baseline ~109)
  • Parity check on vehicles_312px — 0 diff across all 538 frames vs v16 best (bit-exact xyxy, confidence, class_id, per-detection mask MD5)
  • nsys: kernel count 170,530 → 2,931 (-98.3%) with CUDA graphs + full postproc fusion enabled; DtoD per-run 439 → 8 with _trt_reuse_as_input_buffer
  • Tracer confirms zero class_mapping.to() / direct_copy_kernel_cuda launches per frame after W7
  • Higher-resolution video (mask kernel compute scales with orig_h × orig_w)
  • Jetson Orin NX (pinned-memory + async-transfer wins should scale better)
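The percentages quoted in this description can be rechecked from the raw counts. The snippet below is plain arithmetic on numbers already stated above; `percent_change` is a helper introduced here for illustration.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

fps_gain = percent_change(109, 151)             # FPS, baseline vs current
kernel_change = percent_change(170_530, 2_931)  # nsys kernel count
dtod_change = percent_change(439, 8)            # DtoD copies per run
```

These come out at roughly +38.5%, -98.3%, and -98.2% respectively, consistent with the +38% and -98.3% figures quoted.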

What closed out

Supersedes #23 (same work, earlier snapshot).

Known remaining bubbles

From nsys on vehicles_best:

  • cudaStreamSynchronize is 80.9% of CUDA API time — CPU waits for GPU, GPU is not saturated. The serial per-frame dispatch is the next bottleneck.
  • _rfdetr_fullpost_mask_kernel_compact is 46.5% of GPU time (30 µs/frame). Microbenchmark suggests this kernel can drop to ~35 µs with BLOCK_H=8, BLOCK_W=128, num_warps=2, but end-to-end FPS doesn't change while the CPU dispatch is serial — GPU time already fits inside the sync wait.
  • Two follow-ups expected to meaningfully move FPS: (a) a 2-deep frame pipeline that overlaps frame N+1's HtoD with frame N's TRT graph, and (b) moving cv2.findContours off the critical path.
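To make follow-up (a) concrete, here is a toy sketch of a 2-deep pipeline in which the upload of frame N+1 overlaps the inference of frame N. It uses a thread and a bounded queue as stand-ins for CUDA streams and events; `run_pipelined`, `upload`, and `infer` are hypothetical names, and nothing here is taken from the PR's code.

```python
import queue
import threading

def run_pipelined(frames, upload, infer, depth=2):
    """Overlap upload(frame N+1) with infer(frame N) using a bounded hand-off queue."""
    # The bound limits how far the producer may run ahead of the consumer.
    staged = queue.Queue(maxsize=max(1, depth - 1))
    sentinel = object()

    def producer():
        for frame in frames:
            staged.put(upload(frame))
        staged.put(sentinel)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        item = staged.get()
        if item is sentinel:
            break
        results.append(infer(item))
    worker.join()
    return results

outputs = run_pipelined(range(5), upload=lambda f: f * 2, infer=lambda x: x + 1)
```

In CPython the two stages only overlap in practice when they release the GIL, which blocking I/O and GPU calls typically do. The sketch also omits error handling: an exception inside `upload` would leave the consumer waiting.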

claude added 2 commits April 28, 2026 22:33
… stream sync reduction

Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine,
example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes
below + enabling the existing CUDA-graph cache:

  Baseline (no changes)                            93.07 FPS
  + Triton preprocess (fused resize+BGR2RGB+norm)  ~93 FPS   (U6)
  + U7 mask-decode skip for empty masks            ~94 FPS   (flag-gated)
  + Triton postprocess conf-filter                  98.6 FPS (+5.9%)
  + ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True  102.1 FPS  (+9.7%)
  + Drop pre/post stream syncs                    102.2 FPS  (+9.8%)

Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs
cv2.resize rounding at mask boundaries). Unit tests pass (11/11).

Changes (all flag-gated, opt-in):

inference_models/models/rfdetr/triton_preprocess.py (new)
  One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet
  normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA
  kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.
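The per-pixel arithmetic that the fused kernel performs after resizing can be sketched as follows. The mean and standard deviation are the standard ImageNet values; whether the kernel uses exactly these constants, and the helper name `normalize_bgr_pixel`, are assumptions. The resize step is omitted.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # RGB order
IMAGENET_STD = (0.229, 0.224, 0.225)   # RGB order

def normalize_bgr_pixel(b, g, r):
    """BGR uint8 pixel -> RGB float triple, scaled to [0, 1] then mean/std normalized."""
    rgb = (r, g, b)  # channel swap: BGR -> RGB
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

pure_red = normalize_bgr_pixel(0, 0, 255)
```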

inference_models/models/rfdetr/triton_postprocess.py (new)
  One Triton kernel fusing sigmoid + argmax-over-classes + class-remap +
  confidence-threshold filter. Replaces ~14k small cub/torch kernels with
  431 (1 per frame). Supports both per-class threshold vector and scalar,
  with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.

inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
  - Wire the Triton preprocess fast-path in pre_process() with a guarded
    dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
  - Cache pre-allocated input buffer and normalization constants on model
    instance on first call.
  - Replace pre_process_stream.synchronize() with a CUDA event ev.wait()
    on the inference stream so the CPU doesn't stall waiting for the
    preprocessing Triton kernel to finish.
  - Drop the post_process_stream.synchronize() (the adapter's subsequent
    .cpu() calls provide the implicit sync).

inference_models/models/rfdetr/common.py
  Wire the Triton postprocess conf-filter into
  post_process_instance_segmentation_results. Falls back to torch path
  when the model has no remapping table, is CPU-bound, or Triton is
  unavailable.

inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
  Same Triton preprocess kernel + dispatch for the legacy inference
  package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS
  default routes to inference_models adapters) but kept for parity so the
  legacy path benefits if exercised.

inference/core/models/inference_models_adapters.py
  GPU mask-decode fast-path (U7): reduce mask emptiness with .any(dim=(1,2))
  on GPU, only DtoH + cv2.findContours non-empty masks. Gated via
  RFDETR_GPU_POSTPROCESS=true (default on). Produces identical output to
  the reference path.

Env vars introduced:
  RFDETR_USE_TRITON_PREPROC=true         opt-in; fused preproc kernel
  RFDETR_TRITON_POSTPROC=true            opt-in; fused postproc conf filter
  RFDETR_GPU_POSTPROCESS=true            default on; GPU mask emptiness skip
  RFDETR_DISABLE_GPU_PREPROC=true        opt-out; disable torch GPU preproc
  ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True   enables existing TRT CUDA graph cache

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aseembits93 pushed a commit that referenced this pull request Apr 29, 2026
…on (+1.8% FPS)

Follow-up optimizations on the full-postproc fusion to eliminate the
downstream plumbing kernels still visible in nsys after PR #23.

Results (5-run avg on Tesla T4, example_video, 431 frames):

  Baseline                             93.07 FPS
  Partial fusion (PR #22)             102.74 FPS  (+10.3%)
  Full fusion v2 (PR #23 initial)     106.05 FPS  (+13.9%)
  Full fusion v3 (this commit)        107.94 FPS  (+15.9%)

Changes:

1. `triton_fullpostproc.py`:
   - Filter kernel: rounds + casts xyxy to int32 inside the kernel
     (replaces a downstream `.round().int()` elementwise kernel).
   - Mask kernel: accumulates per-survivor `mask_any` via tile-level
     `tl.atomic_max`, eliminating a downstream `det.mask.any(dim=(1,2))`
     reduce. Writes to compact `(n_survivors, H, W)` output instead of
     padded `(num_queries, H, W)` — removes the 20 ms/run `torch.zeros`
     fill kernel.
   - Host wrapper: returns compact tensors directly (xyxy_int32,
     conf, cls_id, mask_bin uint8, mask_any bool), all sized to
     n_survivors. The single `keep.nonzero()` remains as the only DtoH
     sync in the fused path.

2. `common.py`:
   - Build `InstanceDetections` with compact tensors directly; skip
     the `xyxy[keep].round().int()` / `.bool()` / `[keep]` indexing
     chain that added 5 kernels per frame.
   - Attach precomputed `mask_any` onto the detections object via
     `__dict__` so the adapter can use it without recomputation.

3. `inference_models_adapters.py`:
   - Use `getattr(det, "mask_any", None)` from the full-fusion path
     when available; fall back to `.any(dim=(1,2))` otherwise.

nsys kernel count shrinkage (per 431-frame run):
                          total_k   legacy_pp   pp_ms
  FULL v2                  12,315     10,900    60ms
  FULL v3                   7,026      4,669    17ms   (-43% total, -57% pp, -72% pp_ms)

Specifically eliminated:
- DeviceSelectSweep/Reduce/CompactInit triple: 2155 -> 431 each (5 nonzero()
  calls per frame collapsed to 1).
- FillFunctor<uint8>: 431 -> 0 (no more padded-output torch.zeros).
- or_kernel_cuda reduce: 314 -> 0 (mask_any precomputed by Triton).
- round_kernel_cuda: 314 -> 0 (xyxy rounded inside Triton kernel).
- direct_copy_kernel: 1804 -> ~2746 total (slight increase due to index_select
  on scalar fields, but they're tiny).

Parity: 4/431 frames differ by +/-1 detection vs baseline — same 4 frames
as PR #22 and PR #23 v2, no new divergence. Unit tests pass (11/11).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aseembits93 added a commit that referenced this pull request Apr 29, 2026
…on (+1.8% FPS)

… (W2)

Adds triton_fullpostproc.py with two fused Triton kernels that replace the
entire post-TRT chain for the common rfdetr-seg-nano path (batch=1, no
static crop, stretch-to resize, class remapping active):

  _rfdetr_fullpost_filter_kernel  (grid = num_queries)
    sigmoid argmax + class remap + conf threshold + cxcywh->xyxy +
    letterbox-denormalize + clip + round; atomic_add into counter to reserve
    a compact output slot.

  _rfdetr_fullpost_mask_kernel_compact  (grid = num_queries * tile_y * tile_x)
    GPU-side bilinear upsample 78x78 -> orig_h x orig_w + threshold > 0 +
    uint8 emit. Early-exits on s >= counter[0] without an intermediate sync.
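A CPU reference for the upsample-and-threshold step might look like the sketch below. The half-pixel-centre sampling convention with edge clamping is an assumption (the commit message does not state which convention the kernel uses), and `upsample_and_threshold` is a hypothetical name.

```python
def upsample_and_threshold(mask, out_h, out_w):
    """Bilinear upsample of a 2-D list of logits, then emit 1 where the value is > 0."""
    in_h, in_w = len(mask), len(mask[0])

    def source_coord(dst, in_size, out_size):
        # Half-pixel-centre mapping, clamped to the valid source range (assumption).
        src = (dst + 0.5) * in_size / out_size - 0.5
        src = min(max(src, 0.0), in_size - 1.0)
        lo = int(src)
        hi = min(lo + 1, in_size - 1)
        return lo, hi, src - lo

    out = []
    for y in range(out_h):
        y0, y1, fy = source_coord(y, in_h, out_h)
        row = []
        for x in range(out_w):
            x0, x1, fx = source_coord(x, in_w, out_w)
            top = mask[y0][x0] * (1 - fx) + mask[y0][x1] * fx
            bottom = mask[y1][x0] * (1 - fx) + mask[y1][x1] * fx
            value = top * (1 - fy) + bottom * fy
            row.append(1 if value > 0 else 0)
        out.append(row)
    return out

binary = upsample_and_threshold([[-1.0, 1.0]], out_h=1, out_w=4)
```

A reference like this is handy for parity checks, since differences in sampling convention show up as single-pixel disagreements along mask boundaries.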

Adapter (inference_models_adapters.py):

  - New fast path keyed on _combined_gpu/_counter_gpu/_postproc_done_event
    side-channels. Adapter stream waits the done_event, pinned-DtoH's the
    4-byte counter, syncs once to read n_survivors, then slices combined
    and mask to n_survivors and pinned-DtoH's both async, syncing again.

  - Replaces the prior in-Triton int(counter.item()) that CPU-blocked the
    postproc stream. Same number of host-visible syncs (2), but the first
    is a 4-byte DtoH instead of a stream drain, and both are on a dedicated
    pinned path so the copy engine overlaps with the compute engine.

TRT graph plumbing (common/trt.py, rfdetr_instance_segmentation_trt.py):

  - Records a produce_event on the graph's own stream so consumers can
    wait_event instead of stream.synchronize(). Removes the unconditional
    stream.synchronize() in infer_from_trt_engine's graph-replay branch.

  - consumer_done_event field on TRTCudaGraphState lets the next graph
    replay chain on the consumer's last use of the output buffers.

  - _trt_reuse_as_input_buffer marker so fast preproc can write directly
    into the graph's captured input buffer, eliminating the per-frame DtoD.

Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 engine):

  v16 baseline (Triton preproc + postproc + CUDA graph)   150 FPS
  + triton_fullpost + deferred counter sync (this commit)  151 FPS

Parity vs v16 baseline: 0-diff across all 538 frames (bit-exact xyxy,
conf, class_id, and mask MD5 per detection).

Env flags:

  RFDETR_TRITON_FULLPOSTPROC=true   opt-in; enables the full-fusion path

Two per-frame CUDA kernel launches visible in nsys on the v16 full-postproc
path that shouldn't be there:

  - direct_copy_kernel_cuda  (538 per 538-frame run on vehicles_312px)
  - vectorized_elementwise_kernel<FillFunctor<int>>  (538 / 538)

direct_copy was class_mapping.to(dtype=torch.int32) firing every frame —
upstream stores the mapping as int64, our Triton kernel needs int32, and
the wrapper re-converts on every call since the dtype check always fails.
Cache the converted view keyed by id(source_tensor).

FillFunctor was torch.zeros((1,), ...) for the atomic counter + torch.empty
for the three output scratch buffers. Moving to a persistent scratch cache
keyed on (num_queries, device) drops 3 torch.empty allocator calls per
frame and replaces torch.zeros with an explicit counter.zero_() (still
launches FillFunctor — no safe way to inline into the filter kernel since
concurrent blocks would race with the zero — but eliminates allocator
pressure and stabilizes pointer values for the Triton JIT cache).

After W7 the per-frame kernel launch count drops from 2 incidental-torch
kernels to 1, the 3 allocator calls are eliminated, and the adapter sees
stable-address scratch across frames (latent prerequisite for CUDA-graph
capture of the postproc path).

Impact:
  - direct_copy: 538 -> 0 (-100%)
  - FillFunctor: 538 -> 538 (unchanged; counter.zero_ still required)
  - torch.empty calls: 3/frame -> 0
  - Parity: 0-diff vs v16 best across 538 frames of vehicles_312px.
  - End-to-end FPS: 150 -> 151 (noise-level; serial CPU dispatch is the
    binding constraint, not mask kernel GPU time).
aseembits93 changed the title from "perf(rfdetr-seg): fused Triton pre/postproc kernels, GPU mask decode, sync reduction" to "perf(rfdetr-seg): Triton fusion + CUDA graphs + scratch caching (109 → 151 FPS on vehicles_312px)" Apr 29, 2026