perf(rfdetr-seg): Triton fusion + CUDA graphs + scratch caching (109 → 151 FPS on vehicles_312px)#22
Open
aseembits93 wants to merge 5 commits into
… stream sync reduction
Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine,
example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes
below + enabling the existing CUDA-graph cache:
Baseline (no changes)                              93.07 FPS
+ Triton preprocess (fused resize+BGR2RGB+norm)    ~93 FPS    (U6)
+ U7 mask-decode skip for empty masks              ~94 FPS    (flag-gated)
+ Triton postprocess conf-filter                   98.6 FPS   (+5.9%)
+ ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True     102.1 FPS  (+9.7%)
+ Drop pre/post stream syncs                       102.2 FPS  (+9.8%)
Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs
cv2.resize rounding at mask boundaries). Unit tests pass (11/11).
Changes (all flag-gated; opt-in unless noted):
inference_models/models/rfdetr/triton_preprocess.py (new)
One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet
normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA
kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.
inference_models/models/rfdetr/triton_postprocess.py (new)
One Triton kernel fusing sigmoid + argmax-over-classes + class-remap +
confidence-threshold filter. Replaces ~14k small cub/torch kernels with
431 (1 per frame). Supports both per-class threshold vector and scalar,
with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.
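The fused filter logic can be illustrated with a plain-Python reference. This is a sketch, not the kernel: `conf_filter` is a hypothetical name, and two details are assumptions that the source does not state, namely that the per-class threshold is indexed by the raw (pre-remap) class and that a detection is kept when confidence is greater than or equal to the threshold.

```python
import math


def _sigmoid(x):
    # Numerically stable sigmoid for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def conf_filter(logits, threshold, class_remap=None):
    """Return (query_idx, class_id, confidence) for each surviving query.

    logits: list of per-query lists of class logits.
    threshold: scalar, or a per-class list (assumed indexed by raw class).
    class_remap: optional list mapping raw class index -> output class id.
    """
    kept = []
    for q, row in enumerate(logits):
        # Sigmoid is monotonic, so argmax over logits equals argmax over
        # sigmoid(logits); only the winning logit needs the sigmoid.
        best = max(range(len(row)), key=row.__getitem__)
        conf = _sigmoid(row[best])
        thr = threshold[best] if isinstance(threshold, (list, tuple)) else threshold
        if conf < thr:
            continue
        cls = class_remap[best] if class_remap is not None else best
        kept.append((q, cls, conf))
    return kept
```

For example, `conf_filter([[0.0, 2.0], [-3.0, -1.0]], 0.5, class_remap=[10, 20])` keeps only query 0, with class id 20.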
inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
- Wire the Triton preprocess fast-path in pre_process() with a guarded
dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
- Cache pre-allocated input buffer and normalization constants on model
instance on first call.
- Replace pre_process_stream.synchronize() with a CUDA event ev.wait()
on the inference stream so the CPU doesn't stall waiting for the
preprocessing Triton kernel to finish.
- Drop the post_process_stream.synchronize() (the adapter's subsequent
.cpu() calls provide the implicit sync).
inference_models/models/rfdetr/common.py
Wire the Triton postprocess conf-filter into
post_process_instance_segmentation_results. Falls back to the torch path
when the model has no remapping table, runs on the CPU, or Triton is
unavailable.
inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
Same Triton preprocess kernel + dispatch for the legacy inference
package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS
default routes to inference_models adapters) but kept for parity so the
legacy path benefits if exercised.
inference/core/models/inference_models_adapters.py
GPU mask-decode fast-path (U7): compute mask emptiness with .any(dim=(1,2))
on GPU, then run the DtoH copy and cv2.findContours only for non-empty
masks. Gated via RFDETR_GPU_POSTPROCESS=true (default on). Produces
identical output to the reference path.
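The control flow of the U7 skip can be sketched without a GPU. In this illustration `decode_non_empty` and the `decode` callback are hypothetical stand-ins (the callback plays the role of the DtoH copy plus cv2.findContours), and returning an empty list for an empty mask is an assumption about the output shape.

```python
def decode_non_empty(masks, decode):
    """Run `decode` only on masks with at least one set pixel.

    masks: list of 2D grids of 0/1 values.
    decode: callable standing in for the DtoH copy + contour extraction.
    Returns one entry per mask; empty masks yield [] without calling decode.
    """
    # Stand-in for the single GPU reduction masks.any(dim=(1, 2)).
    non_empty = [any(any(row) for row in mask) for mask in masks]
    return [decode(mask) if keep else [] for mask, keep in zip(masks, non_empty)]
```

The saving comes from the expensive per-mask work being skipped entirely for masks that could only ever produce an empty contour list.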
Env vars introduced:
RFDETR_USE_TRITON_PREPROC=true opt-in; fused preproc kernel
RFDETR_TRITON_POSTPROC=true opt-in; fused postproc conf filter
RFDETR_GPU_POSTPROCESS=true default on; GPU mask emptiness skip
RFDETR_DISABLE_GPU_PREPROC=true opt-out; disable torch GPU preproc
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True enables existing TRT CUDA graph cache
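Since the flags above appear with both `true` and `True`, a case-insensitive reader is the safer way to consume them. The helper below is a minimal sketch; `env_flag` is a hypothetical name and the accepted spellings are an assumption, not the repository's actual parsing code.

```python
import os


def env_flag(name, default=False):
    """Read a boolean environment flag, case-insensitively.

    Hypothetical helper: unset variables return `default`; the set of
    truthy spellings is an assumption.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Opt-in flags default to off; the default-on flag is passed default=True.
use_triton_preproc = env_flag("RFDETR_USE_TRITON_PREPROC")
gpu_postprocess = env_flag("RFDETR_GPU_POSTPROCESS", default=True)
```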
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aseembits93 pushed a commit that referenced this pull request on Apr 29, 2026
…on (+1.8% FPS)
Follow-up optimizations on the full-postproc fusion to eliminate the
downstream plumbing kernels still visible in nsys after PR #23.
Results (5-run avg on Tesla T4, example_video, 431 frames):
Baseline                           93.07 FPS
Partial fusion (PR #22)            102.74 FPS (+10.3%)
Full fusion v2 (PR #23 initial)    106.05 FPS (+13.9%)
Full fusion v3 (this commit)       107.94 FPS (+15.9%)
Changes:
1. `triton_fullpostproc.py`:
- Filter kernel: rounds + casts xyxy to int32 inside the kernel (replaces
a downstream `.round().int()` elementwise kernel).
- Mask kernel: accumulates per-survivor `mask_any` via tile-level
`tl.atomic_max`, eliminating a downstream `det.mask.any(dim=(1,2))` reduce.
Writes to compact `(n_survivors, H, W)` output instead of padded
`(num_queries, H, W)`; this removes the 20 ms/run `torch.zeros` fill kernel.
- Host wrapper: returns compact tensors directly (xyxy_int32, conf, cls_id,
mask_bin uint8, mask_any bool), all sized to n_survivors. The single
`keep.nonzero()` remains as the only DtoH sync in the fused path.
2. `common.py`:
- Build `InstanceDetections` with compact tensors directly; skip the
`xyxy[keep].round().int()` / `.bool()` / `[keep]` indexing chain that
added 5 kernels per frame.
- Attach precomputed `mask_any` onto the detections object via `__dict__`
so the adapter can use it without recomputation.
3. `inference_models_adapters.py`:
- Use `getattr(det, "mask_any", None)` from the full-fusion path when
available; fall back to `.any(dim=(1,2))` otherwise.
nsys kernel count shrinkage (per 431-frame run):
           total_k    legacy_pp    pp_ms
FULL v2    12,315     10,900       60ms
FULL v3    7,026      4,669        17ms
(-43% total, -57% pp, -72% pp_ms)
Specifically eliminated:
- DeviceSelectSweep/Reduce/CompactInit triple: 2155 -> 431 each
(5 nonzero() calls per frame collapsed to 1).
- FillFunctor<uint8>: 431 -> 0 (no more padded-output torch.zeros).
- or_kernel_cuda reduce: 314 -> 0 (mask_any precomputed by Triton).
- round_kernel_cuda: 314 -> 0 (xyxy rounded inside Triton kernel).
- direct_copy_kernel: 1804 -> ~2746 total (slight increase due to
index_select on scalar fields, but they're tiny).
Parity: 4/431 frames differ by +/-1 detection vs baseline; these are the
same 4 frames as PR #22 and PR #23 v2, with no new divergence.
Unit tests pass (11/11).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (W2)
Adds triton_fullpostproc.py with two fused Triton kernels that replace the
entire post-TRT chain for the common rfdetr-seg-nano path (batch=1, no
static crop, stretch-to resize, class remapping active):
_rfdetr_fullpost_filter_kernel (grid = num_queries)
sigmoid argmax + class remap + conf threshold + cxcywh->xyxy +
letterbox-denormalize + clip + round; atomic_add into counter to reserve
a compact output slot.
_rfdetr_fullpost_mask_kernel_compact (grid = num_queries * tile_y * tile_x)
GPU-side bilinear upsample 78x78 -> orig_h x orig_w + threshold > 0 +
uint8 emit. Early-exits on s >= counter[0] without an intermediate sync.
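The slot-reservation idea in the filter kernel can be shown with a sequential Python emulation. This is a sketch under assumptions: `filter_and_compact` is a hypothetical name, boxes are assumed to be normalized cxcywh in [0, 1], and the denormalize step is modelled as a plain scale to the original image size. Python's built-in `round` rounds half to even, which matches the banker's rounding this PR describes.

```python
def filter_and_compact(boxes_cxcywh, confs, threshold, img_w, img_h):
    """Emulate the filter kernel: threshold, convert, clip, round, compact.

    Returns (rows, n_survivors); each row is (x1, y1, x2, y2, query_idx).
    """
    counter = 0                         # stands in for the GPU atomic counter
    out = [None] * len(boxes_cxcywh)    # one slot per query, like the scratch
    for q, ((cx, cy, w, h), conf) in enumerate(zip(boxes_cxcywh, confs)):
        if conf < threshold:
            continue
        # cxcywh -> xyxy, scaled to pixel coordinates (assumed normalized input).
        x1, y1 = (cx - w / 2) * img_w, (cy - h / 2) * img_h
        x2, y2 = (cx + w / 2) * img_w, (cy + h / 2) * img_h
        # Clip to the image bounds, then round half to even.
        x1, x2 = (min(max(v, 0.0), img_w) for v in (x1, x2))
        y1, y2 = (min(max(v, 0.0), img_h) for v in (y1, y2))
        row = (round(x1), round(y1), round(x2), round(y2), q)
        # An atomic add returns the old value, which becomes this
        # survivor's compact output slot.
        slot = counter
        counter += 1
        out[slot] = row
    return out[:counter], counter
```

On a GPU the order in which programs reserve slots is not guaranteed, so the emulation's query-order output is a simplification; the query index is carried in each row so a consumer can still tell which query a slot came from.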
Adapter (inference_models_adapters.py):
- New fast path keyed on _combined_gpu/_counter_gpu/_postproc_done_event
side-channels. Adapter stream waits the done_event, pinned-DtoH's the
4-byte counter, syncs once to read n_survivors, then slices combined
and mask to n_survivors and pinned-DtoH's both async, syncing again.
- Replaces the prior int(counter.item()) in the Triton wrapper, which
CPU-blocked the postproc stream. Same number of host-visible syncs (2),
but the first is a 4-byte DtoH instead of a stream drain, and both are on
a dedicated pinned path so the copy engine overlaps with the compute engine.
TRT graph plumbing (common/trt.py, rfdetr_instance_segmentation_trt.py):
- Records a produce_event on the graph's own stream so consumers can
wait_event instead of stream.synchronize(). Removes the unconditional
stream.synchronize() in infer_from_trt_engine's graph-replay branch.
- consumer_done_event field on TRTCudaGraphState lets the next graph
replay chain on the consumer's last use of the output buffers.
- _trt_reuse_as_input_buffer marker so fast preproc can write directly
into the graph's captured input buffer, eliminating the per-frame DtoD.
Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 engine):
v16 baseline (Triton preproc + postproc + CUDA graph) 150 FPS
+ triton_fullpost + deferred counter sync (this commit) 151 FPS
Parity vs v16 baseline: 0-diff across all 538 frames (bit-exact xyxy,
conf, class_id, and mask MD5 per detection).
Env flags:
RFDETR_TRITON_FULLPOSTPROC=true opt-in; enables the full-fusion path
Two per-frame CUDA kernel launches visible in nsys on the v16 full-postproc
path that shouldn't be there:
- direct_copy_kernel_cuda (538 per 538-frame run on vehicles_312px)
- vectorized_elementwise_kernel<FillFunctor<int>> (538 / 538)
direct_copy was class_mapping.to(dtype=torch.int32) firing every frame —
upstream stores the mapping as int64, our Triton kernel needs int32, and
the wrapper re-converts on every call since the dtype check always fails.
Cache the converted view keyed by id(source_tensor).
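A cache keyed by `id()` has one known pitfall: an id can be reused once the original object is garbage-collected. The sketch below shows one way to guard against that by storing the source object next to the converted value. The names `_CONVERTED_CACHE` and `cached_convert` are hypothetical, and whether the PR's cache takes the same precaution is not stated in the source.

```python
_CONVERTED_CACHE = {}


def cached_convert(source, convert):
    """Return convert(source), computed once per distinct source object.

    The cache entry keeps a strong reference to `source`, so its id cannot
    be recycled for a different object while the entry is alive.
    """
    key = id(source)
    entry = _CONVERTED_CACHE.get(key)
    if entry is None or entry[0] is not source:
        entry = (source, convert(source))
        _CONVERTED_CACHE[key] = entry
    return entry[1]
```

With a torch tensor the call would look like `cached_convert(class_mapping, lambda t: t.to(dtype=torch.int32))`, so the conversion kernel fires only on the first frame.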
FillFunctor was torch.zeros((1,), ...) for the atomic counter + torch.empty
for the three output scratch buffers. Moving to a persistent scratch cache
keyed on (num_queries, device) drops 3 torch.empty allocator calls per
frame and replaces torch.zeros with an explicit counter.zero_() (still
launches FillFunctor — no safe way to inline into the filter kernel since
concurrent blocks would race with the zero — but eliminates allocator
pressure and stabilizes pointer values for the Triton JIT cache).
After W7 the per-frame kernel launch count drops from 2 incidental-torch
kernels to 1, the 3 allocator calls are eliminated, and the adapter sees
stable-address scratch across frames (latent prerequisite for CUDA-graph
capture of the postproc path).
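The persistent scratch cache can be sketched with plain Python lists standing in for device tensors. This is illustrative only: `get_scratch`, `_SCRATCH` and `row_width` are hypothetical, while the buffer names follow the ones this PR lists for its scratch cache.

```python
_SCRATCH = {}


def get_scratch(num_queries, device, row_width=6):
    """Return reusable scratch buffers for (num_queries, device).

    Buffers are allocated once per key and reused on later calls; only the
    counter is reset each call, mirroring counter.zero_().
    """
    key = (num_queries, device)
    bufs = _SCRATCH.get(key)
    if bufs is None:
        bufs = {
            "combined": [[0] * row_width for _ in range(num_queries)],
            "survivor_idx": [0] * num_queries,
            "mask_any": [False] * num_queries,
            "counter": [0],
        }
        _SCRATCH[key] = bufs
    bufs["counter"][0] = 0  # the only per-frame write that remains
    return bufs
```

Because the same objects come back on every call, downstream code sees stable buffer addresses across frames, which is the property the commit message calls out as a prerequisite for CUDA-graph capture.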
Impact:
- direct_copy: 538 -> 0 (-100%)
- FillFunctor: 538 -> 538 (unchanged; counter.zero_ still required)
- torch.empty calls: 3/frame -> 0
- Parity: 0-diff vs v16 best across 538 frames of vehicles_312px.
- End-to-end FPS: 150 -> 151 (noise-level; serial CPU dispatch is the
binding constraint, not mask kernel GPU time).
This was referenced Apr 29, 2026
Summary
RF-DETR nano seg TensorRT e2e latency optimizations, benchmarked on
vehicles_312px.mp4 (538 frames, 312×176) with the minimal InferencePipeline
benchmark script. All new paths are opt-in behind env vars; defaults
preserve current behavior.
Benchmark:
python development/stream_interface/rfdetr_nano_seg_trt_workflow.py \
    --video_reference /home/ubuntu/inference/vehicles_312px.mp4
Flags to enable:
RFDETR_USE_TRITON_PREPROC=true
RFDETR_TRITON_FULLPOSTPROC=true
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true
Commit-by-commit
- 419c5a7: initial point
- 9ec6a50: fused Triton pre/postproc kernels, GPU mask decode, stream sync reduction
- 807ea1a: replace benchmark script with minimal InferencePipeline-based version
- 5ff27d1: full Triton postproc fusion + deferred counter sync (W2)
  Two fused Triton kernels (_rfdetr_fullpost_filter_kernel + _rfdetr_fullpost_mask_kernel_compact). Removes the CPU-blocking counter.item() in the Triton wrapper; the counter is now pinned-DtoH'd from the adapter under a torch.cuda.Event guard. Includes atomic-counter compaction, the _trt_reuse_as_input_buffer marker (eliminates the per-frame DtoD from preproc output → graph input buffer), pinned async DtoH, and cross-stream event plumbing (no more stream.synchronize() in the graph-replay branch).
- c1406a8: cache per-call scratch + class_mapping int32 view (W7)
  Eliminates direct_copy_kernel_cuda (from class_mapping.to(dtype=torch.int32) every frame; now cached by id()) and the three torch.empty allocator calls per frame (now a persistent scratch cache keyed on (num_queries, device)). counter.zero_() still fires FillFunctor per frame (inlining the zero into the filter kernel is not safe: concurrent blocks would race with it).
What's in
New Triton kernels
- inference_models/models/rfdetr/triton_preprocess.py: fused stretch-to resize + BGR→RGB + /255 + ImageNet normalize. Replaces ~8 torch CUDA kernels with 1.
- inference_models/models/rfdetr/triton_postprocess.py: fused sigmoid + argmax-over-classes + class-remap + confidence-threshold filter. Used when RFDETR_TRITON_POSTPROC=true alone.
- inference_models/models/rfdetr/triton_fullpostproc.py (new in W2): full post-process fusion:
  - _rfdetr_fullpost_filter_kernel: per-query sigmoid argmax + class remap + conf threshold + cxcywh→xyxy + letterbox-denormalize + clip + banker's rounding; tl.atomic_add(counter, 1) reserves a compact output slot.
  - _rfdetr_fullpost_mask_kernel_compact: GPU-side bilinear upsample 78×78 → orig_h × orig_w + threshold > 0 + uint8 emit. Reads the counter on GPU for early-exit, so there is no CPU sync between the two launches.
  - … counter + done_event; the adapter does the host-side slice after a single pinned counter DtoH.
- inference/models/rfdetr/triton_preprocess.py: same preprocess kernel wired into the legacy inference/models/rfdetr path for parity.
Wiring / dispatch
- inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
  - Triton preprocess fast-path in pre_process() with strict guard (STRETCH_TO, numpy HWC BGR uint8, no static crop).
  - _trt_reuse_as_input_buffer marker: graph capture bakes this tensor's address as the engine's input buffer, eliminating the per-frame 1.1 MB DtoD copy.
  - … (_fast_src_host_pinned) so copy_(non_blocking=True) actually runs async.
  - … (_fast_preproc_event) instead of pre_process_stream.synchronize().
- inference_models/models/rfdetr/common.py: dispatches RFDETR_TRITON_FULLPOSTPROC first (batch=1, no static crop, no nonsquare-intermediate resize, class remapping active), falls back to the Triton conf-filter, then to torch.
- inference_models/models/common/trt.py:
  - TRTCudaGraphState.consumer_done_event lets the next graph replay wait on the consumer's last use of the output buffers (avoids DtoD clones of output buffers).
  - _capture_cuda_graph(use_pre_processed_images_as_input_buffer=...) supports capturing with an externally-owned input tensor.
  - Removes the unconditional stream.synchronize() in infer_from_trt_engine's graph-replay branch; replaced with a produce_event recorded on the graph's own stream.
- inference/core/models/inference_models_adapters.py: GPU fast-path for the _combined_gpu/_counter_gpu/_postproc_done_event side channels: waits on the done_event, pinned-DtoH's the 4-byte counter, syncs once to read n_survivors, then slices combined[:n] + mask[:n] and pinned-DtoH's both async with a second sync. Bitcasts column 4 of the combined int32 buffer to fp32 via numpy.view(np.float32) for confidence.
Scratch caching (W7)
- _SCRATCH_CACHE: combined/survivor_idx/mask_any/counter tensors reused across frames, keyed by (num_queries, device).
- _CLASS_MAPPING_INT32_CACHE: int32 view of the class_mapping tensor cached by id(source_tensor); previously re-converted every frame because upstream stores it as int64.
Benchmark harness
- development/stream_interface/rfdetr_nano_seg_trt_workflow.py: replaced with a minimal 116-line InferencePipeline-based driver. Uses the roboflow_core/roboflow_instance_segmentation_model@v3 workflow block; single video source, single sink, no annotators/buffering/rate-limiting.
- The --backend flag sets DISABLED_INFERENCE_MODELS_BACKENDS before import for unambiguous backend attribution.
Env vars (all opt-in)
- RFDETR_USE_TRITON_PREPROC
- RFDETR_TRITON_POSTPROC
- RFDETR_TRITON_FULLPOSTPROC
- RFDETR_GPU_POSTPROCESS
- RFDETR_DISABLE_GPU_PREPROC
- ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND (TRTCudaGraphCache)
Test plan
- pytest tests/inference/unit_tests/models/test_rfdetr.py -x -q: 11/11 pass
- … _trt_reuse_as_input_buffer
- … class_mapping.to() / direct_copy_kernel_cuda launches per frame after W7
What closed out
Supersedes #23 (same work, earlier snapshot).
Known remaining bubbles
From nsys on vehicles_best:
- cudaStreamSynchronize is 80.9% of CUDA API time: the CPU waits for the GPU, and the GPU is not saturated. The serial per-frame dispatch is the next bottleneck.
- _rfdetr_fullpost_mask_kernel_compact is 46.5% of GPU time (30 µs/frame). Microbenchmark suggests this kernel can drop to ~35 µs with BLOCK_H=8, BLOCK_W=128, num_warps=2, but end-to-end FPS doesn't change while the CPU dispatch is serial; GPU time already fits inside the sync wait.
- … cv2.findContours off the critical path.