perf(rfdetr-seg): Triton fusion + CUDA graphs + depth-2 frame pipeline (109 → 241 FPS on vehicles_312px)#24
Open
aseembits93 wants to merge 7 commits into
… stream sync reduction
Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine,
example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes
below + enabling the existing CUDA-graph cache:
  Baseline (no changes)                              93.07 FPS
  + Triton preprocess (fused resize+BGR2RGB+norm)    ~93 FPS (U6)
  + U7 mask-decode skip for empty masks              ~94 FPS (flag-gated)
  + Triton postprocess conf-filter                   98.6 FPS (+5.9%)
  + ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True     102.1 FPS (+9.7%)
  + Drop pre/post stream syncs                       102.2 FPS (+9.8%)
Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs
cv2.resize rounding at mask boundaries). Unit tests pass (11/11).
Changes (all flag-gated; opt-in unless noted below):
inference_models/models/rfdetr/triton_preprocess.py (new)
One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet
normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA
kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.
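For reference, the per-pixel arithmetic that the fused kernel folds together can be sketched in plain Python. This is only an illustration, not the Triton kernel: the resize step is omitted, the function names are hypothetical, the CHW output layout is an assumption, and the mean/std values are the standard published ImageNet statistics.

```python
# ImageNet statistics in RGB order (standard published values).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_bgr_pixel(bgr):
    """Map one uint8 BGR pixel to a normalized RGB float triple."""
    b, g, r = bgr
    rgb = (r, g, b)  # BGR -> RGB channel swap
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def preprocess_hwc_to_chw(image):
    """image: H x W nested list of BGR pixels -> 3 x H x W normalized planes.

    Assumption: the engine expects channel-first (CHW) input.
    """
    height, width = len(image), len(image[0])
    planes = [[[0.0] * width for _ in range(height)] for _ in range(3)]
    for y in range(height):
        for x in range(width):
            for c, value in enumerate(normalize_bgr_pixel(image[y][x])):
                planes[c][y][x] = value
    return planes
```

Doing the swap, scale, and normalize in one pass per pixel is what lets a single kernel stand in for the separate elementwise launches.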
inference_models/models/rfdetr/triton_postprocess.py (new)
One Triton kernel fusing sigmoid + argmax-over-classes + class-remap +
confidence-threshold filter. Replaces ~14k small cub/torch kernels with
431 (1 per frame). Supports both per-class threshold vector and scalar,
with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.
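A plain-Python reference for what the fused filter computes per query may help when checking parity. It is a sketch under assumptions: the function names are hypothetical, the comparison is taken to be strict (`>`), and a per-class threshold vector is assumed to be indexed by the raw argmax index rather than the remapped id.

```python
import math


def sigmoid(x):
    """Logistic function; fine for the moderate logits used in this sketch."""
    return 1.0 / (1.0 + math.exp(-x))


def filter_queries(logits, threshold, class_mapping=None):
    """Return (query_idx, class_id, confidence) for queries above threshold.

    logits: list of per-query lists of class logits.
    threshold: a scalar, or a per-class list indexed by raw class index
               (an assumption of this sketch).
    class_mapping: optional list mapping raw class index -> output class id.
    """
    kept = []
    for query_idx, row in enumerate(logits):
        # Sigmoid is monotonic, so argmax over logits equals argmax over scores.
        best = max(range(len(row)), key=row.__getitem__)
        confidence = sigmoid(row[best])
        limit = threshold[best] if isinstance(threshold, (list, tuple)) else threshold
        if confidence > limit:
            class_id = class_mapping[best] if class_mapping is not None else best
            kept.append((query_idx, class_id, confidence))
    return kept
```

For example, `filter_queries([[0.0, 2.0], [-3.0, -2.0]], 0.5, class_mapping=[10, 20])` keeps only query 0, remapped to class 20.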
inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
- Wire the Triton preprocess fast-path in pre_process() with a guarded
dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
- Cache pre-allocated input buffer and normalization constants on model
instance on first call.
- Replace pre_process_stream.synchronize() with a CUDA event ev.wait()
on the inference stream so the CPU doesn't stall waiting for the
preprocessing Triton kernel to finish.
- Drop the post_process_stream.synchronize() (the adapter's subsequent
.cpu() calls provide the implicit sync).
inference_models/models/rfdetr/common.py
Wire the Triton postprocess conf-filter into
post_process_instance_segmentation_results. Falls back to torch path
when the model has no remapping table, is CPU-bound, or Triton is
unavailable.
inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
Same Triton preprocess kernel + dispatch for the legacy inference
package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS
default routes to inference_models adapters) but kept for parity so the
legacy path benefits if exercised.
inference/core/models/inference_models_adapters.py
GPU mask-decode fast-path (U7): compute per-mask emptiness with
.any(dim=(1,2)) on the GPU, then DtoH-copy and run cv2.findContours only
on the non-empty masks. Gated via RFDETR_GPU_POSTPROCESS=true (default on).
Produces identical output to the reference path.
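The emptiness skip can be illustrated without a GPU. In this sketch the names are hypothetical, `decode` stands in for the expensive contour extraction, and representing an empty mask's result as an empty list is an assumption.

```python
def non_empty_mask_indices(masks):
    """Return the indices of masks that contain at least one set pixel."""
    return [i for i, mask in enumerate(masks) if any(any(row) for row in mask)]


def decode_non_empty(masks, decode):
    """Run `decode` only on non-empty masks; empty masks yield an empty result."""
    keep = set(non_empty_mask_indices(masks))
    return [decode(mask) if i in keep else [] for i, mask in enumerate(masks)]
```

The saving comes from never copying or decoding masks whose result is known to be empty.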
Env vars introduced:
RFDETR_USE_TRITON_PREPROC=true opt-in; fused preproc kernel
RFDETR_TRITON_POSTPROC=true opt-in; fused postproc conf filter
RFDETR_GPU_POSTPROCESS=true default on; GPU mask emptiness skip
RFDETR_DISABLE_GPU_PREPROC=true opt-out; disable torch GPU preproc
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True enables existing TRT CUDA graph cache
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (W2)
Adds triton_fullpostproc.py with two fused Triton kernels that replace the
entire post-TRT chain for the common rfdetr-seg-nano path (batch=1, no
static crop, stretch-to resize, class remapping active):
_rfdetr_fullpost_filter_kernel (grid = num_queries)
sigmoid argmax + class remap + conf threshold + cxcywh->xyxy +
letterbox-denormalize + clip + round; atomic_add into counter to reserve
a compact output slot.
_rfdetr_fullpost_mask_kernel_compact (grid = num_queries * tile_y * tile_x)
GPU-side bilinear upsample 78x78 -> orig_h x orig_w + threshold > 0 +
uint8 emit. Early-exits on s >= counter[0] without an intermediate sync.
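The box math and the counter-based compaction can be emulated serially in Python. This is a sketch, not the kernel: function names are hypothetical, boxes are assumed to be normalized cxcywh, denormalization is assumed to be a plain scale by the original width and height, and the list `counter` stands in for the 1-element GPU counter. Python's built-in `round` uses round-half-to-even.

```python
def cxcywh_to_xyxy_pixels(box, orig_w, orig_h):
    """Normalized (cx, cy, w, h) -> clipped, rounded pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    corners = (
        (cx - w / 2) * orig_w,
        (cy - h / 2) * orig_h,
        (cx + w / 2) * orig_w,
        (cy + h / 2) * orig_h,
    )
    limits = (orig_w, orig_h, orig_w, orig_h)
    return tuple(round(min(max(v, 0.0), lim)) for v, lim in zip(corners, limits))


def compact_survivors(boxes, confidences, threshold, orig_w, orig_h):
    """Write surviving queries into consecutive output slots."""
    counter = [0]                  # stand-in for the device-side counter
    out = [None] * len(boxes)      # preallocated for the worst case
    for query_idx, (box, conf) in enumerate(zip(boxes, confidences)):
        if conf > threshold:
            slot = counter[0]      # an atomic add returns the previous value
            counter[0] += 1
            out[slot] = (query_idx, cxcywh_to_xyxy_pixels(box, orig_w, orig_h))
    return out, counter[0]
```

A second stage can then iterate over all slots and skip any slot index `s >= counter[0]`, which mirrors the early exit described above. In this serial emulation slots follow query order; on a GPU the slot order depends on block scheduling.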
Adapter (inference_models_adapters.py):
- New fast path keyed on _combined_gpu/_counter_gpu/_postproc_done_event
side-channels. The adapter stream waits on the done_event, pinned-DtoH's the
4-byte counter, syncs once to read n_survivors, then slices combined
and mask to n_survivors and pinned-DtoH's both asynchronously, syncing again.
- Replaces the prior in-Triton int(counter.item()) that CPU-blocked the
postproc stream. Same number of host-visible syncs (2), but the first
is a 4-byte DtoH instead of a stream drain, and both are on a dedicated
pinned path so the copy engine overlaps with the compute engine.
TRT graph plumbing (common/trt.py, rfdetr_instance_segmentation_trt.py):
- Records a produce_event on the graph's own stream so consumers can
wait_event instead of stream.synchronize(). Removes the unconditional
stream.synchronize() in infer_from_trt_engine's graph-replay branch.
- consumer_done_event field on TRTCudaGraphState lets the next graph
replay chain on the consumer's last use of the output buffers.
- _trt_reuse_as_input_buffer marker so fast preproc can write directly
into the graph's captured input buffer, eliminating the per-frame DtoD.
Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 engine):
v16 baseline (Triton preproc + postproc + CUDA graph) 150 FPS
+ triton_fullpost + deferred counter sync (this commit) 151 FPS
Parity vs v16 baseline: 0-diff across all 538 frames (bit-exact xyxy,
conf, class_id, and mask MD5 per detection).
Env flags:
RFDETR_TRITON_FULLPOSTPROC=true opt-in; enables the full-fusion path
Two per-frame CUDA kernel launches visible in nsys on the v16 full-postproc
path that shouldn't be there:
- direct_copy_kernel_cuda (538 per 538-frame run on vehicles_312px)
- vectorized_elementwise_kernel<FillFunctor<int>> (538 / 538)
direct_copy was class_mapping.to(dtype=torch.int32) firing every frame —
upstream stores the mapping as int64, our Triton kernel needs int32, and
the wrapper re-converts on every call since the dtype check always fails.
Cache the converted view keyed by id(source_tensor).
FillFunctor was torch.zeros((1,), ...) for the atomic counter + torch.empty
for the three output scratch buffers. Moving to a persistent scratch cache
keyed on (num_queries, device) drops 3 torch.empty allocator calls per
frame and replaces torch.zeros with an explicit counter.zero_() (still
launches FillFunctor — no safe way to inline into the filter kernel since
concurrent blocks would race with the zero — but eliminates allocator
pressure and stabilizes pointer values for the Triton JIT cache).
After W7 the per-frame kernel launch count drops from 2 incidental-torch
kernels to 1, the 3 allocator calls are eliminated, and the adapter sees
stable-address scratch across frames (latent prerequisite for CUDA-graph
capture of the postproc path).
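The two caches follow a common pattern that can be sketched in plain Python. The names here are hypothetical, `allocate` and `convert` stand in for the tensor allocation and the int64 -> int32 conversion, and keeping a reference to the source object next to the converted value (to guard against `id()` reuse after garbage collection) is a design choice of this sketch rather than something the commit states.

```python
_scratch_cache = {}
_mapping_cache = {}


def get_scratch(num_queries, device, allocate):
    """Return persistent scratch buffers for (num_queries, device)."""
    key = (num_queries, device)
    buffers = _scratch_cache.get(key)
    if buffers is None:
        buffers = allocate(num_queries)  # runs once per key, not once per frame
        _scratch_cache[key] = buffers
    return buffers


def get_converted_mapping(source, convert):
    """Return the converted mapping, converting only on first sight of `source`."""
    key = id(source)
    entry = _mapping_cache.get(key)
    # Holding `source` in the entry keeps it alive, so its id cannot be reused.
    if entry is None or entry[0] is not source:
        entry = (source, convert(source))
        _mapping_cache[key] = entry
    return entry[1]
```

Because the same buffer objects come back on every call, their addresses stay stable across frames.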
Impact:
- direct_copy: 538 -> 0 (-100%)
- FillFunctor: 538 -> 538 (unchanged; counter.zero_ still required)
- torch.empty calls: 3/frame -> 0
- Parity: 0-diff vs v16 best across 538 frames of vehicles_312px.
- End-to-end FPS: 150 -> 151 (noise-level; serial CPU dispatch is the
binding constraint, not mask kernel GPU time).
…312px)
Opt-in depth-2 ping-pong of Triton post-processing scratch buffers so the
adapter can drain frame N-1's results while frame N's kernels are in
flight. Gated behind RFDETR_PIPELINE_DEPTH=2; default 1 preserves the
current single-slot behavior bit-exactly.
Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 TRT):
depth=1 (default): 151 FPS (0-diff parity vs baseline)
depth=2 (opt-in): 241 FPS (+59.5% vs depth=1, +121% vs 109 FPS raw baseline)
Parity on depth=2:
- Frame 0: empty response (warmup; no prior frame to drain)
- Frames 1..N: bit-exact match to frames 0..N-1 of the depth=1 capture
(0 mismatches across 537 shifted pairs on vehicles_312px)
- Response N+1: never emitted, so the last frame's detections are dropped
(they stay in the pending slot; no subsequent call drains them)
Semantic changes (opt-in only):
The InferencePipeline sink receives `on_prediction(predictions_N-1,
video_frame_N)` — a 1-frame skew between detections and their paired
image. At 241 FPS the skew is ~4 ms, imperceptible for annotation /
dashboards / analytics. For strict frame-detection pairing, keep
depth=1.
Files:
triton_fullpostproc.py
- _get_scratch_buffers + _get_mask_bin_buffer now keyed by
(num_queries, device, slot_idx). slot_idx=0 is the legacy single
slot; slot_idx in {0, 1} for depth=2.
- triton_rfdetr_fullpost accepts `num_slots` (default 1). When > 1 a
module-level counter rotates slot assignment per call so the Triton
kernels for frame N write to a different slot than the one the
adapter is still reading for frame N-1.
common.py
- Reads RFDETR_PIPELINE_DEPTH and passes it as num_slots to
triton_rfdetr_fullpost.
inference_models_adapters.py
- Factored the per-det response-build into _build_response_from_det.
- Depth-2 branch: push current det+meta as pending, drain and return
previous pending. First call returns an empty response.
- Depth-1 path unchanged; passes the same _build_response_from_det
helper so both paths produce bit-exact output for a given det.
The pipelining works because the Triton postproc kernels and their DtoH
now run on their own ping-pong slot while the TRT engine + preproc for
the next frame kick off on the default / preproc streams. The binding
constraint before W1 was `cudaStreamSynchronize` at 80.9% of CUDA API
time: CPU was idle waiting for GPU. depth=2 overlaps that wait with the
next frame's GPU work.
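The slot rotation itself is small enough to sketch in Python. The class and function names are hypothetical; the only behavior taken from the description is that a per-call counter rotates the slot index and that scratch is keyed by (num_queries, device, slot_idx).

```python
class SlotRotator:
    """Hands out slot indices 0..num_slots-1 in round-robin order."""

    def __init__(self, num_slots=1):
        if num_slots < 1:
            raise ValueError("num_slots must be >= 1")
        self.num_slots = num_slots
        self._calls = 0

    def next_slot(self):
        slot = self._calls % self.num_slots
        self._calls += 1
        return slot


def scratch_key(num_queries, device, slot_idx):
    """Cache key for one slot's scratch buffers."""
    return (num_queries, device, slot_idx)
```

With `num_slots=2`, consecutive calls alternate between slots 0 and 1, so frame N never writes the buffers still being read for frame N-1. With `num_slots=1` every call gets slot 0, which is the single-slot behavior.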
Summary
Standalone PR: RF-DETR nano seg TensorRT e2e latency optimizations through to a depth-2 frame pipeline, benchmarked on `vehicles_312px.mp4` (538 frames, 312×176) with the minimal InferencePipeline benchmark script.

- Raw baseline: 109 FPS
- depth=1 (default): 151 FPS
- depth=2 (`RFDETR_PIPELINE_DEPTH=2`): 241 FPS

All new paths are opt-in behind env vars; defaults preserve current behavior (except for the two always-on CPU-side fixes noted below).
Benchmark:

    python development/stream_interface/rfdetr_nano_seg_trt_workflow.py \
      --video_reference /home/ubuntu/inference/vehicles_312px.mp4

Flags to enable (depth=1, matches 151 FPS row):

    RFDETR_USE_TRITON_PREPROC=true
    RFDETR_TRITON_FULLPOSTPROC=true
    ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true

Flags to enable (depth=2, matches 241 FPS row): above three plus `RFDETR_PIPELINE_DEPTH=2`.

Commit-by-commit
- `419c5a7` initial point
- `9ec6a50` fused Triton pre/postproc kernels, GPU mask decode, stream sync reduction
- `807ea1a` replace benchmark script with minimal InferencePipeline-based version
- `5ff27d1` full Triton postproc fusion + deferred counter sync (W2). Kernels: `_rfdetr_fullpost_filter_kernel` + `_rfdetr_fullpost_mask_kernel_compact`. Removes the CPU-blocking `counter.item()` in the Triton wrapper; the counter is now pinned-DtoH'd from the adapter under a `torch.cuda.Event` guard. Includes atomic-counter compaction, the `_trt_reuse_as_input_buffer` marker (eliminates the per-frame DtoD from preproc output → graph input buffer), pinned async DtoH, and cross-stream event plumbing (no more `stream.synchronize()` in the graph-replay branch).
- `c1406a8` cache per-call scratch + class_mapping int32 view (W7). Eliminates the per-frame `direct_copy_kernel_cuda` (from `class_mapping.to(dtype=torch.int32)` every frame, now cached by `id()`) and the three `torch.empty` allocator calls per frame (now a persistent scratch cache keyed on `(num_queries, device)`).
- `1b169b7` depth-2 frame pipeline (151 → 241 FPS). Gated behind `RFDETR_PIPELINE_DEPTH=2`; default 1 is bit-exact-unchanged.

What's in
New Triton kernels
- `inference_models/models/rfdetr/triton_preprocess.py`: fused stretch-to resize + BGR→RGB + /255 + ImageNet normalize. Replaces ~8 torch CUDA kernels with 1.
- `inference_models/models/rfdetr/triton_postprocess.py`: fused sigmoid + argmax-over-classes + class-remap + confidence-threshold filter. Used when `RFDETR_TRITON_POSTPROC=true` alone.
- `inference_models/models/rfdetr/triton_fullpostproc.py`: full post-process fusion:
  - `_rfdetr_fullpost_filter_kernel`: per-query sigmoid argmax + class remap + conf threshold + cxcywh→xyxy + letterbox-denormalize + clip + banker's rounding; `tl.atomic_add(counter, 1)` reserves a compact output slot.
  - `_rfdetr_fullpost_mask_kernel_compact`: GPU-side bilinear upsample 78×78 → orig_h × orig_w + threshold > 0 + uint8 emit. Reads the counter on GPU for early-exit, so there is no CPU sync between the two launches.
  - … `counter` + `done_event`; the adapter does the host-side slice after a single pinned counter DtoH.
  - `_get_scratch_buffers` / `_get_mask_bin_buffer` keyed by `(num_queries, device, slot_idx)`; `triton_rfdetr_fullpost` accepts `num_slots` (default 1 = legacy single-slot).
- `inference/models/rfdetr/triton_preprocess.py`: same preprocess kernel wired into the legacy `inference/models/rfdetr` path for parity.

Wiring / dispatch
- `inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py`
  - Triton preprocess fast path in `pre_process()` with strict guard (STRETCH_TO, numpy HWC BGR uint8, no static crop).
  - `_trt_reuse_as_input_buffer` marker: graph capture bakes this tensor's address as the engine's input buffer, eliminating the per-frame 1.1 MB DtoD copy.
  - Pinned host source buffer (`_fast_src_host_pinned`) so `copy_(non_blocking=True)` actually runs async.
  - CUDA event (`_fast_preproc_event`) instead of `pre_process_stream.synchronize()`.
- `inference_models/models/rfdetr/common.py`: dispatches `RFDETR_TRITON_FULLPOSTPROC` first (batch=1, no static crop, no nonsquare-intermediate resize, class remapping active), falls back to the Triton conf-filter, then to torch. Reads `RFDETR_PIPELINE_DEPTH` and passes `num_slots=RFDETR_PIPELINE_DEPTH` to `triton_rfdetr_fullpost`.
- `inference_models/models/common/trt.py`:
  - `TRTCudaGraphState.consumer_done_event` lets the next graph replay wait on the consumer's last use of the output buffers (avoids DtoD clones of output buffers).
  - `_capture_cuda_graph(use_pre_processed_images_as_input_buffer=...)` supports capturing with an externally-owned input tensor.
  - Removes the unconditional `stream.synchronize()` in `infer_from_trt_engine`'s graph-replay branch; replaced with a `produce_event` recorded on the graph's own stream.
- `inference/core/models/inference_models_adapters.py`
  - Fast path keyed on the `_combined_gpu` / `_counter_gpu` / `_postproc_done_event` side channels: waits on the done_event, pinned-DtoH's the 4-byte counter, syncs once to read `n_survivors`, then slices `combined[:n]` + `mask[:n]` and pinned-DtoH's both async with a second sync. Bitcasts column 4 of the combined int32 buffer to fp32 via `numpy.view(np.float32)` for confidence.
  - Per-det response build factored into `_build_response_from_det` (used by both depth=1 and depth=2 code paths for bit-exact responses).
  - Depth-2 branch: pushes `(det, H, W, class_filter)` as `_pipeline_pending` so the scratch slot stays alive; drains the previous pending (waits on its `done_event`, pinned-DtoH counter+combined+mask, builds response, returns frame N-1's response). First call returns an empty response.

Scratch caching (W7)
- `_SCRATCH_CACHE`: combined/survivor_idx/mask_any/counter tensors reused across frames, keyed by `(num_queries, device)` (+ `slot_idx` for depth=2).
- `_CLASS_MAPPING_INT32_CACHE`: int32 view of the class_mapping tensor cached by `id(source_tensor)`; previously re-converted every frame because upstream stores it as int64.

Benchmark harness
- `development/stream_interface/rfdetr_nano_seg_trt_workflow.py`: replaced with a minimal 116-line InferencePipeline-based driver. Uses the `roboflow_core/roboflow_instance_segmentation_model@v3` workflow block; single video source, single sink, no annotators/buffering/rate-limiting. The `--backend` flag sets `DISABLED_INFERENCE_MODELS_BACKENDS` before import for unambiguous backend attribution.

Env vars (all opt-in)
- `RFDETR_USE_TRITON_PREPROC`
- `RFDETR_TRITON_POSTPROC`
- `RFDETR_TRITON_FULLPOSTPROC`
- `RFDETR_GPU_POSTPROCESS`
- `RFDETR_DISABLE_GPU_PREPROC`
- `ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND`: enables the existing `TRTCudaGraphCache`
- `RFDETR_PIPELINE_DEPTH`: default `1`; set to `2` to enable the depth-2 frame pipeline (see caveats below)

How depth-2 works
Before depth-2, the inference thread was serial per frame. `cudaStreamSynchronize` was 80.9% of CUDA API time per nsys: the CPU sat idle waiting for GPU work to complete, then blocked on DtoH.

Depth-2 ping-pongs the Triton postproc scratch buffers between two slots so the adapter can drain frame N-1's results while frame N's kernels are launching into the other slot. The `cudaStreamSynchronize` cost is now overlapped with the next frame's GPU work instead of blocking the workflow thread.

Per-frame flow (depth=2):

1. Frame N's kernels launch into slot `N % 2`.
2. `postprocess()` receives frame N's detections.
3. Stores `(det, H, W, class_filter)` as the new `_pipeline_pending`; the refs keep slot `N % 2` alive.
4. Drains the previous pending (slot `(N-1) % 2`): waits on its `done_event`, pinned-DtoH counter+combined+mask, builds response.

The scratch slot that frame N writes is different from the slot frame N-1 is reading, so the two frames' work genuinely runs in parallel on the GPU.
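The per-frame flow above reduces to a "push current, return previous" pattern, sketched here in plain Python. The class name and the `flush()` method are hypothetical (this PR does not plumb a flush), `build_response` stands in for the response builder, and an empty list is assumed as the empty response.

```python
class DepthTwoAdapter:
    """Returns frame N-1's response while frame N's work is still pending."""

    def __init__(self, build_response):
        self._build_response = build_response
        self._pending = None  # detections of the most recent frame

    def postprocess(self, det):
        # Stash the current frame, hand back the previously stashed one.
        previous, self._pending = self._pending, det
        if previous is None:
            return []  # first call: nothing to drain yet
        return self._build_response(previous)

    def flush(self):
        """Hypothetical teardown hook: drain the last pending frame, if any."""
        previous, self._pending = self._pending, None
        return [] if previous is None else self._build_response(previous)
```

Feeding frames `f0, f1, f2` yields an empty response, then `f0`'s and `f1`'s responses; without a call such as `flush()`, `f2`'s detections are never returned.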
Caveats (depth=2 only, clearly gated)
- First call returns an empty response (warmup; no prior frame to drain).
- 1-frame skew between detections and their paired image in the sink callback (`on_prediction`).
- The final frame's detections are dropped.

At 241 FPS the 1-frame skew is ~4 ms between a detection's underlying pixels and the `video_frame` it's paired with in the sink callback. Imperceptible for annotation rendering, dashboards, downstream analytics. For strict frame-detection pairing (e.g. training label generation), keep depth=1.

The final-frame drop is a real loss (one `Prediction` per stream). If that matters, an explicit `flush()` method could be added on adapter teardown; it is not plumbed in this PR because `InferencePipeline` doesn't currently call into the model on shutdown.

Parity verification
- Last frame of the depth=1 capture (1 detection, `c2:0.9024:68,153,105,176`): dropped in depth=2 (stream ended before the next call could drain its pending slot).

Test plan
- `pytest tests/inference/unit_tests/models/test_rfdetr.py -x -q`: 11/11 pass
- … `_trt_reuse_as_input_buffer`.
- … `class_mapping.to()` / `direct_copy_kernel_cuda` launches per frame after W7.
- … `len(detections_list) == 1`; larger batches fall through to depth=1.

Relationship to #22
This PR now subsumes #22 — it contains all the commits from #22 plus the depth-2 pipeline commit on top. Merging this PR makes #22 redundant; #22 can be closed once this lands. If you prefer to land the 151-FPS work first without depth-2, #22 is still the smaller PR and this one can be rebased on main afterward.