perf(rfdetr-seg): Triton fusion + pipelining cuts post/pre-proc bubbles (93 → 122 FPS, +31.5%)#23
Closed
aseembits93 wants to merge 12 commits into
… stream sync reduction
Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine,
example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes
below + enabling the existing CUDA-graph cache:
  Baseline (no changes)                             93.07 FPS
  + Triton preprocess (fused resize+BGR2RGB+norm)   ~93 FPS    (U6)
  + U7 mask-decode skip for empty masks             ~94 FPS    (flag-gated)
  + Triton postprocess conf-filter                  98.6 FPS   (+5.9%)
  + ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True    102.1 FPS  (+9.7%)
  + Drop pre/post stream syncs                      102.2 FPS  (+9.8%)
Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs
cv2.resize rounding at mask boundaries). Unit tests pass (11/11).
Changes (all flag-gated; opt-in unless noted otherwise):
inference_models/models/rfdetr/triton_preprocess.py (new)
One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet
normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA
kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.
inference_models/models/rfdetr/triton_postprocess.py (new)
One Triton kernel fusing sigmoid + argmax-over-classes + class-remap +
confidence-threshold filter. Replaces ~14k small cub/torch kernels with
431 (1 per frame). Supports both per-class threshold vector and scalar,
with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.
inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
- Wire the Triton preprocess fast-path in pre_process() with a guarded
dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
- Cache pre-allocated input buffer and normalization constants on model
instance on first call.
- Replace pre_process_stream.synchronize() with a CUDA event ev.wait()
on the inference stream so the CPU doesn't stall waiting for the
preprocessing Triton kernel to finish.
- Drop the post_process_stream.synchronize() (the adapter's subsequent
.cpu() calls provide the implicit sync).
inference_models/models/rfdetr/common.py
Wire the Triton postprocess conf-filter into
post_process_instance_segmentation_results. Falls back to torch path
when the model has no remapping table, is CPU-bound, or Triton is
unavailable.
inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
Same Triton preprocess kernel + dispatch for the legacy inference
package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS
default routes to inference_models adapters) but kept for parity so the
legacy path benefits if exercised.
inference/core/models/inference_models_adapters.py
GPU mask-decode fast-path (U7): reduce mask emptiness with .any(dim=(1,2))
on GPU, only DtoH + cv2.findContours non-empty masks. Gated via
RFDETR_GPU_POSTPROCESS=true (default on). Produces identical output to
the reference path.
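As a rough CPU reference for the fused preprocess described above (stretch-to resize + BGR->RGB + /255 + ImageNet normalize), the per-element math can be sketched in pure Python. This is not the Triton kernel. The function name `preprocess_reference`, the half-pixel-center bilinear sampling, and the mean/std constants are assumptions made for illustration.

```python
# Pure-Python reference of the per-element math the fused preprocess performs.
# ASSUMPTIONS: half-pixel-center bilinear sampling and the standard ImageNet
# mean/std in RGB order; the real kernel may use different conventions.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def _clamp(v, lo, hi):
    return max(lo, min(hi, v))


def preprocess_reference(img_bgr, out_h, out_w):
    """img_bgr: H x W x 3 nested lists of uint8 values (BGR).
    Returns a 3 x out_h x out_w nested list of floats (RGB, normalized)."""
    in_h, in_w = len(img_bgr), len(img_bgr[0])
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(3)]
    for oy in range(out_h):
        # Stretch-to resize: each axis is scaled independently, no letterbox.
        sy = _clamp((oy + 0.5) * in_h / out_h - 0.5, 0.0, in_h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        for ox in range(out_w):
            sx = _clamp((ox + 0.5) * in_w / out_w - 0.5, 0.0, in_w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            for c_rgb in range(3):
                c_bgr = 2 - c_rgb  # BGR -> RGB channel swap
                top = img_bgr[y0][x0][c_bgr] * (1 - wx) + img_bgr[y0][x1][c_bgr] * wx
                bot = img_bgr[y1][x0][c_bgr] * (1 - wx) + img_bgr[y1][x1][c_bgr] * wx
                v = (top * (1 - wy) + bot * wy) / 255.0
                out[c_rgb][oy][ox] = (v - IMAGENET_MEAN[c_rgb]) / IMAGENET_STD[c_rgb]
    return out
```

A reference like this is mainly useful for parity checks: running it and the GPU path on the same frame shows whether any drift comes from the resize step or from the normalization.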
Env vars introduced:
RFDETR_USE_TRITON_PREPROC=true opt-in; fused preproc kernel
RFDETR_TRITON_POSTPROC=true opt-in; fused postproc conf filter
RFDETR_GPU_POSTPROCESS=true default on; GPU mask emptiness skip
RFDETR_DISABLE_GPU_PREPROC=true opt-out; disable torch GPU preproc
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True enables existing TRT CUDA graph cache
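One way the flags listed above could be read is sketched below. The helper names `env_flag` and `rfdetr_flags`, the dictionary keys, and the set of accepted truthy strings are hypothetical. Only the variable names and their stated defaults come from this message.

```python
import os

# ASSUMPTION: which strings count as "true" is not specified in the message.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Read a boolean flag from the environment (hypothetical helper)."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


def rfdetr_flags(environ=None):
    """Collect the flags with the defaults described in the list above."""
    return {
        "triton_preproc": env_flag("RFDETR_USE_TRITON_PREPROC", False, environ),
        "triton_postproc": env_flag("RFDETR_TRITON_POSTPROC", False, environ),
        "gpu_postprocess": env_flag("RFDETR_GPU_POSTPROCESS", True, environ),
        "disable_gpu_preproc": env_flag("RFDETR_DISABLE_GPU_PREPROC", False, environ),
        "trt_cuda_graphs": env_flag(
            "ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND", False, environ
        ),
    }
```

Passing a plain dict as `environ` keeps the sketch testable without touching the real process environment.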
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s (+3.3% FPS vs partial)
Experimental follow-up to the shipped partial-fusion PR. Replaces the whole
post-TRT chain (sigmoid, argmax over classes, class remap, conf threshold,
xywh->xyxy, denorm to orig image, padding subtract, scale divide, clip,
mask bilinear upsample, threshold>0) with two Triton kernel launches.
Results (5-run avg on Tesla T4, example_video, 431 frames):
Baseline 93.07 FPS
Partial fusion (RFDETR_TRITON_POSTPROC) 102.74 FPS (+10.4%)
FULL fusion v1 (RFDETR_TRITON_FULLPOSTPROC) 101.07 FPS (slower than partial!)
FULL fusion v2 (compact mask kernel) 106.05 FPS (+13.9% vs baseline, +3.3% vs partial)
Why v1 was slower:
The fixed-grid "run mask kernel on all 100 queries, zero out filtered rows"
approach did 30x more GPU work than torch because only 3-7 masks typically
survive filtering out of 100, but the kernel processed all of them (10.2M
pixels/frame vs torch's ~300k).
v2 fix:
Added `_rfdetr_fullpost_mask_kernel_compact` which takes an explicit
(n_survivors,) survivor_idx tensor. Host uses keep.nonzero() + a small
DtoH sync to size the kernel launch. Trades one tiny sync (<5 us) for
30x less compute in the mask upsample.
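The work ratio quoted above can be checked with a little arithmetic. The 320x320 output size is an assumption, picked only because it reproduces the ~10.2M pixels/frame figure. The message does not state the frame resolution.

```python
# ASSUMPTION: 320x320 output masks (100 * 320 * 320 = 10.24M pixels).
NUM_QUERIES = 100
FRAME_H, FRAME_W = 320, 320


def mask_pixels(n_masks, h=FRAME_H, w=FRAME_W):
    """Output pixels written when upsampling n_masks masks to h x w."""
    return n_masks * h * w


def fixed_grid_overhead(n_survivors):
    """How many times more pixels the fixed-grid (v1) kernel touches."""
    return mask_pixels(NUM_QUERIES) / mask_pixels(n_survivors)


for survivors in (3, 7):
    print(survivors, mask_pixels(survivors), round(fixed_grid_overhead(survivors), 1))
```

Under this assumption, three survivors give about 307k pixels per frame, in line with the ~300k figure above, and the fixed grid does roughly 33x that work. With seven survivors the ratio drops to roughly 14x.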
nsys kernel counts (per 431-frame run):
                    total_k   gpu_ms   mask_kernel_ms
  Partial           117,973     2011             10.9
  Full v1 (bad)      10,394      393            342.0
  Full v2 (good)     12,315       82              3.5
Parity: 4/431 frames differ by +/-1 detection vs baseline, same 4 frames as
the shipped partial fusion (Triton bilinear vs torchvision antialias at mask
edges). Unit tests pass (11/11).
Files:
inference_models/models/rfdetr/triton_fullpostproc.py (new)
_rfdetr_fullpost_filter_kernel: num_queries programs; per-query fuses
sigmoid(max(logits)) + argmax + class remap + conf threshold + box
denorm + xywh->xyxy + pad subtract + scale divide + clip + store
padded outputs (keep flag for filtered rows).
_rfdetr_fullpost_mask_kernel_compact: n_survivors * tile_y * tile_x
programs; per-tile bilinear upsample 78x78 -> orig_h x orig_w with
survivor_idx lookup, threshold > 0, store as uint8.
inference_models/models/rfdetr/common.py (modified)
New fast path at top of post_process_instance_segmentation_results
gated by RFDETR_TRITON_FULLPOSTPROC=true. Guards: batch=1,
nonsquare_intermediate_size is None, static_crop_offset is (0,0),
classes_re_mapping is not None (rfdetr-seg-nano default).
Env var: RFDETR_TRITON_FULLPOSTPROC=true
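The filter stage can be sketched as a CPU reference. Because sigmoid is monotonic, taking the max logit first and applying one sigmoid gives the same confidence as applying sigmoid to every class and then taking the max. The name `filter_queries`, the strict `>` comparison, and indexing a per-class threshold by the raw (pre-remap) class id are assumptions. Box decoding is omitted here.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def filter_queries(logits, threshold, class_remap=None):
    """logits: one list of class logits per query.
    threshold: scalar, or a per-class list indexed by raw class id (assumption).
    Returns (keep, conf, class_id), each with one entry per query."""
    keep, conf, cls = [], [], []
    for row in logits:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        score = sigmoid(row[best])                        # one sigmoid per query
        thr = threshold[best] if isinstance(threshold, (list, tuple)) else threshold
        mapped = class_remap[best] if class_remap is not None else best
        keep.append(score > thr)  # ASSUMPTION: strict comparison
        conf.append(score)
        cls.append(mapped)
    return keep, conf, cls
```

The outputs keep a fixed length of `num_queries`, matching the padded layout described for the filter kernel, with `keep` marking which rows survive.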
Notes / future work:
- The shipped `RFDETR_TRITON_POSTPROC` partial-fusion flag is still useful
as a simpler, more-portable fallback. This full-fusion path covers more
ground but has tighter config guards.
- The key lesson: "fuse into fixed-shape padded output" is only a win when
the kernel's cost scales with inputs (the filter step), NOT outputs (the
mask resize). Per-output-pixel kernels need compaction.
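The compaction point in the notes above can be illustrated with a small CPU sketch: build `survivor_idx` first, then upsample only those rows, so the per-pixel work scales with the number of survivors. The function names and the half-pixel-center sampling convention are assumptions, not the kernel's actual code.

```python
def bilinear_sample(mask, y, x):
    """Bilinearly sample a 2D nested list at fractional (y, x), clamped to edges."""
    h, w = len(mask), len(mask[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = mask[y0][x0] * (1 - wx) + mask[y0][x1] * wx
    bot = mask[y1][x0] * (1 - wx) + mask[y1][x1] * wx
    return top * (1 - wy) + bot * wy


def upsample_survivors(mask_logits, keep, out_h, out_w):
    """Upsample and binarize only the masks whose keep flag is set."""
    survivor_idx = [q for q, k in enumerate(keep) if k]  # compaction step
    out = []
    for q in survivor_idx:  # loop length is n_survivors, not num_queries
        src = mask_logits[q]
        in_h, in_w = len(src), len(src[0])
        rows = []
        for oy in range(out_h):
            sy = (oy + 0.5) * in_h / out_h - 0.5
            rows.append([
                1 if bilinear_sample(src, sy, (ox + 0.5) * in_w / out_w - 0.5) > 0 else 0
                for ox in range(out_w)
            ])
        out.append(rows)
    return survivor_idx, out
```

The filter step, by contrast, always touches every query, which is why a fixed-shape padded output costs nothing extra there.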
…on (+1.8% FPS)
Follow-up optimizations on the full-postproc fusion to eliminate the
downstream plumbing kernels still visible in nsys after PR #23.
Results (5-run avg on Tesla T4, example_video, 431 frames):
  Baseline                          93.07 FPS
  Partial fusion (PR #22)          102.74 FPS (+10.4%)
  Full fusion v2 (PR #23 initial)  106.05 FPS (+13.9%)
  Full fusion v3 (this commit)     107.94 FPS (+15.9%)
Changes:
1. `triton_fullpostproc.py`:
- Filter kernel: rounds + casts xyxy to int32 inside the kernel (replaces a
  downstream `.round().int()` elementwise kernel).
- Mask kernel: accumulates per-survivor `mask_any` via tile-level
  `tl.atomic_max`, eliminating a downstream `det.mask.any(dim=(1,2))` reduce.
  Writes to compact `(n_survivors, H, W)` output instead of padded
  `(num_queries, H, W)`, which removes the 20 ms/run `torch.zeros` fill kernel.
- Host wrapper: returns compact tensors directly (xyxy_int32, conf, cls_id,
  mask_bin uint8, mask_any bool), all sized to n_survivors. The single
  `keep.nonzero()` remains as the only DtoH sync in the fused path.
2. `common.py`:
- Build `InstanceDetections` with compact tensors directly; skip the
  `xyxy[keep].round().int()` / `.bool()` / `[keep]` indexing chain that added
  5 kernels per frame.
- Attach precomputed `mask_any` onto the detections object via `__dict__` so
  the adapter can use it without recomputation.
3. `inference_models_adapters.py`:
- Use `getattr(det, "mask_any", None)` from the full-fusion path when
  available; fall back to `.any(dim=(1,2))` otherwise.
nsys kernel count shrinkage (per 431-frame run):
             total_k   legacy_pp   pp_ms
  FULL v2     12,315      10,900   60ms
  FULL v3      7,026       4,669   17ms
  (-43% total, -57% pp, -72% pp_ms)
Specifically eliminated:
- DeviceSelectSweep/Reduce/CompactInit triple: 2155 -> 431 each (5 nonzero()
  calls per frame collapsed to 1).
- FillFunctor<uint8>: 431 -> 0 (no more padded-output torch.zeros).
- or_kernel_cuda reduce: 314 -> 0 (mask_any precomputed by Triton).
- round_kernel_cuda: 314 -> 0 (xyxy rounded inside Triton kernel).
- direct_copy_kernel: 1804 -> ~2746 total (slight increase due to
  index_select on scalar fields, but they're tiny).
Parity: 4/431 frames differ by +/-1 detection vs baseline, the same 4 frames
as PR #22 and PR #23 v2, with no new divergence. Unit tests pass (11/11).
aseembits93
pushed a commit
that referenced
this pull request
Apr 29, 2026
…op of v3)
Results (5-run avg, Tesla T4):
  Baseline              93.07 FPS
  PR #23 v3            107.94 FPS (+15.9% vs baseline)
  + this commit (v8)   119.0 FPS  (+27.9% vs baseline, +10.3% vs v3)
Changes:
1. Atomic-counter compaction in filter kernel (_rfdetr_fullpost_filter_kernel):
- Added `counter` param (int32 atomic). Each surviving query does
  `slot = tl.atomic_add(counter_ptr, 1)` and writes its outputs to
  compact[slot], skipping filtered queries entirely.
- Added `survivor_idx_out_ptr`: filter kernel records pid->slot mapping so
  the mask kernel can gather the right input row without a separate
  nonzero() pass on the host.
- Added `mask_any_out_ptr`: filter kernel zeroes mask_any[slot] at write
  time; mask kernel then atomic_max's up to 1 only if any pixel survives
  the threshold. Removes a torch.zeros launch per frame.
- Result: collapsed ~13 downstream kernels (keep.bool, nonzero's 3-cub
  pipeline, 3 index_select gathers, torch.zeros, .bool() cast, etc.) into
  one kernel.
- Output ordering is non-deterministic across survivors (atomic-add
  ordering), but downstream doesn't require sorted order; we already
  skipped torch.sort in earlier work.
2. Removed TRT output-buffer clone (trt.py:711):
- Graph replay writes to fixed output buffers every frame. We used to
  .clone() 3 times per frame (3 DtoD memcpys) to protect postproc readers
  from the next frame's replay overwriting data.
- Replaced with an explicit consumer_done cuda.Event on TRTCudaGraphState:
  postproc records when it's done reading, next TRT replay .wait()s on the
  event before submitting. Zero DtoD clones, proper cross-stream ordering.
3. Removed the outer stream.synchronize() in infer_from_trt_engine:
- Was blocking CPU after every TRT forward. Replaced with a cuda.Event
  recorded on the TRT graph stream; postproc consumer .wait()s on that
  event from the post_process_stream. Forward returns to CPU immediately.
4. Cached threshold tensor (triton_fullpostproc._prepare_threshold):
- Was allocating `torch.tensor([thr])` on GPU every frame (4B HtoD).
- Now cached per (value, device), so there is a single HtoD across the run.
5. Adapter simplification (inference_models_adapters.py):
- Removed the GPU-side mask_any filter + masked_select + nonempty_idx
  DtoH/HtoD roundtrip. det.mask from the Triton path is already compact
  (n_survivors rows); masks2poly fast-skips empty masks internally via
  np.any, so CPU-side filtering is adequate.
- Removed 3 cub kernels (DeviceReduce/Compact/Select) + 1 gather kernel +
  1 torch.as_tensor HtoD per frame.
Parity: 4/431 frames differ by +/-1 detection vs baseline, the same 4 frames
as prior PR (Triton bilinear vs torchvision antialias). Unit tests pass
(11/11).
nsys per-frame postproc bubble ratio: 83% (v3) -> 35% (v8).
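The atomic-counter compaction in item 1 can be mimicked on the CPU with threads, as a sketch of the pattern rather than the kernel itself. Each "program" that survives grabs the next free slot with a fetch-and-add and writes there. The `AtomicCounter` class and `compact_with_atomic_counter` function are hypothetical stand-ins for the device counter and the filter kernel.

```python
import threading


class AtomicCounter:
    """Lock-based stand-in for an int32 device counter."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, n=1):
        """Return the old value and add n, like an atomic add."""
        with self._lock:
            old = self._value
            self._value += n
            return old

    @property
    def value(self):
        return self._value


def compact_with_atomic_counter(keep, values):
    counter = AtomicCounter()
    compact = [None] * len(keep)      # sized for the worst case
    survivor_idx = [-1] * len(keep)

    def program(pid):
        if not keep[pid]:
            return                    # filtered queries write nothing
        slot = counter.fetch_add(1)   # claim the next compact slot
        compact[slot] = values[pid]
        survivor_idx[slot] = pid      # lets a later pass gather row `pid`

    threads = [threading.Thread(target=program, args=(p,)) for p in range(len(keep))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    n = counter.value
    return n, compact[:n], survivor_idx[:n]
```

Slot order depends on thread scheduling, which mirrors the non-deterministic output ordering noted above. The set of survivors and the pairing between `compact[slot]` and `survivor_idx[slot]` stay consistent.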
…Triton kernels
The filter->mask bubble in nsys was 146us per frame, caused by
`int(counter.item())` blocking the host between the two Triton launches.
Fix: launch the mask kernel with `grid=(num_queries, tile_y, tile_x)` and
have each program read `counter[0]` on the GPU and early-exit if its `s`
index exceeds `n_survivors`. Host reads counter only AFTER the mask kernel
is queued, letting the two kernels pipeline on the stream.
Tradeoffs:
- Mask kernel runtime grew 27us -> 112us (scheduler iterates ~100 programs,
  most early-exit). Net gain: ~61us/frame.
- Still need one DtoH (4 bytes) at the very end to get the final
  n_survivors for tensor views. That's the single remaining mandatory sync.
Also cached the mask_bin tensor at (num_queries, H, W) on module scope to
skip the per-frame torch.empty() launch (~20us saved).
nsys per-frame postproc bubble ratio: 35% (v8) -> 31% (v10).
Results:
  v8 avg FPS:  119.0
  v10 avg FPS: 119.75 (+0.6% over v8, +28.7% over baseline)
Parity: 4/431 frames differ by +/-1 detection vs baseline (same 4 as
before). Unit tests pass (11/11).
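The fixed-grid launch with a device-side early exit can be sketched on the CPU as follows. The names `mask_program` and `launch_fixed_grid` are hypothetical, and the exit condition is written as "slot index at or beyond the survivor count", which is an assumption about the exact comparison. The trailing arithmetic reproduces the net-gain figure from the numbers quoted above.

```python
def mask_program(pid, counter, survivor_idx, work_log):
    """One 'program' of the fixed grid; reads the count without a host sync."""
    n_survivors = counter[0]        # value written earlier by the filter stage
    if pid >= n_survivors:
        return                      # early exit: no survivor in this slot
    work_log.append(survivor_idx[pid])


def launch_fixed_grid(num_queries, counter, survivor_idx):
    """Grid is sized by the static upper bound, so no host-side count is needed."""
    work_log = []
    for pid in range(num_queries):
        mask_program(pid, counter, survivor_idx, work_log)
    return work_log


# Per-frame trade-off, in microseconds, from the figures in the message.
bubble_removed_us = 146
kernel_growth_us = 112 - 27
net_gain_us = bubble_removed_us - kernel_growth_us
```

The design trades a little wasted scheduling (most programs exit immediately) for the removal of a host round-trip between the two launches.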
…turn path
Two torch .to(torch.bool) kernels per frame (431 + 628 launches) from:
1. common.py:125 `mask=mask_bin.to(torch.bool)` (unused downstream)
2. triton_fullpostproc.py return `mask_any_view.bool()` (also unused)
Both are trivial, but removing them saves ~1000 kernel launches per run.
Small FPS gain (119.75 -> 120.37, +0.5%) and cleaner profile.
nsys per-frame postproc bubble ratio: 31% (v10) -> 28% (v11).
Remaining bubbles are dominated by:
- 87us initial wait for TRT engine (unavoidable without pipelining frames)
- 173us CPU-Python overhead between counter.item() sync and first .cpu()
- 3x 25-40us tiny DtoH gaps (launch overhead, not transfer time)
These all stem from Python interpreter cost around `.cpu()` calls. Further
reduction needs structural changes (frame pipelining) or C-extension work.
Parity: 4/431 frames differ (same 4). Unit tests 11/11 pass.
aseembits93
pushed a commit
that referenced
this pull request
Apr 29, 2026
…ansfer
Prior layout had 4 separate small .cpu() calls in the adapter:
  4B   counter.item()
  48B  xyxy.cpu()
  12B  conf.cpu()
  12B  class_id.cpu()
Each has ~25-40us of Python+sync overhead. Collapsing them to 1 transfer
cuts 4 events with 3 bubbles down to 2 events with 1 bubble.
Change: filter kernel now writes a single combined (num_queries, 6) int32
buffer packing [x1, y1, x2, y2, conf_as_int32_bits, class_id] per slot.
Adapter does one .cpu() of combined and bitcasts column 4 to fp32 on the
host side via numpy.view(np.float32). counter.item() stays inside the
Triton wrapper (must be inside the postproc stream context).
nsys per-frame postproc window (after mask kernel -> final DtoH):
  v11: 4 small DtoH + 3 bubbles = ~130us of plumbing
  v13: 1 small DtoH + 1 bubble  = ~75us of plumbing
Bubbles: 28% (v11) -> 23% (v13). FPS: 120.37 (v11) -> 120.49 (v13).
Small absolute FPS change because the saved plumbing time is already tiny
in absolute terms; the wins on this scale are about reducing bubble
density, which matters more on slower devices (Jetson).
Parity: 4/431 frames differ (same 4). Unit tests 11/11 pass.
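The bit-packing trick above, storing an fp32 confidence in an int32 column and reinterpreting it on the host, can be shown with the standard library. This sketch uses `struct` as a stand-in for the `numpy.view(np.float32)` bitcast mentioned in the message. The helper names are hypothetical.

```python
import struct


def f32_to_i32_bits(x):
    """Reinterpret a float32 value's bits as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def i32_bits_to_f32(bits):
    """Inverse of f32_to_i32_bits: reinterpret int32 bits as float32."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def pack_row(x1, y1, x2, y2, conf, class_id):
    """One slot of the combined buffer: six int32 values."""
    return [x1, y1, x2, y2, f32_to_i32_bits(conf), class_id]


def unpack_row(row):
    """Host-side decode: column 4 is bitcast back to a float."""
    x1, y1, x2, y2, conf_bits, class_id = row
    return (x1, y1, x2, y2), i32_bits_to_f32(conf_bits), class_id


row = pack_row(10, 20, 110, 220, 0.75, 3)
row_bytes = len(struct.pack("<6i", *row))  # 6 int32 values per slot
```

Each slot is 24 bytes, so the 72B transfer quoted above is consistent with three surviving detections. This is a bit reinterpretation, not a numeric cast: casting the int32 column to float with a value conversion would produce garbage confidences.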
…ansfers
Replace the adapter's two synchronous .cpu() calls with pinned-memory
.copy_(non_blocking=True) + a single stream.synchronize(). Both the 72B
combined buffer and the 307KB mask buffer now pipeline on the copy engine
in parallel rather than chaining.
Pinned buffers cached per (name, dtype) and reused across frames, growing
if a larger shape is ever requested (unlikely in practice: num_queries is
100 and orig H/W fixed per video).
nsys per-frame postproc window:
  v13: 2 DtoH calls with 49us gap between = 95us for plumbing
  v14: 2 DtoH calls with 15us gap between = 68us for plumbing
Bubbles: 23% (v13) -> 24% (v14, within noise). FPS: 120.49 (v13) -> 121.21
(v14). Small since the transfer time was already dwarfed by Python
interpreter overhead (~200us between the mask kernel ending and the
adapter's first DtoH).
Parity: 4/431 frames differ (same 4). Unit tests 11/11 pass.
Two preprocess-side bubble fixes:
1. Eliminate the preproc_buffer -> graph_input_buffer DtoD memcpy.
Previously every frame did:
- Triton preproc wrote to self._fast_input_buffer (1x3x312x312 fp32)
- TRT graph replay copied self._fast_input_buffer into its own
input_buffer via trt_cuda_graph_state.input_buffer.copy_(...)
- TRT replay read from its graph-owned input_buffer
The DtoD was 1168128 B per frame = 508 MB over a 431-frame run.
Fix: thread a "use external buffer" hint via a tensor attribute
(_trt_reuse_as_input_buffer). When set, _capture_cuda_graph bakes the
external tensor's address into the graph instead of allocating its own.
On replay, we skip the DtoD entirely by checking data_ptr equality.
Result: DtoD per run 439 -> 8 (98% reduction).
2. Pinned-memory HtoD for the raw BGR frame.
Previously: torch.from_numpy(np.ascontiguousarray(images)).to(device,
non_blocking=True) silently falls back to sync HtoD because the numpy
source isn't in pinned memory.
Fix: cache a pinned host buffer + GPU buffer on first call, copy the
numpy frame into pinned with np.copyto (fast CPU memcpy), then
tensor.copy_(pinned, non_blocking=True) for a genuine async HtoD.
5-run avg FPS: 121.21 (v14) -> 122.40 (v15). Parity: 4/431 frames differ
(same 4). Unit tests 11/11 pass.
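The per-frame copy size and the DtoD reduction quoted in item 1 can be reproduced with a few lines, alongside a minimal model of the pointer-equality check. `tensor_bytes` and `needs_copy` are hypothetical names, and `needs_copy` only models the idea of comparing buffer addresses. It does not touch any real tensors.

```python
def tensor_bytes(shape, itemsize):
    """Bytes occupied by a dense tensor of the given shape and element size."""
    n = 1
    for d in shape:
        n *= d
    return n * itemsize


def needs_copy(src_ptr, graph_input_ptr):
    """Skip the device-to-device copy when both names refer to the same buffer."""
    return src_ptr != graph_input_ptr


input_bytes = tensor_bytes((1, 3, 312, 312), 4)  # fp32 input buffer
dtod_before, dtod_after = 439, 8                 # copies per run, from the message
reduction_pct = 100.0 * (dtod_before - dtod_after) / dtod_before
```

Because the graph bakes in a fixed input address, the check on replay only needs to compare two integers.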
11b38e7 to
df0d306
Compare
…l order
Two Triton postproc tweaks to shrink numerical drift vs the non-Triton path:
1. Banker's rounding (half-to-even) for xyxy integer casts, matching
torch.round().int() exactly. Catches the rare case where a bbox ends
up at an exact x.5 boundary.
2. Reorder the bbox math so FP32 evaluation order matches the baseline:
x_min_pct = cx_pct - 0.5 * w_pct
x_min_px = x_min_pct * inference_w
x_min = (x_min_px - pad_left) * inv_scale_w
Previously we scaled early (cx * W, then sub w*W*0.5) which gave
different FP32 rounding in the subtract-scaled-values step.
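Both tweaks can be illustrated in plain Python. The built-in `round()` uses half-to-even, the same rule the message attributes to `torch.round()`, while the common `floor(x + 0.5)` shortcut rounds halves up. The `decode_x_min` helper is a hypothetical name that follows the evaluation order listed above. Python floats are 64-bit, so this shows the order of operations only and does not reproduce FP32 rounding differences.

```python
import math


def round_half_even(x):
    """Banker's rounding: ties go to the nearest even integer."""
    return round(x)


def round_half_up(x):
    """Common shortcut that differs from half-to-even on exact .5 values."""
    return math.floor(x + 0.5)


def decode_x_min(cx_pct, w_pct, inference_w, pad_left, inv_scale_w):
    """Box left edge, evaluated in the same order as the steps listed above."""
    x_min_pct = cx_pct - 0.5 * w_pct        # subtract in normalized space first
    x_min_px = x_min_pct * inference_w      # then scale to inference pixels
    return (x_min_px - pad_left) * inv_scale_w


for v in (0.5, 1.5, 2.5, 3.5):
    print(v, round_half_even(v), round_half_up(v))
```

The two rounding rules disagree only on exact ties such as 0.5 and 2.5, which is why the mismatch shows up rarely.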
Verified via per-frame parity capture on vehicles_312px.mp4 (538 frames):
  Config                               Exact match   max |Δconf|   max |Δxyxy|
  Baseline                                  100.0%             0             0
  + CUDA graphs only                        100.0%             0             0
  + Triton postproc (no preproc)             99.8%             0          1 px
  + Triton postproc + Triton preproc          0.2%          0.44        119 px
The Triton post-process path is effectively at parity with the baseline
(99.8% of frames exact match, zero confidence deltas, max 1 pixel
bbox shift on one frame).
All remaining drift comes from the Triton PREPROCESS path: our Triton
bilinear resize matches torch.nn.functional.interpolate(bilinear) to
1e-6, but the baseline uses cv2.resize which has integer-arithmetic
bilinear semantics. The ~0.01 preproc pixel drift cascades through the
TRT engine to give ~1% conf drift + 9% of bboxes shifted by 1 pixel.
No semantic regression: detection totals match exactly (1899 = 1899),
detection counts off by 1 on 12/538 frames (2.2%, all at
marginal-confidence thresholds), no det with conf > 0.5 ever disappears.
NMS tie-break flips produce the few outlier deltas.
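A per-frame parity capture of the kind summarized above could be computed along these lines. The data layout (a list of `(conf, (x1, y1, x2, y2))` tuples per frame) and the function names are assumptions. The sketch also assumes detections are already in matching order, so outputs with non-deterministic ordering would need to be sorted or matched first.

```python
def frame_parity(base, cand):
    """Compare one frame's detections; each is (conf, (x1, y1, x2, y2))."""
    if len(base) != len(cand):
        return {"exact": False, "count_delta": len(cand) - len(base),
                "max_dconf": None, "max_dxyxy": None}
    max_dconf = max((abs(b[0] - c[0]) for b, c in zip(base, cand)), default=0.0)
    max_dxyxy = max(
        (abs(bv - cv) for b, c in zip(base, cand) for bv, cv in zip(b[1], c[1])),
        default=0,
    )
    return {"exact": max_dconf == 0 and max_dxyxy == 0, "count_delta": 0,
            "max_dconf": max_dconf, "max_dxyxy": max_dxyxy}


def summarize(frames_base, frames_cand):
    """Aggregate per-frame reports into the columns used in the table above."""
    reports = [frame_parity(b, c) for b, c in zip(frames_base, frames_cand)]
    comparable = [r for r in reports if r["max_dconf"] is not None]
    return {
        "exact_match_pct": 100.0 * sum(r["exact"] for r in reports) / len(reports),
        "frames_with_count_delta": sum(r["count_delta"] != 0 for r in reports),
        "max_dconf": max((r["max_dconf"] for r in comparable), default=0.0),
        "max_dxyxy": max((r["max_dxyxy"] for r in comparable), default=0),
    }
```

Frames whose detection counts differ are reported separately, because a per-detection delta is not meaningful until the two lists are matched.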
Closing in favor of #22. Both PRs cover the same optimization arc on the same files; #22's branch ( |
Summary
Flags to enable (all default off):
- `RFDETR_USE_TRITON_PREPROC=true`: fused Triton preprocessing kernel
- `RFDETR_TRITON_FULLPOSTPROC=true`: fused Triton post-processing kernels
- `ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true`: enable CUDA-graph replay for TRT forward
Commit-by-commit changes
Each row shows the 5-run avg FPS after the commit lands, and the delta over the prior commit.
- fuse ENTIRE post-process into 2 Triton kernels
- adapter-level refinements on top of full-fusion
- kill bubbles in post-process (+10.3% FPS on top of v3)
- remove CPU-side sync between filter and mask Triton kernels
- drop unnecessary .bool() casts in postproc return path
- batch scalar DtoH into single 72B combined transfer
- pinned host buffers + async DtoH for final transfers
- eliminate preproc->TRT DtoD copy + pinned HtoD
1. Fuse ENTIRE post-process into 2 Triton kernels (+13.9%)
Replaces the whole post-TRT chain (sigmoid + argmax + class-remap + conf threshold + xywh→xyxy + denorm + pad subtract + scale divide + clip + mask bilinear upsample + threshold > 0) with two Triton kernel launches.
- `_rfdetr_fullpost_filter_kernel`: grid = `num_queries`. Per query: fused filter + box denorm + clip. Writes padded fixed-shape outputs (`keep` marks filtered rows).
- `_rfdetr_fullpost_mask_kernel_compact`: grid = `n_survivors × tile_y × tile_x`. Per tile: inverse-map orig→mask coords, bilinear upsample, threshold > 0, store uint8. Uses `survivor_idx` so only masks that survive the filter get processed (30× less compute than the fixed-padded variant).
Key insight: "fuse into fixed-shape padded output" only wins when the kernel cost scales with inputs (filter step = 100 queries × 91 classes, constant). Per-output-pixel kernels need explicit compaction ("zero out filtered rows" is a trap when 95% of output pixels are wasted work).
2. Adapter-level refinements on top of full-fusion (+1.8%)
Three eliminations of downstream kernels via in-kernel work:
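- Round + int32 cast of xyxy inside the filter kernel → removes 314 `round_kernel_cuda` per run
- Per-survivor `mask_any` via `tl.atomic_max` → removes 314 `or_kernel_cuda` reduces per run
- Compact `(n_survivors, H, W)` mask output → removes the `torch.zeros` fill (20 ms) + 4 `[keep]` indexing kernels per frame
Kernels dropped: 10,900 → 4,669 (-57%). GPU postproc time: 60 ms → 17 ms (-72%).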
3. Kill bubbles in post-process (+10.2%)
Biggest single jump. Five eliminations combined:
- Atomic-counter compaction in the filter kernel. Each surviving query does `slot = tl.atomic_add(counter_ptr, 1)` and writes outputs to `compact[slot]`, skipping filtered queries entirely. Absorbs 13 downstream kernels (`keep.bool`, the 3-cub `nonzero()` pipeline, 3 `index_select` gathers, `torch.zeros`, `.bool()` cast).
- Removed the TRT output-buffer clone at `trt.py:711`. Graph replay writes to fixed output buffers; instead of 3× DtoD memcpy per frame to protect readers, replaced with a `consumer_done_event` on `TRTCudaGraphState` that the next replay waits on.
- Removed the outer `stream.synchronize()` in `infer_from_trt_engine`. Was blocking CPU after every forward. Replaced with a cross-stream `cuda.Event` that postproc waits on from its own stream.
- Cached threshold tensor. Was allocating `torch.tensor([thr])` (4 B HtoD) per frame. Now cached per (value, device).
- Adapter simplification. Removed the GPU-side `.any()` + `masked_select` + `nonempty_idx` DtoH/HtoD roundtrip. Since `det.mask` from the Triton path is already compact and `masks2poly` fast-skips empty masks, CPU-side filter is adequate.
Postproc bubble ratio: 83% → 35%.
4. Remove CPU-side sync between filter and mask Triton kernels (+0.6%)
The filter→mask bubble was 146 µs per frame, caused by `int(counter.item())` blocking the host between the two Triton launches. Fix: launch the mask kernel with `grid=(num_queries, tile_y, tile_x)` and have each program read `counter[0]` on the GPU and early-exit if its `s` index exceeds `n_survivors`. Host reads counter only after the mask kernel is queued, so the two kernels pipeline on the stream. Tradeoff: mask kernel runtime grew 27 → 112 µs (scheduler iterates ~100 programs, most early-exit), net gain ~61 µs/frame.
Also cached `mask_bin` at max `(num_queries, H, W)` on module scope to skip per-frame `torch.empty()` launch.
5. Drop unnecessary `.bool()` casts in postproc return path (+0.5%)
Two `torch.to(torch.bool)` kernels per frame from dead conversions (`common.py:125` and the wrapper return). Removed, no functional impact. ~1000 kernel launches deleted per run.
6. Batch scalar DtoH into single 72B combined transfer (+0.1%)
Four separate small `.cpu()` calls in the adapter (4B counter + 48B xyxy + 12B conf + 12B class_id), each with ~25-40 µs Python+sync overhead. Collapsed to one transfer: filter kernel now writes a single combined `(num_queries, 6)` int32 buffer packing `[x1, y1, x2, y2, conf_as_i32_bits, class_id]` per slot. Adapter does one `.cpu()` and bitcasts column 4 to fp32 via `numpy.view(np.float32)`.
7. Pinned host buffers + async DtoH for final transfers (+0.6%)
Replace the adapter's two synchronous `.cpu()` calls with pinned-memory `.copy_(non_blocking=True)` + a single `stream.synchronize()`. Both the 72B combined buffer and the 307KB mask buffer now pipeline on the copy engine in parallel rather than chaining. Pinned buffers cached per `(name, dtype)` and reused across frames.
DtoH timing: 49 µs gap between scalar + mask transfers → 15 µs.
8. Eliminate preproc → TRT DtoD copy + pinned HtoD (+1.0%)
Two preprocess-side fixes:
1. Eliminate the per-frame DtoD from `_fast_input_buffer → graph_input_buffer` (1,168,128 B per frame = 508 MB per run). Threaded a `_trt_reuse_as_input_buffer` tensor-attribute hint through `_capture_cuda_graph`. When set, the graph bakes the external tensor's address as its input buffer. On replay, skip the DtoD when `data_ptr` matches. Result: DtoD/PtoP memcpys per run 439 → 8 (-98%).
2. Pinned-memory HtoD for the raw BGR frame. Was using `torch.from_numpy(ascontiguousarray(img)).to(device, non_blocking=True)`, but without a pinned source, `non_blocking=True` silently falls back to sync. Now uses a cached pinned host buffer + GPU buffer, copies the numpy frame into pinned with `np.copyto`, then `tensor.copy_(pinned, non_blocking=True)` for a genuine async HtoD.
Remaining bubbles (future work)
From the final v15 profile:
- … `_rfdetr_stretch_preprocess_kernel` completing and `cudaGraphLaunch`: pure call-chain cost across `pre_process` → `infer` → `forward` → `infer_from_trt_engine` → 4 more wrapper layers → `cuda_graph.replay`. Eliminable via call-chain flattening.
- … `.cpu()` call. Same category: CPython cost, not GPU.
Files changed
- `inference_models/models/rfdetr/triton_fullpostproc.py`
- `inference_models/models/rfdetr/triton_preprocess.py`
- `inference_models/models/rfdetr/common.py`: … `RFDETR_TRITON_FULLPOSTPROC` into `post_process_instance_segmentation_results`.
- `inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py`: … `_trt_reuse_as_input_buffer` marker, event-based cross-stream ordering.
- `inference_models/models/common/trt.py`: `_trt_reuse_as_input_buffer` support, removed output-buffer clone, cross-stream event plumbing.
- `inference/core/models/inference_models_adapters.py`
Test plan
- `pytest tests/inference/unit_tests/models/test_rfdetr.py -x -q`: 11/11 pass
- … `/tmp/nsys_profiles/rfdetr_trt_fullpost_v{3..15}.nsys-rep`