perf(rfdetr-seg): Triton fusion + pipelining cuts post/pre-proc bubbles (93 → 122 FPS, +31.5%) by aseembits93 · Pull Request #23 · aseembits93/inference

aseembits93 · 2026-04-29T02:06:19Z

Summary

Metric	Baseline	Final	Δ
Avg FPS (Tesla T4, example_video.mp4, 431 frames, 5-run avg)	93.07	122.40	+31.5%
Post-process bubble ratio (nsys)	83%	24%	-59 pts
GPU kernels per run	133,821	~7,000	-94%
DtoD/PtoP memcpys per run	439	8	-98%
Postproc kernel cub/torch ops per frame	~45	~3	-93%
Parity vs baseline	—	4/431 frames ±1 detection	within 5% tolerance
Unit tests	—	11/11 pass	✓

Flags to enable (all default off):

RFDETR_USE_TRITON_PREPROC=true — fused Triton preprocessing kernel
RFDETR_TRITON_FULLPOSTPROC=true — fused Triton post-processing kernels
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true — enable CUDA-graph replay for TRT forward
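
As a minimal sketch, the three flags can be set from Python before the model code is loaded. The flag names come from this PR; that they are read from the process environment at import time is an assumption, and `enable_fast_path` is a hypothetical helper, not part of the codebase.

```python
import os

# Flag names are taken from the PR description; all three default to off.
FLAGS = {
    "RFDETR_USE_TRITON_PREPROC": "true",
    "RFDETR_TRITON_FULLPOSTPROC": "true",
    "ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND": "true",
}

def enable_fast_path(env=os.environ):
    """Set the opt-in flags (hypothetical helper).

    Assumption: the flags are read when the model modules are imported,
    so this should run before those imports.
    """
    env.update(FLAGS)
    return {name: env[name] for name in FLAGS}

enabled = enable_fast_path()
```

Exporting the same variables in the shell before launching the process should be equivalent.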

Commit-by-commit changes

Each row shows the 5-run avg FPS after the commit lands, and the delta over the prior commit.

#	Commit	Avg FPS	Δ over prev	Δ vs baseline
0	baseline (before this PR)	93.07	—	—
1	`fuse ENTIRE post-process into 2 Triton kernels`	106.05	+13.9%	+13.9%
2	`adapter-level refinements on top of full-fusion`	107.94	+1.8%	+15.9%
3	`kill bubbles in post-process (+10.3% FPS on top of v3)`	119.00	+10.2%	+27.9%
4	`remove CPU-side sync between filter and mask Triton kernels`	119.75	+0.6%	+28.7%
5	`drop unnecessary .bool() casts in postproc return path`	120.37	+0.5%	+29.3%
6	`batch scalar DtoH into single 72B combined transfer`	120.49	+0.1%	+29.5%
7	`pinned host buffers + async DtoH for final transfers`	121.21	+0.6%	+30.2%
8	`eliminate preproc->TRT DtoD copy + pinned HtoD`	122.40	+1.0%	+31.5%

1. Fuse ENTIRE post-process into 2 Triton kernels (+13.9%)

Replaces the whole post-TRT chain (sigmoid + argmax + class-remap + conf threshold + xywh→xyxy + denorm + pad subtract + scale divide + clip + mask bilinear upsample + threshold > 0) with two Triton kernel launches.

_rfdetr_fullpost_filter_kernel — grid = num_queries. Per query: fused filter + box denorm + clip. Writes padded fixed-shape outputs (keep marks filtered rows).
_rfdetr_fullpost_mask_kernel_compact — grid = n_survivors × tile_y × tile_x. Per tile: inverse-map orig→mask coords, bilinear upsample, threshold > 0, store uint8. Uses survivor_idx so only filtered masks get processed (30× less compute than the fixed-padded variant).

Key insight: "fuse into fixed-shape padded output" only wins when the kernel cost scales with inputs (filter step = 100 queries × 91 classes, constant). Per-output-pixel kernels need explicit compaction ("zero out filtered rows" is a trap when 95% of output pixels are wasted work).
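
The scaling argument can be checked with a few lines of arithmetic. This is a rough sketch: the 320×320 frame size and the 3-survivor count are assumptions inferred from the pixel counts quoted in the commit log (10.2M vs ~300k pixels per frame), not values stated by the PR.

```python
NUM_QUERIES = 100          # from the PR: filter grid = num_queries
CLASSES = 91               # from the PR
ORIG_H, ORIG_W = 320, 320  # assumed frame size, inferred from the quoted pixel counts

def filter_work(num_queries=NUM_QUERIES, classes=CLASSES):
    """Filter cost scales with inputs: constant per frame."""
    return num_queries * classes

def mask_work(n_masks, h=ORIG_H, w=ORIG_W):
    """Mask upsample cost scales with outputs: one unit per output pixel."""
    return n_masks * h * w

padded = mask_work(NUM_QUERIES)  # fixed-shape variant processes every query
compact = mask_work(3)           # compact variant, assuming 3 survivors
ratio = padded / compact         # wasted-work factor of the padded variant
```

Under these assumptions the padded mask kernel touches about 33 times more pixels than the compact one, in line with the 30× figure above, while the filter step costs the same either way.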

2. Adapter-level refinements on top of full-fusion (+1.8%)

Three eliminations of downstream kernels via in-kernel work:

Filter kernel rounds + casts xyxy to int32 internally → removes 431 round_kernel_cuda per run
Mask kernel accumulates per-survivor mask_any via tl.atomic_max → removes 314 or_kernel_cuda reduces per run
Compact (n_survivors)-shape outputs → removes the per-frame torch.zeros fill (20 ms per run in total) + 4 [keep] indexing kernels per frame

Kernels dropped: 10,900 → 4,669 (-57%). GPU postproc time: 60 ms → 17 ms (-72%).

3. Kill bubbles in post-process (+10.2%)

Biggest single jump. Five eliminations combined:

Atomic-counter compaction in filter kernel — each surviving query does slot = tl.atomic_add(counter_ptr, 1) and writes outputs to compact[slot], skipping filtered queries entirely. Absorbs 13 downstream kernels (keep.bool, the 3-cub nonzero() pipeline, 3 index_select gathers, torch.zeros, .bool() cast).
Removed TRT output-buffer clone in trt.py:711. Graph replay writes to fixed output buffers; instead of 3× DtoD memcpy per frame to protect readers, replaced with a consumer_done_event on TRTCudaGraphState that the next replay waits on.
Removed outer stream.synchronize() in infer_from_trt_engine. Was blocking CPU after every forward. Replaced with a cross-stream cuda.Event that postproc waits on from its own stream.
Cached threshold tensor — was allocating torch.tensor([thr]) (4 B HtoD) per frame. Now cached per (value, device).
Simplified adapter — removed the GPU-side .any() + masked_select + nonempty_idx DtoH/HtoD roundtrip. Since det.mask from the Triton path is already compact and masks2poly fast-skips empty masks, CPU-side filter is adequate.

Postproc bubble ratio: 83% → 35%.
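
The atomic-counter compaction can be simulated on the CPU. This is an illustrative pure-Python sketch, not the Triton kernel: `filter_compact` and `atomic_add` are hypothetical stand-ins, the strict `>` comparison is an assumption, and the shuffle only mimics the arbitrary order in which GPU programs reach the atomic.

```python
import random

def filter_compact(confidences, threshold, seed=0):
    """Simulate the filter kernel: one 'program' per query, survivors compacted."""
    counter = [0]                      # stands in for the int32 atomic counter
    n = len(confidences)
    compact_conf = [None] * n          # sized for the worst case (all survive)
    survivor_idx = [None] * n          # slot -> original query index

    def atomic_add(cell, inc):
        old = cell[0]                  # returns the previous value, like tl.atomic_add
        cell[0] += inc
        return old

    order = list(range(n))
    random.Random(seed).shuffle(order) # programs hit the atomic in arbitrary order
    for pid in order:
        if not confidences[pid] > threshold:   # assumed strict comparison
            continue                   # filtered queries write nothing
        slot = atomic_add(counter, 1)
        compact_conf[slot] = confidences[pid]
        survivor_idx[slot] = pid

    n_survivors = counter[0]
    return compact_conf[:n_survivors], survivor_idx[:n_survivors]

scores = [0.9, 0.1, 0.6, 0.2, 0.8]
conf, idx = filter_compact(scores, threshold=0.5)
```

The survivor order depends on the shuffle, which mirrors the non-deterministic output ordering noted in the commit log; the set of survivors does not.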

4. Remove CPU-side sync between filter and mask Triton kernels (+0.6%)

The filter→mask bubble was 146 µs per frame, caused by int(counter.item()) blocking the host between the two Triton launches. Fix: launch the mask kernel with grid=(num_queries, tile_y, tile_x) and have each program read counter[0] on the GPU and early-exit if its s index falls outside the n_survivors filled slots. The host reads counter only after the mask kernel is queued, so the two kernels pipeline on the stream. Tradeoff: mask kernel runtime grew 27 → 112 µs (scheduler iterates ~100 programs, most early-exit), so the net gain is 146 − 85 ≈ 61 µs/frame.

Also cached mask_bin at max (num_queries, H, W) on module scope to skip per-frame torch.empty() launch.
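
A CPU-side sketch of the over-provisioned grid with device-side early exit, plus the net-gain arithmetic. The function names are hypothetical and the tile dimensions are omitted; the timings are the per-frame figures quoted above.

```python
def mask_program(s, counter, survivor_idx, processed):
    """One program of the worst-case grid (tile dimensions omitted)."""
    n_survivors = counter[0]       # read on the 'device': no host round-trip
    if s >= n_survivors:           # slot was never filled by the filter kernel
        return
    processed.append(survivor_idx[s])

def launch_mask_kernel(num_queries, counter, survivor_idx):
    processed = []
    for s in range(num_queries):   # grid sized for num_queries, not n_survivors
        mask_program(s, counter, survivor_idx, processed)
    return processed

counter = [3]                                  # written by the filter kernel
survivor_idx = [7, 42, 11] + [0] * 97          # only the first 3 slots are valid
processed = launch_mask_kernel(100, counter, survivor_idx)

# Per-frame timings quoted in the PR (microseconds).
sync_bubble_removed = 146
mask_kernel_before, mask_kernel_after = 27, 112
net_gain_us = sync_bubble_removed - (mask_kernel_after - mask_kernel_before)
```

The launch no longer needs n_survivors on the host, which is what lets the two kernels queue back to back.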

5. Drop unnecessary `.bool()` casts in postproc return path (+0.5%)

Two torch.to(torch.bool) kernels per frame from dead conversions (common.py:125 and the wrapper return). Removed, no functional impact. ~1000 kernel launches deleted per run.

6. Batch scalar DtoH into single 72B combined transfer (+0.1%)

Four separate small .cpu() calls in the adapter (4B counter + 48B xyxy + 12B conf + 12B class_id), each with ~25-40 µs Python+sync overhead. Collapsed to one transfer: the filter kernel now writes a single combined (num_queries, 6) int32 buffer packing [x1, y1, x2, y2, conf_as_i32_bits, class_id] per slot. The 72 B in the title matches a 3-survivor frame, as the sizes above imply (3 rows × 6 columns × 4 B). The adapter does one .cpu() and bitcasts column 4 to fp32 via numpy.view(np.float32).
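
The packing trick can be sketched with the standard library alone. Here `memoryview.cast` plays the role of `numpy.view(np.float32)`; the helper names and detection values are made up for illustration, and the sketch assumes a platform where a C `int` is 4 bytes.

```python
import struct
from array import array

COLS = 6  # [x1, y1, x2, y2, conf_bits, class_id]

def f32_bits(value):
    """Reinterpret an fp32 value as a signed 32-bit integer (no numeric conversion)."""
    return struct.unpack("=i", struct.pack("=f", value))[0]

def pack_rows(dets):
    """Build the flat int32 buffer the filter kernel is described as writing."""
    flat = array("i")
    assert flat.itemsize == 4, "sketch assumes 32-bit C int"
    for x1, y1, x2, y2, conf, class_id in dets:
        flat.extend([x1, y1, x2, y2, f32_bits(conf), class_id])
    return flat

def unpack_conf(flat):
    """Bitcast column 4 back to fp32 by viewing the same bytes as floats."""
    as_f32 = memoryview(flat).cast("B").cast("f")
    return [as_f32[r * COLS + 4] for r in range(len(flat) // COLS)]

# Hypothetical detections; the confidences are exactly representable in fp32.
dets = [(10, 20, 110, 220, 0.75, 3), (5, 6, 50, 60, 0.5, 7), (0, 0, 8, 8, 0.625, 1)]
combined = pack_rows(dets)
confs = unpack_conf(combined)
nbytes = combined.itemsize * len(combined)
```

With three rows the buffer is 72 bytes, so a single transfer carries the boxes, confidences and class ids together.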

7. Pinned host buffers + async DtoH for final transfers (+0.6%)

Replace the adapter's two synchronous .cpu() calls with pinned-memory .copy_(non_blocking=True) + a single stream.synchronize(). Both the 72B combined buffer and the 307KB mask buffer now pipeline on the copy engine in parallel rather than chaining. Pinned buffers cached per (name, dtype) and reused across frames.

DtoH timing: 49 µs gap between scalar + mask transfers → 15 µs.

8. Eliminate preproc → TRT DtoD copy + pinned HtoD (+1.0%)

Two preprocess-side fixes:

Eliminate the per-frame DtoD from _fast_input_buffer → graph_input_buffer (1,168,128 B per frame = 508 MB per run). Threaded a _trt_reuse_as_input_buffer tensor-attribute hint through _capture_cuda_graph. When set, the graph bakes the external tensor's address as its input buffer. On replay, skip the DtoD when data_ptr matches. Result: DtoD/PtoP memcpys per run 439 → 8 (-98%).

Pinned-memory HtoD for the raw BGR frame. Was using torch.from_numpy(ascontiguousarray(img)).to(device, non_blocking=True) — but without pinned source, non_blocking=True silently falls back to sync. Now uses a cached pinned host buffer + GPU buffer, copies the numpy frame into pinned with np.copyto, then tensor.copy_(pinned, non_blocking=True) for a genuine async HtoD.
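
A small sketch of the buffer-size arithmetic and of the pointer-equality check that skips the copy. `replay` is a hypothetical stand-in for the graph-replay path and the addresses are made up. The plain product over 431 frames comes out near 503 MB, slightly below the 508 MB quoted above; the quoted figure may include a few extra copies, which the PR does not spell out.

```python
def input_bytes(batch=1, channels=3, h=312, w=312, itemsize=4):
    """Size of the 1x3x312x312 fp32 input buffer named in the commit log."""
    return batch * channels * h * w * itemsize

def replay(graph_input_ptr, frame_ptr, copies):
    """Copy only when the frame is not already in the graph's input buffer."""
    if frame_ptr != graph_input_ptr:
        copies.append((frame_ptr, graph_input_ptr))  # stands in for the DtoD copy_()
    # The graph launch itself would follow here.

per_frame = input_bytes()
per_run = per_frame * 431

copies = []
replay(graph_input_ptr=0x1000, frame_ptr=0x2000, copies=copies)  # separate buffers: copied
replay(graph_input_ptr=0x1000, frame_ptr=0x1000, copies=copies)  # shared buffer: skipped
```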

Remaining bubbles (future work)

From the final v15 profile:

~175 µs Python interpreter overhead between _rfdetr_stretch_preprocess_kernel completing and cudaGraphLaunch — pure call-chain cost across pre_process → infer → forward → infer_from_trt_engine → 4 more wrapper layers → cuda_graph.replay. Eliminable via call-chain flattening.
~200 µs Python overhead in the adapter between the mask kernel ending and the first .cpu() call. Same category — CPython cost, not GPU.
CPU-GPU pipelining: currently serial per frame. GPU is idle ~50% of wall time waiting for CPU plumbing. A ping-pong double-buffer scheme could reclaim another +40-55% FPS (estimated ceiling ~170-190 FPS).
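
The estimated ceiling follows from applying the projected gain to the final measured FPS. A quick check, using only numbers quoted in this description:

```python
FINAL_FPS = 122.40  # 5-run average after the last commit

def projected_fps(fps, gain):
    """FPS after a relative gain, e.g. gain=0.40 for +40%."""
    return fps * (1.0 + gain)

frame_ms = 1000.0 / FINAL_FPS            # current wall time per frame
low = projected_fps(FINAL_FPS, 0.40)     # lower end of the +40-55% estimate
high = projected_fps(FINAL_FPS, 0.55)    # upper end
```

This gives roughly 171 to 190 FPS. It is a projection from the estimated gain, not a measurement.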

Files changed

File	Purpose
`inference_models/models/rfdetr/triton_fullpostproc.py`	New. Two Triton kernels + host wrapper for the fused post-process.
`inference_models/models/rfdetr/triton_preprocess.py`	Existing (PR #22). Used by fast-path preproc.
`inference_models/models/rfdetr/common.py`	Wires `RFDETR_TRITON_FULLPOSTPROC` into `post_process_instance_segmentation_results`.
`inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py`	Pinned HtoD, buffer reuse, `_trt_reuse_as_input_buffer` marker, event-based cross-stream ordering.
`inference_models/models/common/trt.py`	`_trt_reuse_as_input_buffer` support, removed output-buffer clone, cross-stream event plumbing.
`inference/core/models/inference_models_adapters.py`	Batched combined-buffer DtoH, pinned async DtoH, simplified mask filter.

Test plan

pytest tests/inference/unit_tests/models/test_rfdetr.py -x -q — 11/11 pass
Benchmark: 5-run avg 122.40 ± 0.9 FPS
Parity: 4/431 frames differ by ±1 detection vs baseline (same 4 frames as PR #22; attributable to Triton bilinear vs torchvision antialias at mask edges)
nsys profiles saved at /tmp/nsys_profiles/rfdetr_trt_fullpost_v{3..15}.nsys-rep
Test on higher-res video (mask kernel compute scales with orig_h × orig_w)
Test on Jetson Orin (pinned-memory + async-transfer wins should scale better there)

… stream sync reduction

Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine, example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes below + enabling the existing CUDA-graph cache:

Baseline (no changes): 93.07 FPS
+ Triton preprocess (fused resize+BGR2RGB+norm): ~93 FPS (U6)
+ U7 mask-decode skip for empty masks: ~94 FPS (flag-gated)
+ Triton postprocess conf-filter: 98.6 FPS (+5.9%)
+ ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True: 102.1 FPS (+9.7%)
+ Drop pre/post stream syncs: 102.2 FPS (+9.8%)

Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs cv2.resize rounding at mask boundaries). Unit tests pass (11/11).

Changes (all flag-gated, opt-in):

inference_models/models/rfdetr/triton_preprocess.py (new)
One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.

inference_models/models/rfdetr/triton_postprocess.py (new)
One Triton kernel fusing sigmoid + argmax-over-classes + class-remap + confidence-threshold filter. Replaces ~14k small cub/torch kernels with 431 (1 per frame). Supports both per-class threshold vector and scalar, with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.

inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
- Wire the Triton preprocess fast-path in pre_process() with a guarded dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
- Cache pre-allocated input buffer and normalization constants on model instance on first call.
- Replace pre_process_stream.synchronize() with a CUDA event ev.wait() on the inference stream so the CPU doesn't stall waiting for the preprocessing Triton kernel to finish.
- Drop the post_process_stream.synchronize() (the adapter's subsequent .cpu() calls provide the implicit sync).

inference_models/models/rfdetr/common.py
Wire the Triton postprocess conf-filter into post_process_instance_segmentation_results. Falls back to torch path when the model has no remapping table, is CPU-bound, or Triton is unavailable.

inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
Same Triton preprocess kernel + dispatch for the legacy inference package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS default routes to inference_models adapters) but kept for parity so the legacy path benefits if exercised.

inference/core/models/inference_models_adapters.py
GPU mask-decode fast-path (U7): reduce mask emptiness with .any(dim=(1,2)) on GPU, only DtoH + cv2.findContours non-empty masks. Gated via RFDETR_GPU_POSTPROCESS=true (default on). Produces identical output to the reference path.

Env vars introduced:

RFDETR_USE_TRITON_PREPROC=true	opt-in; fused preproc kernel
RFDETR_TRITON_POSTPROC=true	opt-in; fused postproc conf filter
RFDETR_GPU_POSTPROCESS=true	default on; GPU mask emptiness skip
RFDETR_DISABLE_GPU_PREPROC=true	opt-out; disable torch GPU preproc
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True	enables existing TRT CUDA graph cache

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s (+3.3% FPS vs partial)

Experimental follow-up to the shipped partial-fusion PR. Replaces the whole post-TRT chain (sigmoid, argmax over classes, class remap, conf threshold, xywh->xyxy, denorm to orig image, padding subtract, scale divide, clip, mask bilinear upsample, threshold>0) with two Triton kernel launches.

Results (5-run avg on Tesla T4, example_video, 431 frames):

Baseline: 93.07 FPS
Partial fusion (RFDETR_TRITON_POSTPROC): 102.74 FPS (+10.4%)
FULL fusion v1 (RFDETR_TRITON_FULLPOSTPROC): 101.07 FPS (slower than partial!)
FULL fusion v2 (compact mask kernel): 106.05 FPS (+13.9% vs baseline, +3.3% vs partial)

Why v1 was slower: The fixed-grid "run mask kernel on all 100 queries, zero out filtered rows" approach did 30x more GPU work than torch because only 3-7 masks typically survive filtering out of 100, but the kernel processed all of them (10.2M pixels/frame vs torch's ~300k).

v2 fix: Added `_rfdetr_fullpost_mask_kernel_compact` which takes an explicit (n_survivors,) survivor_idx tensor. Host uses keep.nonzero() + a small DtoH sync to size the kernel launch. Trades one tiny sync (<5 us) for 30x less compute in the mask upsample.

nsys kernel counts (per 431-frame run):

Variant	total_k	gpu_ms	mask_kernel_ms
Partial	117,973	2011	10.9
Full v1 (bad)	10,394	393	342.0
Full v2 (good)	12,315	82	3.5

Parity: 4/431 frames differ by +/-1 detection vs baseline, same 4 frames as the shipped partial fusion (Triton bilinear vs torchvision antialias at mask edges). Unit tests pass (11/11).

Files:

inference_models/models/rfdetr/triton_fullpostproc.py (new)
- _rfdetr_fullpost_filter_kernel: num_queries programs; per-query fuses sigmoid(max(logits)) + argmax + class remap + conf threshold + box denorm + xywh->xyxy + pad subtract + scale divide + clip + store padded outputs (keep flag for filtered rows).
- _rfdetr_fullpost_mask_kernel_compact: n_survivors * tile_y * tile_x programs; per-tile bilinear upsample 78x78 -> orig_h x orig_w with survivor_idx lookup, threshold > 0, store as uint8.

inference_models/models/rfdetr/common.py (modified)
New fast path at top of post_process_instance_segmentation_results gated by RFDETR_TRITON_FULLPOSTPROC=true. Guards: batch=1, nonsquare_intermediate_size is None, static_crop_offset is (0,0), classes_re_mapping is not None (rfdetr-seg-nano default).

Env var: RFDETR_TRITON_FULLPOSTPROC=true

Notes / future work:
- The shipped `RFDETR_TRITON_POSTPROC` partial-fusion flag is still useful as a simpler, more-portable fallback. This full-fusion path covers more ground but has tighter config guards.
- The key lesson: "fuse into fixed-shape padded output" is only a win when the kernel's cost scales with inputs (the filter step), NOT outputs (the mask resize). Per-output-pixel kernels need compaction.

…on (+1.8% FPS)

Follow-up optimizations on the full-postproc fusion to eliminate the downstream plumbing kernels still visible in nsys after PR #23.

Results (5-run avg on Tesla T4, example_video, 431 frames):

Baseline: 93.07 FPS
Partial fusion (PR #22): 102.74 FPS (+10.4%)
Full fusion v2 (PR #23 initial): 106.05 FPS (+13.9%)
Full fusion v3 (this commit): 107.94 FPS (+15.9%)

Changes:

1. `triton_fullpostproc.py`:
- Filter kernel: rounds + casts xyxy to int32 inside the kernel (replaces a downstream `.round().int()` elementwise kernel).
- Mask kernel: accumulates per-survivor `mask_any` via tile-level `tl.atomic_max`, eliminating a downstream `det.mask.any(dim=(1,2))` reduce. Writes to compact `(n_survivors, H, W)` output instead of padded `(num_queries, H, W)`, which removes the 20 ms/run `torch.zeros` fill kernel.
- Host wrapper: returns compact tensors directly (xyxy_int32, conf, cls_id, mask_bin uint8, mask_any bool), all sized to n_survivors. The single `keep.nonzero()` remains as the only DtoH sync in the fused path.

2. `common.py`:
- Build `InstanceDetections` with compact tensors directly; skip the `xyxy[keep].round().int()` / `.bool()` / `[keep]` indexing chain that added 5 kernels per frame.
- Attach precomputed `mask_any` onto the detections object via `__dict__` so the adapter can use it without recomputation.

3. `inference_models_adapters.py`:
- Use `getattr(det, "mask_any", None)` from the full-fusion path when available; fall back to `.any(dim=(1,2))` otherwise.

nsys kernel count shrinkage (per 431-frame run):

Variant	total_k	legacy_pp	pp_ms
FULL v2	12,315	10,900	60ms
FULL v3	7,026	4,669	17ms

(-43% total, -57% pp, -72% pp_ms)

Specifically eliminated:
- DeviceSelectSweep/Reduce/CompactInit triple: 2155 -> 431 each (5 nonzero() calls per frame collapsed to 1).
- FillFunctor<uint8>: 431 -> 0 (no more padded-output torch.zeros).
- or_kernel_cuda reduce: 314 -> 0 (mask_any precomputed by Triton).
- round_kernel_cuda: 314 -> 0 (xyxy rounded inside Triton kernel).
- direct_copy_kernel: 1804 -> ~2746 total (slight increase due to index_select on scalar fields, but they're tiny).

Parity: 4/431 frames differ by +/-1 detection vs baseline; same 4 frames as PR #22 and PR #23 v2, no new divergence. Unit tests pass (11/11).

…op of v3)

Results (5-run avg, Tesla T4):

Baseline: 93.07 FPS
PR #23 v3: 107.94 FPS (+15.9% vs baseline)
+ this commit (v8): 119.0 FPS (+27.9% vs baseline, +10.3% vs v3)

Changes:

1. Atomic-counter compaction in filter kernel (_rfdetr_fullpost_filter_kernel):
- Added `counter` param (int32 atomic). Each surviving query does `slot = tl.atomic_add(counter_ptr, 1)` and writes its outputs to compact[slot], skipping filtered queries entirely.
- Added `survivor_idx_out_ptr`: filter kernel records pid->slot mapping so the mask kernel can gather the right input row without a separate nonzero() pass on the host.
- Added `mask_any_out_ptr`: filter kernel zeroes mask_any[slot] at write time; mask kernel then atomic_max's up to 1 only if any pixel survives the threshold. Removes a torch.zeros launch per frame.
- Result: collapsed ~13 downstream kernels (keep.bool, nonzero's 3-cub pipeline, 3 index_select gathers, torch.zeros, .bool() cast, etc.) into one kernel.
- Output ordering is non-deterministic across survivors (atomic-add ordering), but downstream doesn't require sorted order; we already skipped torch.sort in earlier work.

2. Removed TRT output-buffer clone (trt.py:711):
- Graph replay writes to fixed output buffers every frame. We used to .clone() 3 times per frame (3 DtoD memcpys) to protect postproc readers from the next frame's replay overwriting data.
- Replaced with an explicit consumer_done cuda.Event on TRTCudaGraphState: postproc records when it's done reading, next TRT replay .wait()s on the event before submitting. Zero DtoD clones, proper cross-stream ordering.

3. Removed the outer stream.synchronize() in infer_from_trt_engine:
- Was blocking CPU after every TRT forward. Replaced with a cuda.Event recorded on the TRT graph stream; postproc consumer .wait()s on that event from the post_process_stream. Forward returns to CPU immediately.

4. Cached threshold tensor (triton_fullpostproc._prepare_threshold):
- Was allocating `torch.tensor([thr])` on GPU every frame (4B HtoD).
- Now cached per (value, device): single HtoD across the run.

5. Adapter simplification (inference_models_adapters.py):
- Removed the GPU-side mask_any filter + masked_select + nonempty_idx DtoH/HtoD roundtrip. det.mask from the Triton path is already compact (n_survivors rows); masks2poly fast-skips empty masks internally via np.any, so CPU-side filtering is adequate.
- Removed 3 cub kernels (DeviceReduce/Compact/Select) + 1 gather kernel + 1 torch.as_tensor HtoD per frame.

Parity: 4/431 frames differ by +/-1 detection vs baseline; same 4 frames as prior PR (Triton bilinear vs torchvision antialias). Unit tests pass (11/11).

nsys per-frame postproc bubble ratio: 83% (v3) -> 35% (v8).

…Triton kernels

The filter->mask bubble in nsys was 146us per frame, caused by `int(counter.item())` blocking the host between the two Triton launches.

Fix: launch the mask kernel with `grid=(num_queries, tile_y, tile_x)` and have each program read `counter[0]` on the GPU and early-exit if its `s` index exceeds `n_survivors`. Host reads counter only AFTER the mask kernel is queued, letting the two kernels pipeline on the stream.

Tradeoffs:
- Mask kernel runtime grew 27us -> 112us (scheduler iterates ~100 programs, most early-exit). Net gain: ~61us/frame.
- Still need one DtoH (4 bytes) at the very end to get the final n_survivors for tensor views. That's the single remaining mandatory sync.

Also cached the mask_bin tensor at (num_queries, H, W) on module scope to skip the per-frame torch.empty() launch (~20us saved).

nsys per-frame postproc bubble ratio: 35% (v8) -> 31% (v10).

Results:
v8 avg FPS: 119.0
v10 avg FPS: 119.75 (+0.6% over v8, +28.7% over baseline)

Parity: 4/431 frames differ by +/-1 detection vs baseline (same 4 as before). Unit tests pass (11/11).

…turn path

Two torch .to(torch.bool) kernels per frame (431 + 628 launches) from:
1. common.py:125 `mask=mask_bin.to(torch.bool)`, unused downstream
2. triton_fullpostproc.py return `mask_any_view.bool()`, also unused

Both are trivial but remove ~1000 kernel launches per run. Small FPS gain (119.75 -> 120.37, +0.5%) and cleaner profile.

nsys per-frame postproc bubble ratio: 31% (v10) -> 28% (v11).

Remaining bubbles are dominated by:
- 87us initial wait for TRT engine (unavoidable without pipelining frames)
- 173us CPU-Python overhead between counter.item() sync and first .cpu()
- 3x 25-40us tiny DtoH gaps (launch overhead, not transfer time)

These all stem from Python interpreter cost around `.cpu()` calls. Further reduction needs structural changes (frame pipelining) or C-extension work.

Parity: 4/431 frames differ (same 4). Unit tests 11/11 pass.

…ansfer

Prior layout had 4 separate small .cpu() calls in the adapter:
- 4B counter.item()
- 48B xyxy.cpu()
- 12B conf.cpu()
- 12B class_id.cpu()

Each has ~25-40us of Python+sync overhead. Collapsing them to 1 transfer cuts 4 events with 3 bubbles down to 2 events with 1 bubble.

Change: filter kernel now writes a single combined (num_queries, 6) int32 buffer packing [x1, y1, x2, y2, conf_as_int32_bits, class_id] per slot. Adapter does one .cpu() of combined and bitcasts column 4 to fp32 on the host side via numpy.view(np.float32). counter.item() stays inside the Triton wrapper (must be inside the postproc stream context).

nsys per-frame postproc window (after mask kernel -> final DtoH):
v11: 4 small DtoH + 3 bubbles = ~130us of plumbing
v13: 1 small DtoH + 1 bubble = ~75us of plumbing

Bubbles: 28% (v11) -> 23% (v13). FPS: 120.37 (v11) -> 120.49 (v13).

Small absolute FPS change because the saved plumbing time is already tiny in absolute terms. The wins on this scale are about reducing bubble density, which matters more on slower devices (Jetson).

Parity: 4/431 frames differ (same 4). Unit tests 11/11 pass.

…ansfers

Replace the adapter's two synchronous .cpu() calls with pinned-memory .copy_(non_blocking=True) + a single stream.synchronize(). Both the 72B combined buffer and the 307KB mask buffer now pipeline on the copy engine in parallel rather than chaining.

Pinned buffers cached per (name, dtype) and reused across frames, growing if a larger shape is ever requested (unlikely in practice: num_queries is 100 and orig H/W fixed per video).

nsys per-frame postproc window:
v13: 2 DtoH calls with 49us gap between = 95us for plumbing
v14: 2 DtoH calls with 15us gap between = 68us for plumbing

Bubbles: 23% (v13) -> 24% (v14, within noise). FPS: 120.49 (v13) -> 121.21 (v14).

Small since the transfer time was already dwarfed by Python interpreter overhead (~200us between the mask kernel ending and the adapter's first DtoH).

Parity: 4/431 frames differ (same 4). Unit tests 11/11 pass.

Two preprocess-side bubble fixes:

1. Eliminate the preproc_buffer -> graph_input_buffer DtoD memcpy. Previously every frame did:
- Triton preproc wrote to self._fast_input_buffer (1x3x312x312 fp32)
- TRT graph replay copied self._fast_input_buffer into its own input_buffer via trt_cuda_graph_state.input_buffer.copy_(...)
- TRT replay read from its graph-owned input_buffer

The DtoD was 1168128 B per frame = 508 MB over a 431-frame run.

Fix: thread a "use external buffer" hint via a tensor attribute (_trt_reuse_as_input_buffer). When set, _capture_cuda_graph bakes the external tensor's address into the graph instead of allocating its own. On replay, we skip the DtoD entirely by checking data_ptr equality.

Result: DtoD per run 439 -> 8 (98% reduction).

2. Pinned-memory HtoD for the raw BGR frame. Previously: torch.from_numpy(np.ascontiguousarray(images)).to(device, non_blocking=True) silently falls back to sync HtoD because the numpy source isn't in pinned memory.

Fix: cache a pinned host buffer + GPU buffer on first call, copy the numpy frame into pinned with np.copyto (fast CPU memcpy), then tensor.copy_(pinned, non_blocking=True) for a genuine async HtoD.

5-run avg FPS: 121.21 (v14) -> 122.40 (v15).

Parity: 4/431 frames differ (same 4). Unit tests 11/11 pass.

…l order

Two Triton postproc tweaks to shrink numerical drift vs the non-Triton path:

1. Banker's rounding (half-to-even) for xyxy integer casts, matching torch.round().int() exactly. Catches the rare case where a bbox ends up at an exact x.5 boundary.

2. Reorder the bbox math so FP32 evaluation order matches the baseline:

    x_min_pct = cx_pct - 0.5 * w_pct
    x_min_px = x_min_pct * inference_w
    x_min = (x_min_px - pad_left) * inv_scale_w

Previously we scaled early (cx * W, then sub w*W*0.5) which gave different FP32 rounding in the subtract-scaled-values step.

Verified via per-frame parity capture on vehicles_312px.mp4 (538 frames):

Config	Exact match	max |Δconf|	max |Δxyxy|
Baseline	100.0%	0	0
+ CUDA graphs only	100.0%	0	0
+ Triton postproc (no preproc)	99.8%	0	1 px
+ Triton postproc + Triton preproc	0.2%	0.44	119 px

The Triton post-process path is bit-parity clean with the baseline (99.8% of frames exact match, zero confidence deltas, max 1 pixel bbox shift on one frame). All remaining drift comes from the Triton PREPROCESS path: our Triton bilinear resize matches torch.nn.functional.interpolate(bilinear) to 1e-6, but the baseline uses cv2.resize which has integer-arithmetic bilinear semantics. The ~0.01 preproc pixel drift cascades through the TRT engine to give ~1% conf drift + 9% of bboxes shifted by 1 pixel.

No semantic regression: detection totals match exactly (1899 = 1899), detection counts off by 1 on 12/538 frames (2.2%, all at marginal-confidence thresholds), no det with conf > 0.5 ever disappears. NMS tie-break flips produce the few outlier deltas.
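
The rounding difference only shows up on exact .5 ties. A small illustration in plain Python, whose built-in `round()` is also half-to-even; the helper names are illustrative and this is not the Triton code:

```python
import math

def round_half_up(x):
    """Naive rounding: ties always go up."""
    return math.floor(x + 0.5)

def round_half_even(x):
    """Banker's rounding: ties go to the nearest even integer."""
    return round(x)  # Python's built-in round() is half-to-even

ties = [0.5, 1.5, 2.5, 3.5]
half_up = [round_half_up(v) for v in ties]      # ties rounded up
half_even = [round_half_even(v) for v in ties]  # ties rounded to even
```

The two schemes disagree on 0.5 and 2.5 and agree on every non-tie value, which fits the commit's note that only the rare exact x.5 boundary case is affected.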

aseembits93 · 2026-04-29T22:13:36Z

Closing in favor of #22. Both PRs cover the same optimization arc on the same files; #22's branch (optimize-rfdetr-seg) is strictly ahead of #23's branch (experiment/full-postproc-triton) — every line in #23 is in #22, plus the full Triton postproc fusion, deferred counter sync (W2), and scratch caching (W7) on top. Current benchmark on vehicles_312px.mp4 (538 frames, Tesla T4 FP16): 109 FPS → 151 FPS (+38%), 0-diff parity vs v16 best across all 538 frames.

claude and others added 4 commits April 28, 2026 22:33

initial point

419c5a7

aseembits93 added 3 commits April 29, 2026 03:50

aseembits93 added 3 commits April 29, 2026 04:24

aseembits93 force-pushed the experiment/full-postproc-triton branch from 11b38e7 to df0d306 Compare April 29, 2026 05:22

aseembits93 changed the title ~~experiment(rfdetr-seg): fuse ENTIRE post-process into 2 Triton kernels (+13.9% vs baseline)~~ perf(rfdetr-seg): Triton fusion + pipelining cuts post/pre-proc bubbles (93 → 122 FPS, +31.5%) Apr 29, 2026

aseembits93 added 2 commits April 29, 2026 18:26

replace benchmark script with minimal InferencePipeline-based version

bd4792d

aseembits93 mentioned this pull request Apr 29, 2026

perf(rfdetr-seg): Triton fusion + CUDA graphs + scratch caching (109 → 151 FPS on vehicles_312px) #22

Open

7 tasks

aseembits93 closed this Apr 29, 2026

aseembits93 wants to merge 12 commits into main from experiment/full-postproc-triton
