perf(rfdetr-seg): Triton fusion + pipelining cuts post/pre-proc bubbles (93 → 122 FPS, +31.5%) #23

Closed
aseembits93 wants to merge 12 commits into main from experiment/full-postproc-triton

aseembits93 (Owner) commented Apr 29, 2026

Summary

| Metric | Baseline | Final | Δ |
| --- | --- | --- | --- |
| Avg FPS (Tesla T4, example_video.mp4, 431 frames, 5-run avg) | 93.07 | 122.40 | +31.5% |
| Post-process bubble ratio (nsys) | 83% | 24% | -59 pts |
| GPU kernels per run | 133,821 | ~7,000 | -94% |
| DtoD/PtoP memcpys per run | 439 | 8 | -98% |
| Postproc kernel cub/torch ops per frame | ~45 | ~3 | -93% |
| Parity vs baseline | | 4/431 frames ±1 detection | within 5% tolerance |
| Unit tests | | 11/11 pass | |

Flags to enable (all default off):

  • RFDETR_USE_TRITON_PREPROC=true — fused Triton preprocessing kernel
  • RFDETR_TRITON_FULLPOSTPROC=true — fused Triton post-processing kernels
  • ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true — enable CUDA-graph replay for TRT forward
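As a minimal sketch of setting the three flags from Python before the model is loaded: the `env_flag` helper below is hypothetical and only illustrates one way such a flag could be parsed. The actual parsing code is not shown in this PR.

```python
import os

# Flag names are taken from the list above; all are off unless set.
FLAGS = (
    "RFDETR_USE_TRITON_PREPROC",
    "RFDETR_TRITON_FULLPOSTPROC",
    "ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND",
)


def env_flag(name: str, default: bool = False) -> bool:
    """Hypothetical helper: treat 'true', '1' or 'yes' (any case) as enabled."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"true", "1", "yes"}


# Opt in to all three optimizations for this process.
for flag in FLAGS:
    os.environ[flag] = "true"

enabled = {flag: env_flag(flag) for flag in FLAGS}
```

Setting the variables in the shell before launching the process would have the same effect.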

Commit-by-commit changes

Each row shows the 5-run avg FPS after the commit lands, and the delta over the prior commit.

| # | Commit | Avg FPS | Δ over prev | Δ vs baseline |
| --- | --- | --- | --- | --- |
| 0 | baseline (before this PR) | 93.07 | | |
| 1 | fuse ENTIRE post-process into 2 Triton kernels | 106.05 | +13.9% | +13.9% |
| 2 | adapter-level refinements on top of full-fusion | 107.94 | +1.8% | +15.9% |
| 3 | kill bubbles in post-process (+10.3% FPS on top of v3) | 119.00 | +10.2% | +27.9% |
| 4 | remove CPU-side sync between filter and mask Triton kernels | 119.75 | +0.6% | +28.7% |
| 5 | drop unnecessary .bool() casts in postproc return path | 120.37 | +0.5% | +29.3% |
| 6 | batch scalar DtoH into single 72B combined transfer | 120.49 | +0.1% | +29.5% |
| 7 | pinned host buffers + async DtoH for final transfers | 121.21 | +0.6% | +30.2% |
| 8 | eliminate preproc->TRT DtoD copy + pinned HtoD | 122.40 | +1.0% | +31.5% |

1. Fuse ENTIRE post-process into 2 Triton kernels (+13.9%)

Replaces the whole post-TRT chain (sigmoid + argmax + class-remap + conf threshold + xywh→xyxy + denorm + pad subtract + scale divide + clip + mask bilinear upsample + threshold > 0) with two Triton kernel launches.

  • _rfdetr_fullpost_filter_kernel — grid = num_queries. Per query: fused filter + box denorm + clip. Writes padded fixed-shape outputs (keep marks filtered rows).
  • _rfdetr_fullpost_mask_kernel_compact — grid = n_survivors × tile_y × tile_x. Per tile: inverse-map orig→mask coords, bilinear upsample, threshold > 0, store uint8. Uses survivor_idx so only filtered masks get processed (30× less compute than the fixed-padded variant).
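The per-query work of the filter kernel can be pictured with a plain-Python reference. This is a sketch of the math only, not the Triton kernel: the `Geometry` container, the `<` threshold comparison, and clipping to `[0, orig_w]` / `[0, orig_h]` are assumptions made for illustration.

```python
import math
from typing import NamedTuple


class Geometry(NamedTuple):  # hypothetical container for resize/pad parameters
    inference_w: int
    inference_h: int
    pad_left: float
    pad_top: float
    inv_scale_w: float
    inv_scale_h: float
    orig_w: int
    orig_h: int


def _edge(center, size, sign, extent, pad, inv_scale, limit):
    pct = center + sign * 0.5 * size         # xywh -> one xyxy edge (normalized)
    px = pct * extent                        # denormalize to inference pixels
    val = (px - pad) * inv_scale             # pad subtract, then scale divide
    return min(max(val, 0.0), float(limit))  # clip to the original image


def filter_query(logits, box, threshold, class_remap, g):
    """Return None if the query is filtered, else (xyxy, conf, class_id)."""
    # Sigmoid is monotonic, so the argmax over logits is the argmax over scores.
    best = max(range(len(logits)), key=logits.__getitem__)
    conf = 1.0 / (1.0 + math.exp(-logits[best]))
    if conf < threshold:
        return None
    cx, cy, w, h = box
    xyxy = (
        _edge(cx, w, -1.0, g.inference_w, g.pad_left, g.inv_scale_w, g.orig_w),
        _edge(cy, h, -1.0, g.inference_h, g.pad_top, g.inv_scale_h, g.orig_h),
        _edge(cx, w, +1.0, g.inference_w, g.pad_left, g.inv_scale_w, g.orig_w),
        _edge(cy, h, +1.0, g.inference_h, g.pad_top, g.inv_scale_h, g.orig_h),
    )
    return xyxy, conf, class_remap[best]
```

In the fused kernel each of the `num_queries` programs would run this logic for one query, which is why its cost is fixed by the input shape.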

Key insight: "fuse into fixed-shape padded output" only wins when the kernel cost scales with inputs (filter step = 100 queries × 91 classes, constant). Per-output-pixel kernels need explicit compaction ("zero out filtered rows" is a trap when 95% of output pixels are wasted work).
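A back-of-the-envelope check of why compaction matters for the mask kernel. The 320×320 output size and the survivor count of 3 are assumptions chosen for illustration (the commit notes below mention that 3-7 masks typically survive out of 100).

```python
def mask_pixels(n_masks: int, out_h: int, out_w: int) -> int:
    """Output pixels a per-pixel mask kernel must compute."""
    return n_masks * out_h * out_w


NUM_QUERIES = 100
OUT_H = OUT_W = 320  # assumed output resolution, illustration only
N_SURVIVORS = 3      # assumed survivor count for this frame

padded = mask_pixels(NUM_QUERIES, OUT_H, OUT_W)   # fixed-shape padded variant
compact = mask_pixels(N_SURVIVORS, OUT_H, OUT_W)  # compacted variant
ratio = padded / compact
```

Under these assumptions the padded variant touches roughly 33 times as many pixels, which is the same order as the 30× figure quoted above.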

2. Adapter-level refinements on top of full-fusion (+1.8%)

Three eliminations of downstream kernels via in-kernel work:

  • Filter kernel rounds + casts xyxy to int32 internally → removes 431 round_kernel_cuda per run
  • Mask kernel accumulates per-survivor mask_any via tl.atomic_max → removes 314 or_kernel_cuda reduces per run
  • Compact (n_survivors)-shape outputs → removes the per-frame torch.zeros fill (20 ms per run) + 4 [keep] indexing kernels per frame

Legacy post-process kernels per run: 10,900 → 4,669 (-57%). GPU postproc time per run: 60 ms → 17 ms (-72%).

3. Kill bubbles in post-process (+10.2%)

Biggest single jump. Five eliminations combined:

  1. Atomic-counter compaction in filter kernel — each surviving query does slot = tl.atomic_add(counter_ptr, 1) and writes outputs to compact[slot], skipping filtered queries entirely. Absorbs 13 downstream kernels (keep.bool, the 3-cub nonzero() pipeline, 3 index_select gathers, torch.zeros, .bool() cast).
  2. Removed TRT output-buffer clone in trt.py:711. Graph replay writes to fixed output buffers; instead of 3× DtoD memcpy per frame to protect readers, replaced with a consumer_done_event on TRTCudaGraphState that the next replay waits on.
  3. Removed outer stream.synchronize() in infer_from_trt_engine. Was blocking CPU after every forward. Replaced with a cross-stream cuda.Event that postproc waits on from its own stream.
  4. Cached threshold tensor — was allocating torch.tensor([thr]) (4 B HtoD) per frame. Now cached per (value, device).
  5. Simplified adapter — removed the GPU-side .any() + masked_select + nonempty_idx DtoH/HtoD roundtrip. Since det.mask from the Triton path is already compact and masks2poly fast-skips empty masks, CPU-side filter is adequate.
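The atomic-counter compaction in item 1 can be simulated sequentially in plain Python. This is only a sketch of the slot-assignment idea: the shuffled visiting order imitates nondeterministic GPU scheduling, and the `>=` threshold comparison is an assumption.

```python
import random


def compact_survivors(confs, threshold, seed=0):
    """Simulate slot = atomic_add(counter, 1) for each surviving query."""
    order = list(range(len(confs)))
    random.Random(seed).shuffle(order)  # programs finish in arbitrary order
    counter = 0
    survivor_idx = [-1] * len(confs)    # slot -> original query id
    compact_conf = [0.0] * len(confs)   # padded capacity, first `counter` rows valid
    for pid in order:
        if confs[pid] >= threshold:
            slot = counter              # value returned by the atomic add
            counter += 1
            survivor_idx[slot] = pid
            compact_conf[slot] = confs[pid]
    return counter, survivor_idx[:counter], compact_conf[:counter]
```

The set of survivors is deterministic but their order is not, which matches the note in the commit message that downstream code does not rely on sorted output.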

Postproc bubble ratio: 83% → 35%.

4. Remove CPU-side sync between filter and mask Triton kernels (+0.6%)

The filter→mask bubble was 146 µs per frame, caused by int(counter.item()) blocking the host between the two Triton launches. Fix: launch the mask kernel with grid=(num_queries, tile_y, tile_x) and have each program read counter[0] on the GPU and early-exit if its s index is not below n_survivors. The host reads counter only after the mask kernel is queued, so the two kernels pipeline on the stream. Tradeoff: mask kernel runtime grew 27 → 112 µs (the scheduler iterates ~100 programs, most of which early-exit), so the net gain is ~61 µs/frame (146 µs of bubble removed minus 85 µs of extra kernel time).
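A small simulation of the over-provisioned grid and of the net-gain arithmetic. The tile counts are assumed values, and the list `counter` stands in for the device-side counter tensor.

```python
def count_active_programs(num_queries, tiles_y, tiles_x, counter):
    """Count programs that do real work; the rest exit immediately."""
    active = 0
    for s in range(num_queries):
        for _ty in range(tiles_y):
            for _tx in range(tiles_x):
                n_survivors = counter[0]  # read on the device, no host round trip
                if s >= n_survivors:
                    continue              # early exit for slots past the survivors
                active += 1
    return active


def net_gain_us(bubble_removed, kernel_before, kernel_after):
    """Bubble time saved minus the extra kernel time paid for it."""
    return bubble_removed - (kernel_after - kernel_before)


active = count_active_programs(100, 2, 2, [3])  # 2x2 tiles is an assumption
gain = net_gain_us(146, 27, 112)                # figures from the paragraph above
```

The launch size no longer depends on a value the host has to read back, which is what removes the sync.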

Also cached mask_bin at max (num_queries, H, W) on module scope to skip per-frame torch.empty() launch.

5. Drop unnecessary .bool() casts in postproc return path (+0.5%)

Two torch.to(torch.bool) kernels per frame from dead conversions (common.py:125 and the wrapper return). Removed, no functional impact. ~1000 kernel launches deleted per run.

6. Batch scalar DtoH into single 72B combined transfer (+0.1%)

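The adapter made four separate small .cpu() calls (4 B counter + 48 B xyxy + 12 B conf + 12 B class_id), each with ~25-40 µs of Python + sync overhead. The three per-detection reads are now collapsed into one transfer: the filter kernel writes a single combined (num_queries, 6) int32 buffer packing [x1, y1, x2, y2, conf_as_i32_bits, class_id] per slot. The adapter does one .cpu() on that buffer and bitcasts column 4 to fp32 on the host with NumPy's ndarray.view(np.float32). The 4 B counter.item() read stays inside the Triton wrapper, so the 72 B figure covers only the combined buffer (48 + 12 + 12 B).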
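The bit-packing trick can be shown with the standard library alone. This sketch uses `struct` instead of NumPy and assumes little-endian layout; the function names are hypothetical.

```python
import struct


def pack_row(x1, y1, x2, y2, conf, class_id) -> bytes:
    """Pack one detection as six int32 values; conf travels as raw fp32 bits."""
    conf_bits = struct.unpack("<i", struct.pack("<f", conf))[0]
    return struct.pack("<6i", x1, y1, x2, y2, conf_bits, class_id)


def unpack_row(row: bytes):
    """Reverse of pack_row: reinterpret column 4 as fp32 on the host."""
    x1, y1, x2, y2, conf_bits, class_id = struct.unpack("<6i", row)
    conf = struct.unpack("<f", struct.pack("<i", conf_bits))[0]
    return x1, y1, x2, y2, conf, class_id


row = pack_row(10, 20, 110, 220, 0.75, 3)
three_rows = row * 3  # three detections occupy 3 * 24 = 72 bytes
```

Reinterpreting bits rather than converting values keeps the confidence exact, since no float-to-int rounding happens at any point.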

7. Pinned host buffers + async DtoH for final transfers (+0.6%)

Replace the adapter's two synchronous .cpu() calls with pinned-memory .copy_(non_blocking=True) + a single stream.synchronize(). Both the 72B combined buffer and the 307KB mask buffer now pipeline on the copy engine in parallel rather than chaining. Pinned buffers cached per (name, dtype) and reused across frames.

DtoH timing: 49 µs gap between scalar + mask transfers → 15 µs.
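The buffer-reuse policy described above can be sketched without a GPU. Here a `bytearray` stands in for pinned host memory, and the class name is hypothetical; only the caching and grow-on-demand behavior is illustrated.

```python
class BufferCache:
    """Stand-in for a pinned host-buffer cache keyed by (name, dtype)."""

    def __init__(self):
        self._buffers = {}
        self.allocations = 0

    def get(self, name: str, dtype: str, nbytes: int) -> memoryview:
        key = (name, dtype)
        buf = self._buffers.get(key)
        if buf is None or len(buf) < nbytes:
            buf = bytearray(nbytes)  # real code would allocate pinned memory here
            self._buffers[key] = buf
            self.allocations += 1
        return memoryview(buf)[:nbytes]  # view sized to the current request


cache = BufferCache()
first = cache.get("combined", "int32", 72)
second = cache.get("combined", "int32", 72)  # reused, no new allocation
```

Allocating pinned memory is comparatively expensive, so paying for it once per buffer rather than once per frame is the point of the cache.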

8. Eliminate preproc → TRT DtoD copy + pinned HtoD (+1.0%)

Two preprocess-side fixes:

Eliminate the per-frame DtoD from _fast_input_buffer → graph_input_buffer (1,168,128 B per frame = 508 MB per run). Threaded a _trt_reuse_as_input_buffer tensor-attribute hint through _capture_cuda_graph. When set, the graph bakes the external tensor's address as its input buffer. On replay, skip the DtoD when data_ptr matches. Result: DtoD/PtoP memcpys per run 439 → 8 (-98%).

Pinned-memory HtoD for the raw BGR frame. Was using torch.from_numpy(ascontiguousarray(img)).to(device, non_blocking=True) — but without pinned source, non_blocking=True silently falls back to sync. Now uses a cached pinned host buffer + GPU buffer, copies the numpy frame into pinned with np.copyto, then tensor.copy_(pinned, non_blocking=True) for a genuine async HtoD.
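A toy model of the first fix: the graph keeps one fixed input buffer and replay skips the copy when the caller already wrote into it. Object identity stands in for the `data_ptr` comparison, and the `GraphState` class is hypothetical.

```python
BYTES_PER_FRAME = 1 * 3 * 312 * 312 * 4  # 1x3x312x312 fp32 input


class GraphState:
    """Toy captured graph with a single baked-in input buffer."""

    def __init__(self, input_buffer: bytearray):
        self.input_buffer = input_buffer  # address fixed at capture time
        self.copies = 0

    def replay(self, frame: bytearray) -> None:
        if frame is not self.input_buffer:  # real code compares data_ptr values
            self.input_buffer[:] = frame    # the DtoD copy being avoided
            self.copies += 1
        # the graph launch would follow here


preproc_out = bytearray(16)      # small stand-in for the preprocess output
state = GraphState(preproc_out)  # graph captured on the external buffer
state.replay(preproc_out)        # same buffer: no copy
```

Passing any other buffer still works, it just pays for the copy, so the hint is an optimization rather than a requirement.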


Remaining bubbles (future work)

From the final v15 profile:

  • ~175 µs Python interpreter overhead between _rfdetr_stretch_preprocess_kernel completing and cudaGraphLaunch: pure call-chain cost across pre_process → infer → forward → infer_from_trt_engine → 4 more wrapper layers → cuda_graph.replay. Eliminable via call-chain flattening.
  • ~200 µs Python overhead in the adapter between the mask kernel ending and the first .cpu() call. Same category — CPython cost, not GPU.
  • CPU-GPU pipelining: currently serial per frame. GPU is idle ~50% of wall time waiting for CPU plumbing. A ping-pong double-buffer scheme could reclaim another +40-55% FPS (estimated ceiling ~170-190 FPS).
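One possible shape for the proposed pipelining, sketched with a single worker thread standing in for the GPU stream. This is not part of the PR; `cpu_prepare` and `gpu_infer` are hypothetical callables, and a real implementation would alternate between two device buffers.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipelined(frames, cpu_prepare, gpu_infer):
    """Overlap CPU preparation of frame i+1 with 'GPU' work on frame i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as gpu:  # one job in flight
        pending = None
        for frame in frames:
            prepared = cpu_prepare(frame)  # runs while `pending` executes
            if pending is not None:
                results.append(pending.result())
            pending = gpu.submit(gpu_infer, prepared)
        if pending is not None:
            results.append(pending.result())
    return results


outputs = run_pipelined(range(5), lambda f: f + 1, lambda p: p * 10)
```

At most two frames are live at once (one being prepared, one in flight), which is what a ping-pong pair of buffers would hold.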

Files changed

| File | Purpose |
| --- | --- |
| inference_models/models/rfdetr/triton_fullpostproc.py | New. Two Triton kernels + host wrapper for the fused post-process. |
| inference_models/models/rfdetr/triton_preprocess.py | Existing (PR #22). Used by fast-path preproc. |
| inference_models/models/rfdetr/common.py | Wires RFDETR_TRITON_FULLPOSTPROC into post_process_instance_segmentation_results. |
| inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py | Pinned HtoD, buffer reuse, _trt_reuse_as_input_buffer marker, event-based cross-stream ordering. |
| inference_models/models/common/trt.py | _trt_reuse_as_input_buffer support, removed output-buffer clone, cross-stream event plumbing. |
| inference/core/models/inference_models_adapters.py | Batched combined-buffer DtoH, pinned async DtoH, simplified mask filter. |

Test plan

  • pytest tests/inference/unit_tests/models/test_rfdetr.py -x -q — 11/11 pass
  • Benchmark: 5-run avg 122.40 ± 0.9 FPS
  • Parity: 4/431 frames differ by ±1 detection vs baseline (same 4 frames as PR #22; attributable to Triton bilinear vs torchvision antialias at mask edges)
  • nsys profiles saved at /tmp/nsys_profiles/rfdetr_trt_fullpost_v{3..15}.nsys-rep
  • Test on higher-res video (mask kernel compute scales with orig_h × orig_w)
  • Test on Jetson Orin (pinned-memory + async-transfer wins should scale better there)

claude and others added 4 commits April 28, 2026 22:33
… stream sync reduction

Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine,
example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes
below + enabling the existing CUDA-graph cache:

  Baseline (no changes)                            93.07 FPS
  + Triton preprocess (fused resize+BGR2RGB+norm)  ~93 FPS   (U6)
  + U7 mask-decode skip for empty masks            ~94 FPS   (flag-gated)
  + Triton postprocess conf-filter                  98.6 FPS (+5.9%)
  + ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True  102.1 FPS  (+9.7%)
  + Drop pre/post stream syncs                    102.2 FPS  (+9.8%)

Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs
cv2.resize rounding at mask boundaries). Unit tests pass (11/11).

Changes (all flag-gated, opt-in):

inference_models/models/rfdetr/triton_preprocess.py (new)
  One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet
  normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA
  kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.
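The per-pixel part of that fusion (everything except the resize) can be written as a short reference. The mean and std values below are the commonly used ImageNet constants and are an assumption here, since the commit message does not list them.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # RGB order, assumed standard values
IMAGENET_STD = (0.229, 0.224, 0.225)   # RGB order, assumed standard values


def normalize_bgr_pixel(b: int, g: int, r: int):
    """BGR uint8 pixel -> normalized RGB floats (swap, /255, mean/std)."""
    rgb = (r, g, b)  # BGR -> RGB
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


pure_red = normalize_bgr_pixel(0, 0, 255)
```

Doing all three steps in one pass means each input pixel is read once and each output value is written once.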

inference_models/models/rfdetr/triton_postprocess.py (new)
  One Triton kernel fusing sigmoid + argmax-over-classes + class-remap +
  confidence-threshold filter. Replaces ~14k small cub/torch kernels with
  431 (1 per frame). Supports both per-class threshold vector and scalar,
  with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.

inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
  - Wire the Triton preprocess fast-path in pre_process() with a guarded
    dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
  - Cache pre-allocated input buffer and normalization constants on model
    instance on first call.
  - Replace pre_process_stream.synchronize() with a CUDA event ev.wait()
    on the inference stream so the CPU doesn't stall waiting for the
    preprocessing Triton kernel to finish.
  - Drop the post_process_stream.synchronize() (the adapter's subsequent
    .cpu() calls provide the implicit sync).

inference_models/models/rfdetr/common.py
  Wire the Triton postprocess conf-filter into
  post_process_instance_segmentation_results. Falls back to torch path
  when the model has no remapping table, is CPU-bound, or Triton is
  unavailable.

inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
  Same Triton preprocess kernel + dispatch for the legacy inference
  package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS
  default routes to inference_models adapters) but kept for parity so the
  legacy path benefits if exercised.

inference/core/models/inference_models_adapters.py
  GPU mask-decode fast-path (U7): reduce mask emptiness with .any(dim=(1,2))
  on GPU, only DtoH + cv2.findContours non-empty masks. Gated via
  RFDETR_GPU_POSTPROCESS=true (default on). Produces identical output to
  the reference path.

Env vars introduced:
  RFDETR_USE_TRITON_PREPROC=true         opt-in; fused preproc kernel
  RFDETR_TRITON_POSTPROC=true            opt-in; fused postproc conf filter
  RFDETR_GPU_POSTPROCESS=true            default on; GPU mask emptiness skip
  RFDETR_DISABLE_GPU_PREPROC=true        opt-out; disable torch GPU preproc
  ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True   enables existing TRT CUDA graph cache

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s (+3.3% FPS vs partial)

Experimental follow-up to the shipped partial-fusion PR. Replaces the whole
post-TRT chain (sigmoid, argmax over classes, class remap, conf threshold,
xywh->xyxy, denorm to orig image, padding subtract, scale divide, clip,
mask bilinear upsample, threshold>0) with two Triton kernel launches.

Results (5-run avg on Tesla T4, example_video, 431 frames):

  Baseline                                     93.07 FPS
  Partial fusion (RFDETR_TRITON_POSTPROC)     102.74 FPS  (+10.4%)
  FULL fusion v1 (RFDETR_TRITON_FULLPOSTPROC) 101.07 FPS  (slower than partial!)
  FULL fusion v2 (compact mask kernel)        106.05 FPS  (+13.9% vs baseline, +3.3% vs partial)

Why v1 was slower:
  The fixed-grid "run mask kernel on all 100 queries, zero out filtered rows"
  approach did 30x more GPU work than torch because only 3-7 masks typically
  survive filtering out of 100, but the kernel processed all of them (10.2M
  pixels/frame vs torch's ~300k).

v2 fix:
  Added `_rfdetr_fullpost_mask_kernel_compact` which takes an explicit
  (n_survivors,) survivor_idx tensor. Host uses keep.nonzero() + a small
  DtoH sync to size the kernel launch. Trades one tiny sync (<5 us) for
  30x less compute in the mask upsample.

nsys kernel counts (per 431-frame run):
                          total_k   gpu_ms  mask_kernel_ms
  Partial                  117,973    2011          10.9
  Full v1 (bad)             10,394     393         342.0
  Full v2 (good)            12,315      82           3.5

Parity: 4/431 frames differ by +/-1 detection vs baseline, same 4 frames as
the shipped partial fusion (Triton bilinear vs torchvision antialias at mask
edges). Unit tests pass (11/11).

Files:

  inference_models/models/rfdetr/triton_fullpostproc.py (new)
    _rfdetr_fullpost_filter_kernel: num_queries programs; per-query fuses
      sigmoid(max(logits)) + argmax + class remap + conf threshold + box
      denorm + xywh->xyxy + pad subtract + scale divide + clip + store
      padded outputs (keep flag for filtered rows).
    _rfdetr_fullpost_mask_kernel_compact: n_survivors * tile_y * tile_x
      programs; per-tile bilinear upsample 78x78 -> orig_h x orig_w with
      survivor_idx lookup, threshold > 0, store as uint8.

  inference_models/models/rfdetr/common.py (modified)
    New fast path at top of post_process_instance_segmentation_results
    gated by RFDETR_TRITON_FULLPOSTPROC=true. Guards: batch=1,
    nonsquare_intermediate_size is None, static_crop_offset is (0,0),
    classes_re_mapping is not None (rfdetr-seg-nano default).
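What the mask kernel computes per output pixel can be written as a slow plain-Python reference. The half-pixel-center sampling convention and edge clamping are assumptions; the PR does not spell out which convention the Triton kernel uses.

```python
import math


def bilinear_sample(mask, y, x):
    """Sample a 2D list at fractional (y, x), clamping to the edges."""
    h, w = len(mask), len(mask[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = mask[y0][x0] * (1 - fx) + mask[y0][x1] * fx
    bottom = mask[y1][x0] * (1 - fx) + mask[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def upsample_and_threshold(mask, out_h, out_w):
    """Inverse-map each output pixel to mask coords, sample, threshold > 0."""
    in_h, in_w = len(mask), len(mask[0])
    out = []
    for oy in range(out_h):
        sy = (oy + 0.5) * in_h / out_h - 0.5  # assumed half-pixel centers
        row = []
        for ox in range(out_w):
            sx = (ox + 0.5) * in_w / out_w - 0.5
            row.append(1 if bilinear_sample(mask, sy, sx) > 0 else 0)
        out.append(row)
    return out


binary = upsample_and_threshold([[-1.0, 1.0], [-1.0, 1.0]], 4, 4)
```

Small differences in the sampling convention shift values near zero across the threshold, which is consistent with the ±1 detection differences reported at mask edges.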

Env var: RFDETR_TRITON_FULLPOSTPROC=true

Notes / future work:
- The shipped `RFDETR_TRITON_POSTPROC` partial-fusion flag is still useful
  as a simpler, more-portable fallback. This full-fusion path covers more
  ground but has tighter config guards.
- The key lesson: "fuse into fixed-shape padded output" is only a win when
  the kernel's cost scales with inputs (the filter step), NOT outputs (the
  mask resize). Per-output-pixel kernels need compaction.
…on (+1.8% FPS)

Follow-up optimizations on the full-postproc fusion to eliminate the
downstream plumbing kernels still visible in nsys after PR #23.

Results (5-run avg on Tesla T4, example_video, 431 frames):

  Baseline                             93.07 FPS
  Partial fusion (PR #22)             102.74 FPS  (+10.4%)
  Full fusion v2 (PR #23 initial)     106.05 FPS  (+13.9%)
  Full fusion v3 (this commit)        107.94 FPS  (+15.9%)

Changes:

1. `triton_fullpostproc.py`:
   - Filter kernel: rounds + casts xyxy to int32 inside the kernel
     (replaces a downstream `.round().int()` elementwise kernel).
   - Mask kernel: accumulates per-survivor `mask_any` via tile-level
     `tl.atomic_max`, eliminating a downstream `det.mask.any(dim=(1,2))`
     reduce. Writes to compact `(n_survivors, H, W)` output instead of
     padded `(num_queries, H, W)` — removes the 20 ms/run `torch.zeros`
     fill kernel.
   - Host wrapper: returns compact tensors directly (xyxy_int32,
     conf, cls_id, mask_bin uint8, mask_any bool), all sized to
     n_survivors. The single `keep.nonzero()` remains as the only DtoH
     sync in the fused path.

2. `common.py`:
   - Build `InstanceDetections` with compact tensors directly; skip
     the `xyxy[keep].round().int()` / `.bool()` / `[keep]` indexing
     chain that added 5 kernels per frame.
   - Attach precomputed `mask_any` onto the detections object via
     `__dict__` so the adapter can use it without recomputation.

3. `inference_models_adapters.py`:
   - Use `getattr(det, "mask_any", None)` from the full-fusion path
     when available; fall back to `.any(dim=(1,2))` otherwise.
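The attach-and-fallback pattern from items 2 and 3 can be sketched as follows. The `Detections` dataclass is a hypothetical stand-in for `InstanceDetections`, and nested lists replace tensors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detections:  # hypothetical stand-in for InstanceDetections
    mask: List[List[List[int]]]


def mask_any_for(det) -> List[bool]:
    """Use the precomputed flags if present, else reduce each mask."""
    cached: Optional[List[bool]] = getattr(det, "mask_any", None)
    if cached is not None:
        return cached
    return [any(any(row) for row in m) for m in det.mask]  # fallback reduce


det = Detections(mask=[[[0, 0], [0, 1]], [[0, 0], [0, 0]]])
fallback = mask_any_for(det)               # computed from the masks
det.__dict__["mask_any"] = [True, False]   # producer attaches precomputed flags
fast = mask_any_for(det)                   # consumer picks them up
```

Because the consumer only probes for the attribute, detections produced by the non-Triton path keep working unchanged.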

nsys kernel count shrinkage (per 431-frame run):
                          total_k   legacy_pp   pp_ms
  FULL v2                  12,315     10,900    60ms
  FULL v3                   7,026      4,669    17ms   (-43% total, -57% pp, -72% pp_ms)

Specifically eliminated:
- DeviceSelectSweep/Reduce/CompactInit triple: 2155 -> 431 each (5 nonzero()
  calls per frame collapsed to 1).
- FillFunctor<uint8>: 431 -> 0 (no more padded-output torch.zeros).
- or_kernel_cuda reduce: 314 -> 0 (mask_any precomputed by Triton).
- round_kernel_cuda: 314 -> 0 (xyxy rounded inside Triton kernel).
- direct_copy_kernel: 1804 -> ~2746 total (slight increase due to index_select
  on scalar fields, but they're tiny).

Parity: 4/431 frames differ by +/-1 detection vs baseline — same 4 frames
as PR #22 and PR #23 v2, no new divergence. Unit tests pass (11/11).
…op of v3)

Results (5-run avg, Tesla T4):

  Baseline                    93.07 FPS
  PR #23 v3                  107.94 FPS  (+15.9% vs baseline)
  + this commit (v8)         119.0  FPS  (+27.9% vs baseline, +10.3% vs v3)

Changes:

1. Atomic-counter compaction in filter kernel (_rfdetr_fullpost_filter_kernel):
   - Added `counter` param (int32 atomic). Each surviving query does
     `slot = tl.atomic_add(counter_ptr, 1)` and writes its outputs to
     compact[slot], skipping filtered queries entirely.
   - Added `survivor_idx_out_ptr`: filter kernel records pid->slot mapping
     so the mask kernel can gather the right input row without a separate
     nonzero() pass on the host.
   - Added `mask_any_out_ptr`: filter kernel zeroes mask_any[slot] at write
     time; mask kernel then atomic_max's up to 1 only if any pixel survives
     the threshold. Removes a torch.zeros launch per frame.
   - Result: collapsed ~13 downstream kernels (keep.bool, nonzero's 3-cub
     pipeline, 3 index_select gathers, torch.zeros, .bool() cast, etc.)
     into one kernel.
   - Output ordering is non-deterministic across survivors (atomic-add
     ordering), but downstream doesn't require sorted order — we already
     skipped torch.sort in earlier work.

2. Removed TRT output-buffer clone (trt.py:711):
   - Graph replay writes to fixed output buffers every frame. We used to
     .clone() 3 times per frame (3 DtoD memcpys) to protect postproc
     readers from the next frame's replay overwriting data.
   - Replaced with an explicit consumer_done cuda.Event on TRTCudaGraphState:
     postproc records when it's done reading, next TRT replay .wait()s on
     the event before submitting. Zero DtoD clones, proper cross-stream
     ordering.

3. Removed the outer stream.synchronize() in infer_from_trt_engine:
   - Was blocking CPU after every TRT forward. Replaced with a cuda.Event
     recorded on the TRT graph stream; postproc consumer .wait()s on that
     event from the post_process_stream. Forward returns to CPU immediately.

4. Cached threshold tensor (triton_fullpostproc._prepare_threshold):
   - Was allocating `torch.tensor([thr])` on GPU every frame (4B HtoD).
   - Now cached per (value, device) — single HtoD across the run.

5. Adapter simplification (inference_models_adapters.py):
   - Removed the GPU-side mask_any filter + masked_select + nonempty_idx
     DtoH/HtoD roundtrip. det.mask from the Triton path is already
     compact (n_survivors rows); masks2poly fast-skips empty masks
     internally via np.any, so CPU-side filtering is adequate.
   - Removed 3 cub kernels (DeviceReduce/Compact/Select) + 1 gather
     kernel + 1 torch.as_tensor HtoD per frame.
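The threshold caching from item 4 amounts to a small memoization keyed by (value, device). In this sketch `make_tensor` is a hypothetical factory standing in for the GPU tensor upload, so the example runs without a GPU.

```python
_THRESHOLD_CACHE = {}


def prepare_threshold(value: float, device: str, make_tensor):
    """Return a cached threshold object, creating it on first use only."""
    key = (float(value), device)
    tensor = _THRESHOLD_CACHE.get(key)
    if tensor is None:
        tensor = make_tensor(value, device)  # e.g. one HtoD upload
        _THRESHOLD_CACHE[key] = tensor
    return tensor


uploads = []


def fake_upload(value, device):
    """Records each 'upload' so the cache behavior is observable."""
    uploads.append((value, device))
    return [value]


t1 = prepare_threshold(0.5, "cuda:0", fake_upload)
t2 = prepare_threshold(0.5, "cuda:0", fake_upload)  # cache hit, no upload
```

Keying on the device as well as the value keeps the cache correct if the same threshold is used on more than one GPU.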

Parity: 4/431 frames differ by +/-1 detection vs baseline — same 4 frames as
prior PR (Triton bilinear vs torchvision antialias). Unit tests pass (11/11).

nsys per-frame postproc bubble ratio: 83% (v3) -> 35% (v8).
…Triton kernels

The filter->mask bubble in nsys was 146us per frame, caused by `int(counter.item())`
blocking the host between the two Triton launches. Fix: launch the mask kernel
with `grid=(num_queries, tile_y, tile_x)` and have each program read `counter[0]`
on the GPU and early-exit if its `s` index exceeds `n_survivors`. Host reads
counter only AFTER the mask kernel is queued, letting the two kernels pipeline
on the stream.

Tradeoffs:
- Mask kernel runtime grew 27us -> 112us (scheduler iterates ~100 programs,
  most early-exit). Net gain: ~61us/frame.
- Still need one DtoH (4 bytes) at the very end to get the final n_survivors
  for tensor views. That's the single remaining mandatory sync.

Also cached the mask_bin tensor at (num_queries, H, W) on module scope to
skip the per-frame torch.empty() launch (~20us saved).

nsys per-frame postproc bubble ratio: 35% (v8) -> 31% (v10).

Results:
  v8  avg FPS: 119.0
  v10 avg FPS: 119.75  (+0.6% over v8, +28.7% over baseline)

Parity: 4/431 frames differ by +/-1 detection vs baseline (same 4 as before).
Unit tests pass (11/11).
…turn path

Two torch .to(torch.bool) kernels per frame (431 + 628 launches) from:
1. common.py:125 `mask=mask_bin.to(torch.bool)` — unused downstream
2. triton_fullpostproc.py return `mask_any_view.bool()` — also unused

Both are trivial but remove ~1000 kernel launches per run. Small FPS gain
(119.75 -> 120.37, +0.5%) and cleaner profile.

nsys per-frame postproc bubble ratio: 31% (v10) -> 28% (v11).

Remaining bubbles are dominated by:
- 87us initial wait for TRT engine (unavoidable without pipelining frames)
- 173us CPU-Python overhead between counter.item() sync and first .cpu()
- 3x 25-40us tiny DtoH gaps (launch overhead, not transfer time)

These all stem from Python interpreter cost around `.cpu()` calls. Further
reduction needs structural changes (frame pipelining) or C-extension work.

Parity: 4/431 frames differ (same 4). Unit tests 11/11 pass.
…ansfer

Prior layout had 4 separate small .cpu() calls in the adapter:
  4B    counter.item()
  48B   xyxy.cpu()
  12B   conf.cpu()
  12B   class_id.cpu()

Each has ~25-40us of Python+sync overhead. Collapsing them to 1 transfer
cuts 4 events with 3 bubbles down to 2 events with 1 bubble.

Change: filter kernel now writes a single combined (num_queries, 6) int32
buffer packing [x1, y1, x2, y2, conf_as_int32_bits, class_id] per slot.
Adapter does one .cpu() of combined and bitcasts column 4 to fp32 on the
host side via numpy.view(np.float32). counter.item() stays inside the
Triton wrapper (must be inside the postproc stream context).

nsys per-frame postproc window (after mask kernel -> final DtoH):
  v11: 4 small DtoH + 3 bubbles = ~130us of plumbing
  v13: 1 small DtoH + 1 bubble  =  ~75us of plumbing

Bubbles: 28% (v11) -> 23% (v13).

FPS: 120.37 (v11) -> 120.49 (v13). Small absolute FPS change because the
saved plumbing time is already tiny in absolute terms — the wins on this
scale are about reducing bubble density, which matters more on slower
devices (Jetson).

Parity: 4/431 frames differ (same 4). Unit tests 11/11 pass.
…ansfers

Replace the adapter's two synchronous .cpu() calls with pinned-memory
.copy_(non_blocking=True) + a single stream.synchronize(). Both the 72B
combined buffer and the 307KB mask buffer now pipeline on the copy engine
in parallel rather than chaining.

Pinned buffers cached per (name, dtype) and reused across frames,
growing if a larger shape is ever requested (unlikely in practice —
num_queries is 100 and orig H/W fixed per video).

nsys per-frame postproc window:
  v13: 2 DtoH calls with 49us gap between = 95us for plumbing
  v14: 2 DtoH calls with 15us gap between = 68us for plumbing

Bubbles: 23% (v13) -> 24% (v14 — within noise).

FPS: 120.49 (v13) -> 121.21 (v14). Small since the transfer time was
already dwarfed by Python interpreter overhead (~200us between the
mask kernel ending and the adapter's first DtoH).

Parity: 4/431 frames differ (same 4). Unit tests 11/11 pass.
Two preprocess-side bubble fixes:

1. Eliminate the preproc_buffer -> graph_input_buffer DtoD memcpy.
   Previously every frame did:
     - Triton preproc wrote to self._fast_input_buffer (1x3x312x312 fp32)
     - TRT graph replay copied self._fast_input_buffer into its own
       input_buffer via trt_cuda_graph_state.input_buffer.copy_(...)
     - TRT replay read from its graph-owned input_buffer
   The DtoD was 1168128 B per frame = 508 MB over a 431-frame run.

   Fix: thread a "use external buffer" hint via a tensor attribute
   (_trt_reuse_as_input_buffer). When set, _capture_cuda_graph bakes the
   external tensor's address into the graph instead of allocating its own.
   On replay, we skip the DtoD entirely by checking data_ptr equality.

   Result: DtoD per run 439 -> 8 (98% reduction).

2. Pinned-memory HtoD for the raw BGR frame.
   Previously: torch.from_numpy(np.ascontiguousarray(images)).to(device,
     non_blocking=True) silently falls back to sync HtoD because the numpy
     source isn't in pinned memory.

   Fix: cache a pinned host buffer + GPU buffer on first call, copy the
   numpy frame into pinned with np.copyto (fast CPU memcpy), then
   tensor.copy_(pinned, non_blocking=True) for a genuine async HtoD.

5-run avg FPS: 121.21 (v14) -> 122.40 (v15). Parity: 4/431 frames differ
(same 4). Unit tests 11/11 pass.
@aseembits93 aseembits93 force-pushed the experiment/full-postproc-triton branch from 11b38e7 to df0d306 Compare April 29, 2026 05:22
@aseembits93 aseembits93 changed the title experiment(rfdetr-seg): fuse ENTIRE post-process into 2 Triton kernels (+13.9% vs baseline) perf(rfdetr-seg): Triton fusion + pipelining cuts post/pre-proc bubbles (93 → 122 FPS, +31.5%) Apr 29, 2026
…l order

Two Triton postproc tweaks to shrink numerical drift vs the non-Triton path:

1. Banker's rounding (half-to-even) for xyxy integer casts, matching
   torch.round().int() exactly. Catches the rare case where a bbox ends
   up at an exact x.5 boundary.

2. Reorder the bbox math so FP32 evaluation order matches the baseline:
     x_min_pct = cx_pct - 0.5 * w_pct
     x_min_px  = x_min_pct * inference_w
     x_min     = (x_min_px - pad_left) * inv_scale_w
   Previously we scaled early (cx * W, then sub w*W*0.5) which gave
   different FP32 rounding in the subtract-scaled-values step.
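The two tweaks can be illustrated in plain Python. Note that Python floats are 64-bit, so this sketch shows the rounding rule and the evaluation order but does not reproduce FP32 rounding differences; the function names are hypothetical.

```python
import math


def round_half_up(x: float) -> int:
    """The common floor(x + 0.5) shortcut; ties always go up."""
    return int(math.floor(x + 0.5))


def round_half_even(x: float) -> int:
    """Banker's rounding; Python's round() already rounds ties to even."""
    return int(round(x))


def x_min_baseline_order(cx_pct, w_pct, inference_w, pad_left, inv_scale_w):
    """Same order of operations as the three lines quoted above."""
    x_min_pct = cx_pct - 0.5 * w_pct
    x_min_px = x_min_pct * inference_w
    return (x_min_px - pad_left) * inv_scale_w


ties = [(v, round_half_up(v), round_half_even(v)) for v in (0.5, 1.5, 2.5, 3.5)]
```

The two rounding rules only disagree on exact .5 values with an even integer part, which is why the mismatch is rare in practice.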

Verified via per-frame parity capture on vehicles_312px.mp4 (538 frames):

  Config                                       Exact match   max |Δconf|  max |Δxyxy|
  Baseline                                     100.0%        0            0
  + CUDA graphs only                           100.0%        0            0
  + Triton postproc (no preproc)                99.8%        0            1 px
  + Triton postproc + Triton preproc             0.2%        0.44         119 px

The Triton post-process path is effectively at parity with the baseline
(99.8% of frames match exactly, zero confidence deltas, and a maximum
1-pixel bbox shift on one frame).

All remaining drift comes from the Triton PREPROCESS path: our Triton
bilinear resize matches torch.nn.functional.interpolate(bilinear) to
1e-6, but the baseline uses cv2.resize which has integer-arithmetic
bilinear semantics. The ~0.01 preproc pixel drift cascades through the
TRT engine to give ~1% conf drift + 9% of bboxes shifted by 1 pixel.

No semantic regression: detection totals match exactly (1899 = 1899),
detection counts off by 1 on 12/538 frames (2.2%, all at
marginal-confidence thresholds), no det with conf > 0.5 ever disappears.
NMS tie-break flips produce the few outlier deltas.
aseembits93 (Owner, Author) commented

Closing in favor of #22. Both PRs cover the same optimization arc on the same files; #22's branch (optimize-rfdetr-seg) is strictly ahead of #23's branch (experiment/full-postproc-triton) — every line in #23 is in #22, plus the full Triton postproc fusion, deferred counter sync (W2), and scratch caching (W7) on top. Current benchmark on vehicles_312px.mp4 (538 frames, Tesla T4 FP16): 109 FPS → 151 FPS (+38%), 0-diff parity vs v16 best across all 538 frames.
