
Perf/optimize rfdetr seg plus is seg dataclasses copy #31

Open
aseembits93 wants to merge 8 commits into main from perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copy

@aseembits93
Owner

No description provided.

claude and others added 8 commits April 28, 2026 22:33
… stream sync reduction

Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine,
example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes
below + enabling the existing CUDA-graph cache:

  Baseline (no changes)                            93.07 FPS
  + Triton preprocess (fused resize+BGR2RGB+norm)  ~93 FPS   (U6)
  + U7 mask-decode skip for empty masks            ~94 FPS   (flag-gated)
  + Triton postprocess conf-filter                  98.6 FPS (+5.9%)
  + ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True  102.1 FPS  (+9.7%)
  + Drop pre/post stream syncs                    102.2 FPS  (+9.8%)

Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs
cv2.resize rounding at mask boundaries). Unit tests pass (11/11).

Changes (all flag-gated; opt-in unless noted as default on):

inference_models/models/rfdetr/triton_preprocess.py (new)
  One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet
  normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA
  kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.

inference_models/models/rfdetr/triton_postprocess.py (new)
  One Triton kernel fusing sigmoid + argmax-over-classes + class-remap +
  confidence-threshold filter. Replaces ~14k small cub/torch kernels with
  431 (1 per frame). Supports both per-class threshold vector and scalar,
  with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.

inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
  - Wire the Triton preprocess fast-path in pre_process() with a guarded
    dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
  - Cache pre-allocated input buffer and normalization constants on model
    instance on first call.
  - Replace pre_process_stream.synchronize() with a CUDA event ev.wait()
    on the inference stream so the CPU doesn't stall waiting for the
    preprocessing Triton kernel to finish.
  - Drop the post_process_stream.synchronize() (the adapter's subsequent
    .cpu() calls provide the implicit sync).

inference_models/models/rfdetr/common.py
  Wire the Triton postprocess conf-filter into
  post_process_instance_segmentation_results. Falls back to torch path
  when the model has no remapping table, is CPU-bound, or Triton is
  unavailable.

inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
  Same Triton preprocess kernel + dispatch for the legacy inference
  package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS
  default routes to inference_models adapters) but kept for parity so the
  legacy path benefits if exercised.

inference/core/models/inference_models_adapters.py
  GPU mask-decode fast-path (U7): reduce mask emptiness with .any(dim=(1,2))
  on GPU, only DtoH + cv2.findContours non-empty masks. Gated via
  RFDETR_GPU_POSTPROCESS=true (default on). Produces identical output to
  the reference path.
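
For readers who want the per-element math of the two fused kernels described above without reading Triton, here is a plain-Python reference sketch. It is not the kernel code from this PR. The function names, the argument layout, the `>=` comparison, indexing the threshold vector by the remapped class id, and the ImageNet mean/std constants (the commonly used values) are all assumptions, and the resize step is omitted.

```python
import math

# Commonly used ImageNet statistics (assumed; the PR does not list the values).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_bgr_pixel(b, g, r):
    """BGR uint8 pixel -> normalized RGB floats (resize omitted)."""
    rgb = (r / 255.0, g / 255.0, b / 255.0)  # BGR -> RGB, then /255
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def conf_filter(logits, threshold, class_remap=None):
    """Reference for sigmoid + argmax + class remap + confidence filter.

    logits: one list of per-class logits per query.
    threshold: a scalar, or a list indexed by the output class id (assumed).
    class_remap: optional list mapping raw class index -> output class id.
    Returns (query_index, class_id, confidence) for each surviving query.
    """
    kept = []
    for q, row in enumerate(logits):
        # Sigmoid is monotonic, so argmax over logits equals argmax over scores.
        raw_cls = max(range(len(row)), key=row.__getitem__)
        conf = sigmoid(row[raw_cls])
        cls = class_remap[raw_cls] if class_remap is not None else raw_cls
        thr = threshold[cls] if isinstance(threshold, (list, tuple)) else threshold
        if conf >= thr:
            kept.append((q, cls, conf))
    return kept
```

A loop like this is useful mainly as a parity oracle: run it on a few frames of raw model output and compare against whatever the fused kernel emits.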

Env vars introduced:
  RFDETR_USE_TRITON_PREPROC=true         opt-in; fused preproc kernel
  RFDETR_TRITON_POSTPROC=true            opt-in; fused postproc conf filter
  RFDETR_GPU_POSTPROCESS=true            default on; GPU mask emptiness skip
  RFDETR_DISABLE_GPU_PREPROC=true        opt-out; disable torch GPU preproc
  ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True   enables existing TRT CUDA graph cache
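
A minimal sketch of how flags like these could be read, assuming they are plain boolean environment variables. The helper names `env_flag` and `read_rfdetr_flags`, the accepted truthy spellings, and the default of False for ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND are assumptions for illustration, not code from this PR.

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}  # assumed set of accepted spellings


def env_flag(name, default, environ=None):
    """Read a boolean flag; fall back to `default` when the variable is unset."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


def read_rfdetr_flags(environ=None):
    """Collect the flags listed above with their documented defaults."""
    defaults = {
        "RFDETR_USE_TRITON_PREPROC": False,   # opt-in
        "RFDETR_TRITON_POSTPROC": False,      # opt-in
        "RFDETR_GPU_POSTPROCESS": True,       # default on
        "RFDETR_DISABLE_GPU_PREPROC": False,  # opt-out switch
        # Default not stated in the commit message; False is an assumption.
        "ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND": False,
    }
    return {name: env_flag(name, dflt, environ) for name, dflt in defaults.items()}
```

Passing an explicit mapping instead of `os.environ` keeps the helper easy to test, for example `read_rfdetr_flags({"RFDETR_TRITON_POSTPROC": "True"})`.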

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (W2)

Adds triton_fullpostproc.py with two fused Triton kernels that replace the
entire post-TRT chain for the common rfdetr-seg-nano path (batch=1, no
static crop, stretch-to resize, class remapping active):

  _rfdetr_fullpost_filter_kernel  (grid = num_queries)
    sigmoid + argmax + class remap + conf threshold + cxcywh->xyxy +
    letterbox-denormalize + clip + round; atomic_add into counter to reserve
    a compact output slot.

  _rfdetr_fullpost_mask_kernel_compact  (grid = num_queries * tile_y * tile_x)
    GPU-side bilinear upsample 78x78 -> orig_h x orig_w + threshold > 0 +
    uint8 emit. Early-exits on s >= counter[0] without an intermediate sync.
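
The filter stage above can be sketched sequentially in plain Python. This is an illustration only: the function names, the clip range of [0, orig_w] and [0, orig_h], and the use of Python's `round` are assumptions, and the loop counter stands in for the kernel's atomic_add (on the GPU, slot order among survivors is not deterministic).

```python
def decode_box(cx, cy, w, h, orig_w, orig_h):
    """Normalized cxcywh -> integer pixel xyxy, clipped to the image."""

    def clip_round(value, upper):
        # Clip first, then round, following the order listed above.
        return int(round(min(max(value, 0.0), float(upper))))

    x1 = clip_round((cx - w / 2.0) * orig_w, orig_w)
    y1 = clip_round((cy - h / 2.0) * orig_h, orig_h)
    x2 = clip_round((cx + w / 2.0) * orig_w, orig_w)
    y2 = clip_round((cy + h / 2.0) * orig_h, orig_h)
    return x1, y1, x2, y2


def filter_and_compact(boxes, confs, threshold, orig_w, orig_h):
    """Keep queries above threshold and pack them into the front of `out`."""
    num_queries = len(boxes)
    out = [None] * num_queries  # scratch sized for the worst case
    counter = [0]               # single-element counter, like the GPU buffer
    for q in range(num_queries):
        if confs[q] < threshold:
            continue
        slot = counter[0]       # sequential stand-in for atomic_add(counter, 1)
        counter[0] += 1
        out[slot] = (q,) + decode_box(*boxes[q], orig_w, orig_h)
    return out, counter[0]
```

Only the first `counter[0]` entries of `out` are meaningful, which is why a consumer has to learn the counter value before it can slice the outputs.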

Adapter (inference_models_adapters.py):

  - New fast path keyed on _combined_gpu/_counter_gpu/_postproc_done_event
    side-channels. The adapter stream waits on the done_event, copies the
    4-byte counter to pinned host memory, syncs once to read n_survivors,
    then slices combined and mask to n_survivors, copies both to pinned
    host memory asynchronously, and syncs again.

  - Replaces the prior in-Triton int(counter.item()) that CPU-blocked the
    postproc stream. Same number of host-visible syncs (2), but the first
    is a 4-byte DtoH instead of a stream drain, and both are on a dedicated
    pinned path so the copy engine overlaps with the compute engine.

TRT graph plumbing (common/trt.py, rfdetr_instance_segmentation_trt.py):

  - Records a produce_event on the graph's own stream so consumers can
    wait_event instead of stream.synchronize(). Removes the unconditional
    stream.synchronize() in infer_from_trt_engine's graph-replay branch.

  - consumer_done_event field on TRTCudaGraphState lets the next graph
    replay chain on the consumer's last use of the output buffers.

  - _trt_reuse_as_input_buffer marker so fast preproc can write directly
    into the graph's captured input buffer, eliminating the per-frame DtoD.

Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 engine):

  v16 baseline (Triton preproc + postproc + CUDA graph)   150 FPS
  + triton_fullpost + deferred counter sync (this commit)  151 FPS

Parity vs v16 baseline: 0-diff across all 538 frames (bit-exact xyxy,
conf, class_id, and mask MD5 per detection).

Env flags:

  RFDETR_TRITON_FULLPOSTPROC=true   opt-in; enables the full-fusion path

Two per-frame CUDA kernel launches visible in nsys on the v16 full-postproc
path that shouldn't be there:

  - direct_copy_kernel_cuda  (538 per 538-frame run on vehicles_312px)
  - vectorized_elementwise_kernel<FillFunctor<int>>  (538 / 538)

direct_copy was class_mapping.to(dtype=torch.int32) firing every frame —
upstream stores the mapping as int64, our Triton kernel needs int32, and
the wrapper re-converts on every call since the dtype check always fails.
Cache the converted view keyed by id(source_tensor).
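
A minimal sketch of a conversion cache keyed by `id(source)`, using the standard-library `array` module as a stand-in for tensors. The class and function names are hypothetical. One assumption worth calling out: the cache keeps a strong reference to the source object, because `id()` values can be reused once an object is garbage-collected.

```python
from array import array


class ConvertedViewCache:
    """Cache a converted copy of a source object, keyed by id(source)."""

    def __init__(self, convert):
        self._convert = convert
        self._entries = {}    # id(source) -> (source, converted)
        self.conversions = 0  # counts real conversions, for inspection

    def get(self, source):
        entry = self._entries.get(id(source))
        # The stored source keeps the object alive so its id stays unique;
        # the identity check guards against a stale entry as well.
        if entry is not None and entry[0] is source:
            return entry[1]
        converted = self._convert(source)
        self.conversions += 1
        self._entries[id(source)] = (source, converted)
        return converted


def to_int32(mapping_i64):
    """Stand-in for class_mapping.to(dtype=torch.int32)."""
    return array("i", list(mapping_i64))
```

A cache like this does not notice in-place mutation of the source, so it only suits mappings that are fixed after model load.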

FillFunctor was torch.zeros((1,), ...) for the atomic counter + torch.empty
for the three output scratch buffers. Moving to a persistent scratch cache
keyed on (num_queries, device) drops 3 torch.empty allocator calls per
frame and replaces torch.zeros with an explicit counter.zero_() (still
launches FillFunctor — no safe way to inline into the filter kernel since
concurrent blocks would race with the zero — but eliminates allocator
pressure and stabilizes pointer values for the Triton JIT cache).
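
The persistent scratch cache can be illustrated with plain Python lists standing in for device buffers. The class name, the buffer names (`boxes`, `confs`, `class_ids`), and their shapes are assumptions; the commit message only states that there are three output buffers plus a counter, keyed on (num_queries, device).

```python
class ScratchCache:
    """Allocate scratch once per (num_queries, device) and reuse it."""

    def __init__(self):
        self._buffers = {}
        self.allocations = 0  # counts first-time allocations, for inspection

    def get(self, num_queries, device):
        key = (num_queries, device)
        scratch = self._buffers.get(key)
        if scratch is None:
            scratch = {
                "counter": [0],
                "boxes": [0.0] * (num_queries * 4),
                "confs": [0.0] * num_queries,
                "class_ids": [0] * num_queries,
            }
            self._buffers[key] = scratch
            self.allocations += 1
        # Reset in place so the buffer identity is unchanged, analogous to
        # counter.zero_(). The output buffers are overwritten by the next
        # frame and do not need clearing.
        scratch["counter"][0] = 0
        return scratch
```

Because the same objects come back on every call, a consumer sees stable addresses from frame to frame, which is the property the commit message ties to future CUDA-graph capture.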

After W7 the per-frame kernel launch count drops from 2 incidental-torch
kernels to 1, the 3 allocator calls are eliminated, and the adapter sees
stable-address scratch across frames (latent prerequisite for CUDA-graph
capture of the postproc path).

Impact:
  - direct_copy: 538 -> 0 (-100%)
  - FillFunctor: 538 -> 538 (unchanged; counter.zero_ still required)
  - torch.empty calls: 3/frame -> 0
  - Parity: 0-diff vs v16 best across 538 frames of vehicles_312px.
  - End-to-end FPS: 150 -> 151 (noise-level; serial CPU dispatch is the
    binding constraint, not mask kernel GPU time).
…path

Stacks on top of PR#22 (optimize-rfdetr-seg: Triton fusion + CUDA graphs
+ scratch caching). See PR#28 for the same change against main.

`InferenceModelsInstanceSegmentationAdapter.postprocess` built a full
pydantic tree per frame — `Point × V` per polygon vertex,
`InstanceSegmentationPrediction × N`, then
`InstanceSegmentationInferenceResponse`. The workflow block then called
`response.model_dump(by_alias=True, exclude_none=True)` to get a plain
dict for `sv.Detections.from_inference`. Neither validation nor the
serializer is needed on that path — the block only consumes the dict.

This change adds slotted dataclass twins (`PointDC`,
`InferenceResponseImageDC`, `InstanceSegmentationPredictionDC`,
`InstanceSegmentationInferenceResponseDC`) plus `_is_pred_dc_to_dict`
and `_is_response_dc_to_dict` helpers that emit the exact dict
`model_dump(by_alias=True, exclude_none=True)` produces (same keys,
same `class` alias, same None-omission).

The adapter gates on `kwargs.get("source") == "workflow-execution"`
and returns the dataclass response on that path. Every other caller —
HTTP `response_model` at `http_api.py:1640`, `isinstance`-based cache
dispatch at `cache/serializers.py:71`, `draw_predictions`
visualization — keeps the pydantic path untouched.

The v3 workflow block detects the dataclass via `isinstance` and calls
`_is_response_dc_to_dict`; it falls back to `model_dump` for any other
response type.
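
The gate and the dispatch described above can be sketched as follows. All names here (`ResponseSketch`, `build_response`, `to_plain_dict`, `FakePydanticResponse`) are hypothetical stand-ins; the fake class only imitates the `model_dump` call so the example runs without pydantic.

```python
from dataclasses import dataclass


@dataclass
class ResponseSketch:
    """Stand-in for the dataclass response returned on the workflow path."""
    predictions: list


class FakePydanticResponse:
    """Stand-in for a pydantic model; only model_dump is imitated."""

    def __init__(self, predictions):
        self.predictions = predictions

    def model_dump(self, by_alias=False, exclude_none=False):
        return {"predictions": list(self.predictions)}


def build_response(predictions, pydantic_factory, **kwargs):
    # Adapter side: only workflow execution gets the lightweight type.
    if kwargs.get("source") == "workflow-execution":
        return ResponseSketch(predictions=predictions)
    return pydantic_factory(predictions)


def to_plain_dict(response):
    # Block side: detect the dataclass, otherwise use the pydantic serializer.
    if isinstance(response, ResponseSketch):
        return {"predictions": list(response.predictions)}
    return response.model_dump(by_alias=True, exclude_none=True)
```

Keeping the fallback means a caller that still hands the block a pydantic response gets the same dict as before.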

Microbench (4 dets × 6-vertex polygon, construct + dump):
  * pydantic:  ~81 us/frame
  * dataclass: ~34 us/frame  (2.43x faster)

End-to-end (rfdetr-seg-nano TRT + Triton preproc + Triton fullpost +
CUDA graphs, vehicles_312px.mp4, 538 frames, 4 runs each, on top of
optimize-rfdetr-seg HEAD c1406a8):
  * baseline (pydantic): 152.93 FPS mean
  * dataclass:           156.54 FPS mean  (+3.6 FPS, +2.4%)

Bit-exact parity verified: `_is_response_dc_to_dict(dc)` byte-equals
`pyd.model_dump(by_alias=True, exclude_none=True)` for mixed inputs
(varying polygon lengths, empty list, mutation of .time/.inference_id
post-construct).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants