Perf/optimize rfdetr seg plus is seg dataclasses copy by aseembits93 · Pull Request #31 · aseembits93/inference

aseembits93 · 2026-04-30T23:09:26Z

No description provided.

… stream sync reduction

Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine, example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes below + enabling the existing CUDA-graph cache:

  Baseline (no changes)                              93.07 FPS
  + Triton preprocess (fused resize+BGR2RGB+norm)    ~93 FPS    (U6)
  + U7 mask-decode skip for empty masks              ~94 FPS    (flag-gated)
  + Triton postprocess conf-filter                   98.6 FPS   (+5.9%)
  + ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True     102.1 FPS  (+9.7%)
  + Drop pre/post stream syncs                       102.2 FPS  (+9.8%)

Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs cv2.resize rounding at mask boundaries). Unit tests pass (11/11).

Changes (all flag-gated, opt-in):

inference_models/models/rfdetr/triton_preprocess.py (new)
  One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.

inference_models/models/rfdetr/triton_postprocess.py (new)
  One Triton kernel fusing sigmoid + argmax-over-classes + class-remap + confidence-threshold filter. Replaces ~14k small cub/torch kernels with 431 (1 per frame). Supports both per-class threshold vector and scalar, with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.

inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
  - Wire the Triton preprocess fast-path in pre_process() with a guarded dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
  - Cache pre-allocated input buffer and normalization constants on model instance on first call.
  - Replace pre_process_stream.synchronize() with a CUDA event ev.wait() on the inference stream so the CPU doesn't stall waiting for the preprocessing Triton kernel to finish.
  - Drop the post_process_stream.synchronize() (the adapter's subsequent .cpu() calls provide the implicit sync).

inference_models/models/rfdetr/common.py
  Wire the Triton postprocess conf-filter into post_process_instance_segmentation_results. Falls back to torch path when the model has no remapping table, is CPU-bound, or Triton is unavailable.

inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
  Same Triton preprocess kernel + dispatch for the legacy inference package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS default routes to inference_models adapters) but kept for parity so the legacy path benefits if exercised.

inference/core/models/inference_models_adapters.py
  GPU mask-decode fast-path (U7): reduce mask emptiness with .any(dim=(1,2)) on GPU, only DtoH + cv2.findContours non-empty masks. Gated via RFDETR_GPU_POSTPROCESS=true (default on). Produces identical output to the reference path.

Env vars introduced:
  RFDETR_USE_TRITON_PREPROC=true                  opt-in; fused preproc kernel
  RFDETR_TRITON_POSTPROC=true                     opt-in; fused postproc conf filter
  RFDETR_GPU_POSTPROCESS=true                     default on; GPU mask emptiness skip
  RFDETR_DISABLE_GPU_PREPROC=true                 opt-out; disable torch GPU preproc
  ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True    enables existing TRT CUDA graph cache

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
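
For readers who want the semantics of the fused postprocess conf-filter without reading the Triton kernel, the following is a minimal pure-Python reference sketch of the four fused steps (sigmoid, argmax over classes, class remap, confidence threshold). It is not the code from triton_postprocess.py: the function name `conf_filter_reference`, the `>=` comparison, and the choice to index a per-class threshold vector by the raw (pre-remap) class index are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple, Union


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def conf_filter_reference(
    logits: Sequence[Sequence[float]],
    threshold: Union[float, Sequence[float]],
    class_remap: Optional[Sequence[int]] = None,
) -> List[Tuple[int, int, float]]:
    """Return (query_index, class_id, confidence) for every surviving query.

    Hypothetical reference only; the real kernel does this per query on GPU.
    """
    kept = []
    for query_index, row in enumerate(logits):
        # Sigmoid is monotonic, so argmax over raw logits equals argmax
        # over the sigmoid scores; only the winning logit needs a sigmoid.
        best = max(range(len(row)), key=row.__getitem__)
        confidence = sigmoid(row[best])
        # Assumption: a threshold vector is indexed by the raw class index.
        if isinstance(threshold, (int, float)):
            limit = float(threshold)
        else:
            limit = threshold[best]
        if confidence < limit:
            continue
        class_id = class_remap[best] if class_remap is not None else best
        kept.append((query_index, class_id, confidence))
    return kept


logits = [[-2.0, 3.0], [0.5, -1.0], [-3.0, -4.0]]
scalar_result = conf_filter_reference(logits, 0.5, class_remap=[10, 20])
vector_result = conf_filter_reference(logits, [0.7, 0.5], class_remap=[10, 20])
```

With the scalar threshold, queries 0 and 1 survive; with the per-class vector, query 1 is dropped because its winning class has the stricter 0.7 limit.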

… (W2)

Adds triton_fullpostproc.py with two fused Triton kernels that replace the entire post-TRT chain for the common rfdetr-seg-nano path (batch=1, no static crop, stretch-to resize, class remapping active):

- _rfdetr_fullpost_filter_kernel (grid = num_queries)
  sigmoid argmax + class remap + conf threshold + cxcywh->xyxy + letterbox-denormalize + clip + round; atomic_add into counter to reserve a compact output slot.
- _rfdetr_fullpost_mask_kernel_compact (grid = num_queries * tile_y * tile_x)
  GPU-side bilinear upsample 78x78 -> orig_h x orig_w + threshold > 0 + uint8 emit. Early-exits on s >= counter[0] without an intermediate sync.

Adapter (inference_models_adapters.py):
- New fast path keyed on _combined_gpu/_counter_gpu/_postproc_done_event side-channels. Adapter stream waits the done_event, pinned-DtoH's the 4-byte counter, syncs once to read n_survivors, then slices combined and mask to n_survivors and pinned-DtoH's both async, syncing again.
- Replaces the prior in-Triton int(counter.item()) that CPU-blocked the postproc stream. Same number of host-visible syncs (2), but the first is a 4-byte DtoH instead of a stream drain, and both are on a dedicated pinned path so the copy engine overlaps with the compute engine.

TRT graph plumbing (common/trt.py, rfdetr_instance_segmentation_trt.py):
- Records a produce_event on the graph's own stream so consumers can wait_event instead of stream.synchronize(). Removes the unconditional stream.synchronize() in infer_from_trt_engine's graph-replay branch.
- consumer_done_event field on TRTCudaGraphState lets the next graph replay chain on the consumer's last use of the output buffers.
- _trt_reuse_as_input_buffer marker so fast preproc can write directly into the graph's captured input buffer, eliminating the per-frame DtoD.

Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 engine):

  v16 baseline (Triton preproc + postproc + CUDA graph)      150 FPS
  + triton_fullpost + deferred counter sync (this commit)    151 FPS

Parity vs v16 baseline: 0-diff across all 538 frames (bit-exact xyxy, conf, class_id, and mask MD5 per detection).

Env flags:
  RFDETR_TRITON_FULLPOSTPROC=true    opt-in; enables the full-fusion path
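
The slot-reservation idea in the filter kernel (each surviving query bumps a counter and writes into the slot it reserved) can be sketched sequentially in plain Python. This is an illustration under stated assumptions, not the kernel itself: the function names are hypothetical, boxes are assumed to be normalized cxcywh in [0, 1], the denormalize step is shown as a plain scale by the original width and height, and Python's `round` stands in for whatever rounding mode the kernel uses. On a GPU the increment is an atomic add, so the order of survivors in the compact buffer is not guaranteed to match query order.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]


def cxcywh_to_xyxy_pixels(
    box: Sequence[float], orig_w: int, orig_h: int
) -> Box:
    """Convert one normalized (cx, cy, w, h) box to clipped integer xyxy."""
    cx, cy, w, h = box
    x1 = (cx - w / 2.0) * orig_w
    y1 = (cy - h / 2.0) * orig_h
    x2 = (cx + w / 2.0) * orig_w
    y2 = (cy + h / 2.0) * orig_h
    # Clip to the image bounds, then round to whole pixels.
    x1, x2 = (min(max(v, 0.0), float(orig_w)) for v in (x1, x2))
    y1, y2 = (min(max(v, 0.0), float(orig_h)) for v in (y1, y2))
    return (round(x1), round(y1), round(x2), round(y2))


def filter_and_compact(
    boxes: Sequence[Sequence[float]],
    confidences: Sequence[float],
    threshold: float,
    orig_w: int,
    orig_h: int,
) -> Tuple[List[Optional[Box]], List[int]]:
    """Write survivors into the front of a pre-sized output buffer."""
    out: List[Optional[Box]] = [None] * len(boxes)  # one slot per query
    counter = [0]  # stand-in for the 1-element device counter
    for box, confidence in zip(boxes, confidences):
        if confidence < threshold:
            continue
        slot = counter[0]  # on GPU: slot = atomic_add(counter, 1)
        counter[0] += 1
        out[slot] = cxcywh_to_xyxy_pixels(box, orig_w, orig_h)
    return out, counter


boxes = [(0.5, 0.5, 0.2, 0.4), (0.9, 0.9, 0.1, 0.1), (0.1, 0.1, 0.4, 0.4)]
out, counter = filter_and_compact(boxes, [0.9, 0.2, 0.8], 0.5, 100, 50)
survivors = out[: counter[0]]
```

A downstream stage only needs `counter[0]` to know how many leading slots are valid, which is why the adapter can copy back a 4-byte counter first and slice afterwards.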

Two per-frame CUDA kernel launches visible in nsys on the v16 full-postproc path that shouldn't be there:
- direct_copy_kernel_cuda (538 per 538-frame run on vehicles_312px)
- vectorized_elementwise_kernel<FillFunctor<int>> (538 / 538)

direct_copy was class_mapping.to(dtype=torch.int32) firing every frame: upstream stores the mapping as int64, our Triton kernel needs int32, and the wrapper re-converts on every call since the dtype check always fails. Cache the converted view keyed by id(source_tensor).

FillFunctor was torch.zeros((1,), ...) for the atomic counter + torch.empty for the three output scratch buffers. Moving to a persistent scratch cache keyed on (num_queries, device) drops 3 torch.empty allocator calls per frame and replaces torch.zeros with an explicit counter.zero_(). This still launches FillFunctor, because there is no safe way to inline the reset into the filter kernel (concurrent blocks would race with the zero), but it eliminates allocator pressure and stabilizes pointer values for the Triton JIT cache.

After W7 the per-frame kernel launch count drops from 2 incidental-torch kernels to 1, the 3 allocator calls are eliminated, and the adapter sees stable-address scratch across frames (latent prerequisite for CUDA-graph capture of the postproc path).

Impact:
- direct_copy: 538 -> 0 (-100%)
- FillFunctor: 538 -> 538 (unchanged; counter.zero_ still required)
- torch.empty calls: 3/frame -> 0
- Parity: 0-diff vs v16 best across 538 frames of vehicles_312px.
- End-to-end FPS: 150 -> 151 (noise-level; serial CPU dispatch is the binding constraint, not mask kernel GPU time).
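
The two caches described above can be sketched with the standard library alone. This is a hedged illustration, not the wrapper's code: `as_int32`, `ScratchCache`, and the buffer names are hypothetical, and `array` objects and lists stand in for torch tensors. One detail worth making explicit is that `id()` values can be reused once the source object is garbage collected, so this sketch stores the source alongside the converted copy and re-checks identity on every hit.

```python
from array import array
from typing import Dict, List, Tuple

# id(source) -> (source, converted). Holding the source keeps it alive,
# so its id cannot be recycled for a different object while cached.
_converted: Dict[int, Tuple[array, array]] = {}


def as_int32(mapping: array) -> array:
    """Return a cached 32-bit copy of a 64-bit class mapping."""
    hit = _converted.get(id(mapping))
    if hit is not None and hit[0] is mapping:
        return hit[1]
    converted = array("i", list(mapping))  # the one-time conversion
    _converted[id(mapping)] = (mapping, converted)
    return converted


class ScratchCache:
    """Persistent scratch buffers keyed on (num_queries, device)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, str], Dict[str, List]] = {}

    def get(self, num_queries: int, device: str) -> Dict[str, List]:
        key = (num_queries, device)
        entry = self._entries.get(key)
        if entry is None:
            # Allocated once per key instead of once per frame.
            entry = {
                "counter": [0],
                "boxes": [None] * num_queries,
                "confidences": [0.0] * num_queries,
                "class_ids": [0] * num_queries,
            }
            self._entries[key] = entry
        entry["counter"][0] = 0  # stand-in for counter.zero_() each frame
        return entry


mapping = array("q", [3, 1, 2])
first = as_int32(mapping)
second = as_int32(mapping)

cache = ScratchCache()
frame_a = cache.get(300, "cuda:0")
frame_a["counter"][0] = 7  # pretend the kernel counted 7 survivors
frame_b = cache.get(300, "cuda:0")
```

Because the same buffer objects come back on every frame, their addresses stay stable, which is the property the commit message calls out as a prerequisite for later CUDA-graph capture.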

…path

Stacks on top of PR#22 (optimize-rfdetr-seg: Triton fusion + CUDA graphs + scratch caching). See PR#28 for the same change against main.

`InferenceModelsInstanceSegmentationAdapter.postprocess` built a full pydantic tree per frame: `Point × V` per polygon vertex, `InstanceSegmentationPrediction × N`, then `InstanceSegmentationInferenceResponse`. The workflow block then called `response.model_dump(by_alias=True, exclude_none=True)` to get a plain dict for `sv.Detections.from_inference`. Neither validation nor the serializer is needed on that path; the block only consumes the dict.

This change adds slotted dataclass twins (`PointDC`, `InferenceResponseImageDC`, `InstanceSegmentationPredictionDC`, `InstanceSegmentationInferenceResponseDC`) plus `_is_pred_dc_to_dict` and `_is_response_dc_to_dict` helpers that emit the exact dict `model_dump(by_alias=True, exclude_none=True)` produces (same keys, same `class` alias, same None-omission).

The adapter gates on `kwargs.get("source") == "workflow-execution"` and returns the dataclass response on that path. Every other caller (HTTP `response_model` at `http_api.py:1640`, `isinstance`-based cache dispatch at `cache/serializers.py:71`, `draw_predictions` visualization) keeps the pydantic path untouched.

The v3 workflow block detects the dataclass via `isinstance` and calls `_is_response_dc_to_dict`; falls back to `model_dump` for any other response type.

Microbench (4 dets × 6-vertex polygon, construct + dump):
* pydantic: ~81 us/frame
* dataclass: ~34 us/frame (2.43x faster)

End-to-end (rfdetr-seg-nano TRT + Triton preproc + Triton fullpost + CUDA graphs, vehicles_312px.mp4, 538 frames, 4 runs each, on top of optimize-rfdetr-seg HEAD c1406a8):
* baseline (pydantic): 152.93 FPS mean
* dataclass: 156.54 FPS mean (+3.6 FPS, +2.4%)

Bit-exact parity verified: `_is_response_dc_to_dict(dc)` byte-equals `pyd.model_dump(by_alias=True, exclude_none=True)` for mixed inputs (varying polygon lengths, empty list, mutation of .time/.inference_id post-construct).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude and others added 8 commits April 28, 2026 22:33

initial point

419c5a7

replace benchmark script with minimal InferencePipeline-based version

807ea1a

cleaning up

b14044d

cleaning up

7eefbcc

This was referenced May 3, 2026

perf(rfdetr-seg): add torch.compile vs Triton fullpost ablation benchmark #32

Open

test(rfdetr): expand Triton preprocess parity and validation coverage #33

Open

aseembits93 wants to merge 8 commits into main from perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copy

2 participants
