Perf/optimize rfdetr seg plus is seg dataclasses copy #31
Open
aseembits93 wants to merge 8 commits into
… stream sync reduction
Profiled RF-DETR nano seg TRT e2e workflow with nsys (Tesla T4, FP16 engine,
example_video.mp4 / 431 frames). Baseline 93.07 avg FPS. After the changes
below + enabling the existing CUDA-graph cache:
Baseline (no changes)                              93.07 FPS
+ Triton preprocess (fused resize+BGR2RGB+norm)    ~93 FPS    (U6)
+ U7 mask-decode skip for empty masks              ~94 FPS    (flag-gated)
+ Triton postprocess conf-filter                   98.6 FPS   (+5.9%)
+ ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True     102.1 FPS  (+9.7%)
+ Drop pre/post stream syncs                       102.2 FPS  (+9.8%)
Parity: 4/431 frames differ by ±1 detection vs baseline (Triton bilinear vs
cv2.resize rounding at mask boundaries). Unit tests pass (11/11).
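As a quick sanity check on the percentages in the table above, the relative gains can be recomputed from the reported FPS figures. This is only a sketch; `gain_pct` is a hypothetical helper name and not part of the PR.

```python
BASELINE_FPS = 93.07  # baseline figure reported above


def gain_pct(fps: float, baseline: float = BASELINE_FPS) -> float:
    """Relative throughput gain over the baseline, in percent."""
    return (fps / baseline - 1.0) * 100.0


# FPS values copied from the table above.
steps = {
    "Triton postprocess conf-filter": 98.6,
    "CUDA graphs enabled": 102.1,
    "pre/post stream syncs dropped": 102.2,
}
for name, fps in steps.items():
    print(f"{name}: {fps} FPS ({gain_pct(fps):+.1f}%)")
```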
Changes (all flag-gated; opt-in unless noted):
inference_models/models/rfdetr/triton_preprocess.py (new)
One Triton kernel fusing stretch-to resize + BGR->RGB + /255 + ImageNet
normalize for the RF-DETR seg preprocess path. Replaces ~8 torch CUDA
kernels with 1. Enabled via RFDETR_USE_TRITON_PREPROC=true.
inference_models/models/rfdetr/triton_postprocess.py (new)
One Triton kernel fusing sigmoid + argmax-over-classes + class-remap +
confidence-threshold filter. Replaces ~14k small cub/torch kernels with
431 (1 per frame). Supports both per-class threshold vector and scalar,
with optional class remapping table. Enabled via RFDETR_TRITON_POSTPROC=true.
inference_models/models/rfdetr/rfdetr_instance_segmentation_trt.py
- Wire the Triton preprocess fast-path in pre_process() with a guarded
dispatch (STRETCH_TO mode, numpy HWC BGR uint8 input, no static crop).
- Cache pre-allocated input buffer and normalization constants on model
instance on first call.
- Replace pre_process_stream.synchronize() with a CUDA event ev.wait()
on the inference stream so the CPU doesn't stall waiting for the
preprocessing Triton kernel to finish.
- Drop the post_process_stream.synchronize() (the adapter's subsequent
.cpu() calls provide the implicit sync).
inference_models/models/rfdetr/common.py
Wire the Triton postprocess conf-filter into
post_process_instance_segmentation_results. Falls back to torch path
when the model has no remapping table, is CPU-bound, or Triton is
unavailable.
inference/models/rfdetr/rfdetr.py + triton_preprocess.py (new, legacy path)
Same Triton preprocess kernel + dispatch for the legacy inference
package's RF-DETR class. Dormant on this platform (USE_INFERENCE_MODELS
default routes to inference_models adapters) but kept for parity so the
legacy path benefits if exercised.
inference/core/models/inference_models_adapters.py
GPU mask-decode fast-path (U7): compute per-mask emptiness with
.any(dim=(1,2)) on GPU, then run the DtoH copy and cv2.findContours only
for non-empty masks. Gated via RFDETR_GPU_POSTPROCESS=true (default on).
Produces identical output to the reference path.
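The U7 fast-path above boils down to: compute one emptiness flag per mask, then pay for the transfer and contour extraction only where the flag is set. Below is a minimal CPU-only sketch of that control flow, under stated assumptions: nested lists stand in for the GPU mask tensor, the `decode` callback is a hypothetical stand-in for the DtoH copy plus cv2.findContours, and returning None for empty masks is an assumption rather than what the adapter necessarily emits.

```python
from typing import Callable, List, Optional

Mask = List[List[int]]  # H x W grid of 0/1 values (stand-in for one GPU mask)


def decode_non_empty(
    masks: List[Mask], decode: Callable[[Mask], object]
) -> List[Optional[object]]:
    # One boolean per mask; on GPU this would be masks.any(dim=(1, 2)).
    non_empty = [any(any(row) for row in mask) for mask in masks]
    # Only non-empty masks reach the expensive decode step.
    return [decode(mask) if keep else None for mask, keep in zip(masks, non_empty)]


calls = []  # records which masks were actually decoded


def fake_decode(mask: Mask) -> int:
    # Hypothetical stand-in for DtoH + cv2.findContours: count foreground pixels.
    calls.append(mask)
    return sum(sum(row) for row in mask)


masks = [[[0, 0], [0, 0]], [[0, 1], [1, 1]], [[0, 0], [0, 0]]]
results = decode_non_empty(masks, fake_decode)
print(results, len(calls))
```

In this toy run only the middle mask is decoded, so the decode cost scales with the number of non-empty masks rather than with the number of queries.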
Env vars introduced:
RFDETR_USE_TRITON_PREPROC=true                  opt-in; fused preproc kernel
RFDETR_TRITON_POSTPROC=true                     opt-in; fused postproc conf filter
RFDETR_GPU_POSTPROCESS=true                     default on; GPU mask emptiness skip
RFDETR_DISABLE_GPU_PREPROC=true                 opt-out; disable torch GPU preproc
ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=True    enables existing TRT CUDA graph cache
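For illustration, the opt-in / default-on semantics of the flags above could be read along these lines. This is a sketch, not the repository's actual parsing code: `env_flag`, the accepted truthy spellings, and the default of False for ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND are all assumptions.

```python
import os
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}  # assumed set of accepted spellings


def env_flag(name: str, default: bool, environ: Mapping[str, str] = os.environ) -> bool:
    """Return the boolean value of an env var, or `default` when it is unset."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


# Defaults follow the list above: opt-in flags default to False,
# RFDETR_GPU_POSTPROCESS defaults to True.
DEFAULTS = {
    "RFDETR_USE_TRITON_PREPROC": False,
    "RFDETR_TRITON_POSTPROC": False,
    "RFDETR_GPU_POSTPROCESS": True,
    "RFDETR_DISABLE_GPU_PREPROC": False,
    "ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND": False,  # assumed default
}

flags = {name: env_flag(name, default) for name, default in DEFAULTS.items()}
print(flags)
```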
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (W2)
Adds triton_fullpostproc.py with two fused Triton kernels that replace the
entire post-TRT chain for the common rfdetr-seg-nano path (batch=1, no
static crop, stretch-to resize, class remapping active):
_rfdetr_fullpost_filter_kernel (grid = num_queries)
sigmoid + argmax + class remap + conf threshold + cxcywh->xyxy +
letterbox-denormalize + clip + round; atomic_add into counter to reserve
a compact output slot.
_rfdetr_fullpost_mask_kernel_compact (grid = num_queries * tile_y * tile_x)
GPU-side bilinear upsample 78x78 -> orig_h x orig_w + threshold > 0 +
uint8 emit. Early-exits on s >= counter[0] without an intermediate sync.
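To make the per-query work of the filter kernel concrete, here is a serial pure-Python reference of the same steps. It is a sketch under assumptions, not the Triton kernel: boxes are assumed to be normalized to [0, 1] and scaled straight to the original size, the threshold is a single scalar compared with `<`, Python's `round` stands in for the kernel's rounding, and `list.append` plays the role of the atomic_add slot reservation (on the GPU the survivor order would not be deterministic). The function name `filter_queries` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Row = Tuple[int, int, int, int, float, int]  # x1, y1, x2, y2, conf, class_id


def filter_queries(
    logits: Sequence[Sequence[float]],
    boxes_cxcywh: Sequence[Tuple[float, float, float, float]],
    class_remap: Sequence[int],
    threshold: float,
    img_w: int,
    img_h: int,
) -> List[Row]:
    survivors: List[Row] = []
    for q_logits, (cx, cy, w, h) in zip(logits, boxes_cxcywh):
        # argmax over raw logits; sigmoid is monotonic, so the winner is the same.
        cls = max(range(len(q_logits)), key=lambda c: q_logits[c])
        conf = 1.0 / (1.0 + math.exp(-q_logits[cls]))
        if conf < threshold:
            continue  # query filtered out, no output slot reserved
        # cxcywh (assumed normalized) -> xyxy in original-image pixels.
        corners = (
            (cx - w / 2) * img_w,
            (cy - h / 2) * img_h,
            (cx + w / 2) * img_w,
            (cy + h / 2) * img_h,
        )
        limits = (img_w, img_h, img_w, img_h)
        # Clip to the image bounds, then round to integer pixels.
        x1, y1, x2, y2 = (
            round(min(max(v, 0.0), float(lim))) for v, lim in zip(corners, limits)
        )
        # append() stands in for atomic_add reserving a compact output slot.
        survivors.append((x1, y1, x2, y2, conf, class_remap[cls]))
    return survivors


rows = filter_queries(
    logits=[[-2.0, 3.0], [-5.0, -4.0]],
    boxes_cxcywh=[(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1)],
    class_remap=[7, 9],
    threshold=0.5,
    img_w=100,
    img_h=200,
)
print(rows)
```

A reference like this is mainly useful for parity testing: the fused kernel's compacted output, once sorted, should match the serial result row for row.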
Adapter (inference_models_adapters.py):
- New fast path keyed on _combined_gpu/_counter_gpu/_postproc_done_event
side-channels. The adapter stream waits on the done_event, copies the
4-byte counter DtoH into pinned memory, syncs once to read n_survivors,
then slices combined and mask to n_survivors, issues both pinned DtoH
copies asynchronously, and syncs again.
- Replaces the prior in-Triton int(counter.item()) that CPU-blocked the
postproc stream. Same number of host-visible syncs (2), but the first
is a 4-byte DtoH instead of a stream drain, and both are on a dedicated
pinned path so the copy engine overlaps with the compute engine.
TRT graph plumbing (common/trt.py, rfdetr_instance_segmentation_trt.py):
- Records a produce_event on the graph's own stream so consumers can
wait_event instead of stream.synchronize(). Removes the unconditional
stream.synchronize() in infer_from_trt_engine's graph-replay branch.
- consumer_done_event field on TRTCudaGraphState lets the next graph
replay chain on the consumer's last use of the output buffers.
- _trt_reuse_as_input_buffer marker so fast preproc can write directly
into the graph's captured input buffer, eliminating the per-frame DtoD.
Results on vehicles_312px.mp4 (538 frames, Tesla T4, FP16 engine):
v16 baseline (Triton preproc + postproc + CUDA graph)      150 FPS
+ triton_fullpost + deferred counter sync (this commit)    151 FPS
Parity vs v16 baseline: 0-diff across all 538 frames (bit-exact xyxy,
conf, class_id, and mask MD5 per detection).
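A parity check of this kind can be sketched as follows: fingerprint every detection by its box, confidence, class id, and the MD5 of its mask bytes, then count frames whose fingerprints differ. The record layout and helper names are hypothetical, and sorting the fingerprints (so that detection order within a frame does not matter) is an assumption about how the comparison was done.

```python
import hashlib
from typing import List, Sequence, Tuple

# xyxy box, confidence, class_id, raw mask bytes (hypothetical record layout)
Detection = Tuple[Tuple[int, int, int, int], float, int, bytes]


def fingerprint(det: Detection) -> Tuple[Tuple[int, ...], float, int, str]:
    xyxy, conf, class_id, mask_bytes = det
    # Hash the mask so frames can be compared without storing full masks.
    return (tuple(xyxy), conf, class_id, hashlib.md5(mask_bytes).hexdigest())


def count_differing_frames(
    baseline: Sequence[List[Detection]], candidate: Sequence[List[Detection]]
) -> int:
    if len(baseline) != len(candidate):
        raise ValueError("runs cover a different number of frames")
    diffs = 0
    for base_dets, cand_dets in zip(baseline, candidate):
        # Exact comparison: any change in box, conf, class, or mask counts.
        if sorted(map(fingerprint, base_dets)) != sorted(map(fingerprint, cand_dets)):
            diffs += 1
    return diffs


frame = [((1, 2, 3, 4), 0.9, 0, b"\x00\x01")]
baseline_run = [frame, []]
same_run = [list(frame), []]
changed_run = [[((1, 2, 3, 4), 0.9, 0, b"\x01\x01")], []]
print(count_differing_frames(baseline_run, same_run))
print(count_differing_frames(baseline_run, changed_run))
```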
Env flags:
RFDETR_TRITON_FULLPOSTPROC=true opt-in; enables the full-fusion path
Two per-frame CUDA kernel launches visible in nsys on the v16 full-postproc
path that shouldn't be there:
- direct_copy_kernel_cuda (538 per 538-frame run on vehicles_312px)
- vectorized_elementwise_kernel<FillFunctor<int>> (538 / 538)
direct_copy was class_mapping.to(dtype=torch.int32) firing every frame —
upstream stores the mapping as int64, our Triton kernel needs int32, and
the wrapper re-converts on every call since the dtype check always fails.
Cache the converted view keyed by id(source_tensor).
FillFunctor was torch.zeros((1,), ...) for the atomic counter + torch.empty
for the three output scratch buffers. Moving to a persistent scratch cache
keyed on (num_queries, device) drops 3 torch.empty allocator calls per
frame and replaces torch.zeros with an explicit counter.zero_() (still
launches FillFunctor — no safe way to inline into the filter kernel since
concurrent blocks would race with the zero — but eliminates allocator
pressure and stabilizes pointer values for the Triton JIT cache).
After W7 the per-frame kernel launch count drops from 2 incidental-torch
kernels to 1, the 3 allocator calls are eliminated, and the adapter sees
stable-address scratch across frames (latent prerequisite for CUDA-graph
capture of the postproc path).
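The two caches described above can be illustrated without torch. In this sketch plain Python objects stand in for tensors, and the class names, the buffer layout, and the `allocate_scratch` helper are hypothetical. One design point worth noting: a cache keyed by `id()` alone can return a stale entry if the source object is freed and its id is reused, so the sketch also keeps a reference to the source and checks identity.

```python
from typing import Any, Callable, Dict, Tuple


class ConvertedViewCache:
    """Cache a converted copy of a source object, keyed by id(source)."""

    def __init__(self, convert: Callable[[Any], Any]) -> None:
        self._convert = convert
        self._entries: Dict[int, Tuple[Any, Any]] = {}
        self.conversions = 0  # how many times the conversion actually ran

    def get(self, source: Any) -> Any:
        entry = self._entries.get(id(source))
        # Holding `source` in the entry keeps it alive and guards against id reuse.
        if entry is None or entry[0] is not source:
            self.conversions += 1
            entry = (source, self._convert(source))
            self._entries[id(source)] = entry
        return entry[1]


class ScratchCache:
    """Persistent scratch buffers keyed on (num_queries, device)."""

    def __init__(self, allocate: Callable[[int], Dict[str, list]]) -> None:
        self._allocate = allocate
        self._buffers: Dict[Tuple[int, str], Dict[str, list]] = {}

    def get(self, num_queries: int, device: str) -> Dict[str, list]:
        key = (num_queries, device)
        if key not in self._buffers:
            self._buffers[key] = self._allocate(num_queries)  # first call only
        scratch = self._buffers[key]
        scratch["counter"][0] = 0  # per-frame reset, like counter.zero_()
        return scratch


def allocate_scratch(num_queries: int) -> Dict[str, list]:
    # One counter plus three output buffers (names are placeholders).
    return {"counter": [0], "out": [[None] * num_queries for _ in range(3)]}


mapping = [0, 2, 5]  # stand-in for the int64 class mapping
view_cache = ConvertedViewCache(lambda m: tuple(int(v) for v in m))
first, second = view_cache.get(mapping), view_cache.get(mapping)

scratch_cache = ScratchCache(allocate_scratch)
s1 = scratch_cache.get(300, "cuda:0")
s1["counter"][0] = 4  # pretend a frame produced 4 survivors
s2 = scratch_cache.get(300, "cuda:0")
print(view_cache.conversions, s1 is s2, s2["counter"][0])
```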
Impact:
- direct_copy: 538 -> 0 (-100%)
- FillFunctor: 538 -> 538 (unchanged; counter.zero_ still required)
- torch.empty calls: 3/frame -> 0
- Parity: 0-diff vs v16 best across 538 frames of vehicles_312px.
- End-to-end FPS: 150 -> 151 (noise-level; serial CPU dispatch is the
binding constraint, not mask kernel GPU time).
…path
Stacks on top of PR#22 (optimize-rfdetr-seg: Triton fusion + CUDA graphs
+ scratch caching). See PR#28 for the same change against main.
`InferenceModelsInstanceSegmentationAdapter.postprocess` built a full
pydantic tree per frame — `Point × V` per polygon vertex,
`InstanceSegmentationPrediction × N`, then
`InstanceSegmentationInferenceResponse`. The workflow block then called
`response.model_dump(by_alias=True, exclude_none=True)` to get a plain
dict for `sv.Detections.from_inference`. Neither validation nor the
serializer is needed on that path — the block only consumes the dict.
This change adds slotted dataclass twins (`PointDC`,
`InferenceResponseImageDC`, `InstanceSegmentationPredictionDC`,
`InstanceSegmentationInferenceResponseDC`) plus `_is_pred_dc_to_dict`
and `_is_response_dc_to_dict` helpers that emit the exact dict
`model_dump(by_alias=True, exclude_none=True)` produces (same keys,
same `class` alias, same None-omission).
The adapter gates on `kwargs.get("source") == "workflow-execution"`
and returns the dataclass response on that path. Every other caller —
HTTP `response_model` at `http_api.py:1640`, `isinstance`-based cache
dispatch at `cache/serializers.py:71`, `draw_predictions`
visualization — keeps the pydantic path untouched.
The v3 workflow block detects the dataclass via `isinstance` and calls
`_is_response_dc_to_dict`; falls back to `model_dump` for any other
response type.
Microbench (4 dets × 6-vertex polygon, construct + dump):
* pydantic: ~81 us/frame
* dataclass: ~34 us/frame (2.43x faster)
End-to-end (rfdetr-seg-nano TRT + Triton preproc + Triton fullpost +
CUDA graphs, vehicles_312px.mp4, 538 frames, 4 runs each, on top of
optimize-rfdetr-seg HEAD c1406a8):
* baseline (pydantic): 152.93 FPS mean
* dataclass: 156.54 FPS mean (+3.6 FPS, +2.4%)
Bit-exact parity verified: `_is_response_dc_to_dict(dc)` byte-equals
`pyd.model_dump(by_alias=True, exclude_none=True)` for mixed inputs
(varying polygon lengths, empty list, mutation of .time/.inference_id
post-construct).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 3, 2026