perf(trt): zero-copy CUDA graph replay + cross-stream event handshake #35

Open
aseembits93 wants to merge 1 commit into main from perf/trt-cuda-graph-zerocopy

aseembits93 (Owner) commented May 4, 2026

Summary

Drops the per-replay DtoD copy and output clone on the TRT CUDA-graph replay path, and replaces the trailing stream.synchronize() in infer_from_trt_engine with a _trt_produce_event recorded on the graph's own capture stream (or the caller's stream when no graph ran).

  • New opt-in marker _trt_reuse_as_input_buffer on the caller's input tensor: when set, the graph is captured against that buffer and subsequent replays skip the copy into the graph's internal input buffer.
  • Output buffers are returned directly (no buf.clone()); consumers that read directly can record into consumer_done_event so the next replay waits on the prior consumer on the graph's stream.
  • Behavior change for existing callers: the function no longer CPU-syncs before returning. Callers that immediately call .cpu()/.item() still work (those calls insert their own sync); cross-stream consumers should wait on results[0]._trt_produce_event.
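
The handshake described in the bullets above can be sketched from the consumer's side. This is a minimal sketch, not the PR's implementation: mark_as_trt_input_buffer and consume_outputs are hypothetical helper names, the marker value True is an assumption (the summary only says "when set"), and how consumer_done_event is handed back to the next replay is not shown in this description. The stream and event arguments are assumed to be torch.cuda.Stream and torch.cuda.Event objects; only their wait_event and record methods are used.

```python
def mark_as_trt_input_buffer(input_buffer):
    """Opt a preallocated input tensor into zero-copy graph capture.

    Assumption: any truthy value counts as "set"; True is used here.
    """
    input_buffer._trt_reuse_as_input_buffer = True
    return input_buffer


def consume_outputs(results, consumer_stream, consumer_done_event, consume):
    """Read TRT outputs on another stream without blocking the host.

    `consume` is a hypothetical callable that enqueues the reader's work.
    """
    produce_event = getattr(results[0], "_trt_produce_event", None)
    if produce_event is not None:
        # GPU-side wait: work queued later on consumer_stream runs only
        # after the producer has finished writing the output buffers.
        consumer_stream.wait_event(produce_event)
    # `consume` must enqueue its kernels on consumer_stream, for example
    # inside `with torch.cuda.stream(consumer_stream):`.
    value = consume(results)
    # Mark the point after which the output buffers may be overwritten.
    consumer_done_event.record(consumer_stream)
    return value
```

Because the outputs are no longer cloned, the reader's work has to be ordered before the next replay overwrites the buffers; the sketch records consumer_done_event right after that work is enqueued so the ordering stays on the GPU.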

End-to-end benchmark

rfdetr-seg-nano TRT, Tesla T4, FP16 engine, vehicles_312px.mp4, 538 frames, 4 post-warmup runs per config. Isolating this change only (no other opt-in perf paths):

CUDA graphs    mean FPS
off            114.93
on             119.49

Δ +4.56 FPS (+4.0%)

Parity verified vs graphs-off run: bit-exact xyxy / conf / class_id and mask MD5 per detection.
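
A parity check of this kind can be sketched as follows. This is an assumption about how such a comparison might look, not the script used for this PR: the (xyxy, conf, class_id, mask_bytes) tuple layout and both function names are hypothetical, and masks are assumed to be available as raw bytes.

```python
import hashlib


def detection_fingerprint(xyxy, conf, class_id, mask_bytes):
    """Reduce one detection to a value that is cheap to compare exactly."""
    mask_md5 = hashlib.md5(mask_bytes).hexdigest()
    return (tuple(xyxy), conf, class_id, mask_md5)


def runs_match(run_a, run_b):
    """True when both runs hold identical detections in the same order."""
    if len(run_a) != len(run_b):
        return False
    return all(
        detection_fingerprint(*a) == detection_fingerprint(*b)
        for a, b in zip(run_a, run_b)
    )
```

Comparing with == rather than a tolerance is deliberate here, since the claim being checked is bit-exact equality.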

How to reproduce

Minimal benchmark script (save as bench_rfdetr_seg.py at repo root):

"""Minimal benchmark: RF-DETR instance segmentation through inference-models,
run via InferencePipeline on a single video source."""
import argparse
import os

_ALL_BACKENDS = {
    "torch", "torch-script", "onnx", "trt",
    "hugging-face", "ultralytics", "mediapipe", "custom",
}


def _select_backend_from_argv() -> str:
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--backend", choices=("trt", "onnx", "torch"), default="trt")
    args, _ = pre.parse_known_args()
    return args.backend


_BACKEND = _select_backend_from_argv()
os.environ.setdefault(
    "ONNXRUNTIME_EXECUTION_PROVIDERS",
    "[TensorrtExecutionProvider,CUDAExecutionProvider,CPUExecutionProvider]",
)
os.environ["DISABLED_INFERENCE_MODELS_BACKENDS"] = ",".join(
    sorted(_ALL_BACKENDS - {_BACKEND})
)

from time import perf_counter
from inference import InferencePipeline


def build_workflow(model_id: str, confidence: float) -> dict:
    return {
        "version": "1.0",
        "inputs": [{"type": "WorkflowImage", "name": "image"}],
        "steps": [
            {
                "type": "roboflow_core/roboflow_instance_segmentation_model@v3",
                "name": "segmentation",
                "images": "$inputs.image",
                "model_id": model_id,
                "confidence_mode": "custom",
                "custom_confidence": confidence,
            },
        ],
        "outputs": [
            {
                "type": "JsonField",
                "name": "predictions",
                "selector": "$steps.segmentation.predictions",
            },
        ],
    }


FRAME_COUNT = 0
START_TIME = None
PROGRESS_EVERY = 50


def sink(predictions, _video_frames) -> None:
    global FRAME_COUNT, START_TIME
    del _video_frames
    if not isinstance(predictions, list):
        predictions = [predictions]
    FRAME_COUNT += sum(p is not None for p in predictions)
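    if START_TIME is None:
        START_TIME = perf_counter()
    elapsed = perf_counter() - START_TIME
    # Report only once frames were counted and time has elapsed, so the
    # FPS division cannot hit zero on the very first callback.
    if FRAME_COUNT and FRAME_COUNT % PROGRESS_EVERY == 0 and elapsed > 0:
        fps = FRAME_COUNT / elapsed
        print(f"[progress] frames={FRAME_COUNT} fps={fps:.2f}", flush=True)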


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--video_reference", required=True)
    parser.add_argument("--model_id", default="rfdetr-seg-nano")
    parser.add_argument("--confidence", type=float, default=0.4)
    parser.add_argument(
        "--backend", choices=("trt", "onnx", "torch"), default="trt",
    )
    args = parser.parse_args()

    pipeline = InferencePipeline.init_with_workflow(
        video_reference=args.video_reference,
        workflow_specification=build_workflow(args.model_id, args.confidence),
        on_prediction=sink,
    )
    pipeline.start()
    pipeline.join()

    elapsed = perf_counter() - START_TIME if START_TIME else 0.0
    fps = FRAME_COUNT / elapsed if elapsed > 0 else 0.0
    print(f"frames={FRAME_COUNT} elapsed={elapsed:.2f}s fps={fps:.2f}")


if __name__ == "__main__":
    main()

Commands (1 warmup + 4 measured runs per config; take the mean of the 4):

# baseline (CUDA graphs OFF)
for i in 1 2 3 4 5; do
  ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=false \
    python bench_rfdetr_seg.py \
      --video_reference vehicles_312px.mp4 \
      --model_id rfdetr-seg-nano \
      --confidence 0.4 \
      --backend trt
done

# this PR (CUDA graphs ON)
for i in 1 2 3 4 5; do
  ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true \
    python bench_rfdetr_seg.py \
      --video_reference vehicles_312px.mp4 \
      --model_id rfdetr-seg-nano \
      --confidence 0.4 \
      --backend trt
done

Each run prints a final frames=… elapsed=…s fps=… line; drop the first run per config as warmup and average the remaining four.
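
The averaging step can be sketched in a few lines. This is a minimal sketch, assuming the output of one loop was saved to a text file and read into a string; measured_fps and speedup are hypothetical helper names, and the pattern matches only the final summary line that the script prints, not the [progress] lines.

```python
import re
from statistics import mean

# Matches the final line printed by bench_rfdetr_seg.py, e.g.
# "frames=<n> elapsed=<seconds>s fps=<value>".
_FINAL_LINE = re.compile(r"^frames=(\d+) elapsed=([\d.]+)s fps=([\d.]+)$")


def measured_fps(log_text, warmup_runs=1):
    """Return the FPS of each run, with the leading warmup runs dropped."""
    fps_values = []
    for line in log_text.splitlines():
        match = _FINAL_LINE.match(line.strip())
        if match:
            fps_values.append(float(match.group(3)))
    return fps_values[warmup_runs:]


def speedup(baseline_fps, candidate_fps):
    """Return (absolute delta in FPS, relative delta in percent)."""
    delta = candidate_fps - baseline_fps
    return delta, 100.0 * delta / baseline_fps


# Applied to the means reported in the benchmark table above:
delta, percent = speedup(114.93, 119.49)
print(round(delta, 2), round(percent, 1))  # 4.56 4.0
```

For a saved log, mean(measured_fps(log_text)) gives the per-config mean of the four measured runs.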

Test plan

  • Run inference_models TRT integration suite on a CUDA host
  • Sanity-check existing single-stream callers (they should still function via implicit ordering + .cpu() syncs)
  • Spot-check CUDA-graph cache eviction still releases the captured context

Generated with Claude Code

Drops the per-replay DtoD copy and output clone on the TRT CUDA-graph
replay path. Callers opt in by setting ``_trt_reuse_as_input_buffer`` on
their preallocated input tensor (graph is captured against that buffer
and reused in-place). Output buffers are returned directly; consumers
chain on ``_trt_produce_event`` and record into ``consumer_done_event``
so the next replay waits on the prior consumer instead of the host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>