perf(trt): zero-copy CUDA graph replay + cross-stream event handshake by aseembits93 · Pull Request #35 · aseembits93/inference

aseembits93 · 2026-05-04T05:54:56Z

Summary

Drops the per-replay DtoD copy and output clone on the TRT CUDA-graph replay path, and replaces the trailing stream.synchronize() in infer_from_trt_engine with a _trt_produce_event recorded on the graph's own capture stream (or the caller's stream when no graph ran).

New opt-in marker _trt_reuse_as_input_buffer on the caller's input tensor: when set, the graph is captured against that buffer and subsequent replays skip the copy into the graph's internal input buffer.
Output buffers are returned directly (no buf.clone()); consumers that read directly can record into consumer_done_event so the next replay waits on the prior consumer on the graph's stream.
Behavior change for existing callers: the function no longer CPU-syncs before returning. Callers that immediately .cpu()/.item() still work (those insert their own sync); cross-stream consumers should wait on results[0]._trt_produce_event.
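
The caller-side contract described above can be sketched roughly as follows. This is an illustration, not the PR's code. It assumes the tensors are torch tensors, that `_trt_produce_event` and `consumer_done_event` behave like `torch.cuda.Event`, and that the consumer stream is a `torch.cuda.Stream`. The helper names (`mark_input_reusable`, `wait_for_trt_outputs`, `signal_consumer_done`) are hypothetical, as is using `True` as the marker value. How `consumer_done_event` is handed to `infer_from_trt_engine` is not stated in this description, so it is left out.

```python
"""Caller-side sketch of the zero-copy / event-handshake contract.

Objects are duck-typed: only Stream.wait_event(event) and Event.record(stream)
are used, which match the torch.cuda.Stream / torch.cuda.Event methods.
"""


def mark_input_reusable(input_tensor):
    # Opt in: ask the TRT path to capture the CUDA graph against this buffer.
    # Assumption: a truthy attribute is enough; the PR only says "when set".
    input_tensor._trt_reuse_as_input_buffer = True
    return input_tensor


def wait_for_trt_outputs(results, consumer_stream):
    # The function no longer CPU-syncs, so a consumer on another stream must
    # order itself after the producer. wait_event() queues the wait on the
    # device; the host thread is not blocked.
    produce_event = getattr(results[0], "_trt_produce_event", None)
    if produce_event is not None:
        consumer_stream.wait_event(produce_event)
    return results


def signal_consumer_done(consumer_done_event, consumer_stream):
    # Outputs are returned without clone(), so a later replay may overwrite
    # them. Recording here marks the point after which that is safe.
    consumer_done_event.record(consumer_stream)
```

Callers that go straight to .cpu() or .item() need none of this, since those calls insert their own sync.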

End-to-end benchmark

rfdetr-seg-nano TRT, Tesla T4, FP16 engine, vehicles_312px.mp4, 538 frames, 4 post-warmup runs per config. Isolating this change only (no other opt-in perf paths):

CUDA graphs	mean FPS
off	114.93
on	119.49

Δ +4.56 FPS (+4.0%)
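
The delta above follows directly from the two means. The short check below recomputes it and also converts FPS to per-frame wall time; the per-frame figures are derived here for illustration and are not measurements reported in this PR.

```python
fps_off = 114.93  # mean FPS, CUDA graphs off (from the table above)
fps_on = 119.49   # mean FPS, CUDA graphs on

delta_fps = fps_on - fps_off
relative_gain = delta_fps / fps_off * 100.0

# Per-frame wall time in milliseconds, derived from the mean FPS values.
ms_off = 1000.0 / fps_off
ms_on = 1000.0 / fps_on

print(f"delta = {delta_fps:.2f} FPS ({relative_gain:.1f}%)")
print(f"per-frame: {ms_off:.2f} ms -> {ms_on:.2f} ms (saved {ms_off - ms_on:.2f} ms)")
```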

Parity verified vs graphs-off run: bit-exact xyxy / conf / class_id and mask MD5 per detection.
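
A parity check of this kind could look like the sketch below. The detection layout (a dict with `xyxy`, `conf`, `class_id`, and raw `mask` bytes) and the function names are hypothetical; the PR does not show the actual comparison script. With NumPy masks, `mask.tobytes()` would supply the bytes.

```python
import hashlib


def fingerprint(detection):
    # Exact equality on box, confidence and class; masks are compared via MD5.
    return (
        tuple(detection["xyxy"]),
        detection["conf"],
        detection["class_id"],
        hashlib.md5(detection["mask"]).hexdigest(),
    )


def first_mismatch(reference, candidate):
    """Return the index of the first differing detection, or None if identical."""
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        if fingerprint(ref) != fingerprint(cand):
            return index
    return None
```

Running `first_mismatch` per frame on the graphs-off and graphs-on outputs and expecting `None` everywhere corresponds to the bit-exact claim above.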

How to reproduce

Minimal benchmark script (save as bench_rfdetr_seg.py at repo root):

"""Minimal benchmark: RF-DETR instance segmentation through inference-models,
run via InferencePipeline on a single video source."""
import argparse
import os

_ALL_BACKENDS = {
    "torch", "torch-script", "onnx", "trt",
    "hugging-face", "ultralytics", "mediapipe", "custom",
}


def _select_backend_from_argv() -> str:
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--backend", choices=("trt", "onnx", "torch"), default="trt")
    args, _ = pre.parse_known_args()
    return args.backend


_BACKEND = _select_backend_from_argv()
os.environ.setdefault(
    "ONNXRUNTIME_EXECUTION_PROVIDERS",
    "[TensorrtExecutionProvider,CUDAExecutionProvider,CPUExecutionProvider]",
)
os.environ["DISABLED_INFERENCE_MODELS_BACKENDS"] = ",".join(
    sorted(_ALL_BACKENDS - {_BACKEND})
)

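# Imported only after the environment variables above are set, so the backend
# selection is already in place when inference is loaded.
from time import perf_counter
from inference import InferencePipeline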


def build_workflow(model_id: str, confidence: float) -> dict:
    return {
        "version": "1.0",
        "inputs": [{"type": "WorkflowImage", "name": "image"}],
        "steps": [
            {
                "type": "roboflow_core/roboflow_instance_segmentation_model@v3",
                "name": "segmentation",
                "images": "$inputs.image",
                "model_id": model_id,
                "confidence_mode": "custom",
                "custom_confidence": confidence,
            },
        ],
        "outputs": [
            {
                "type": "JsonField",
                "name": "predictions",
                "selector": "$steps.segmentation.predictions",
            },
        ],
    }


FRAME_COUNT = 0
START_TIME = None
PROGRESS_EVERY = 50


def sink(predictions, _video_frames) -> None:
    global FRAME_COUNT, START_TIME
    del _video_frames
    if not isinstance(predictions, list):
        predictions = [predictions]
    FRAME_COUNT += sum(p is not None for p in predictions)
    if START_TIME is None:
        START_TIME = perf_counter()
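    if FRAME_COUNT and FRAME_COUNT % PROGRESS_EVERY == 0:
        # Guard against a zero elapsed time on coarse timers.
        elapsed = perf_counter() - START_TIME
        fps = FRAME_COUNT / elapsed if elapsed > 0 else 0.0
        print(f"[progress] frames={FRAME_COUNT} fps={fps:.2f}", flush=True)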


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--video_reference", required=True)
    parser.add_argument("--model_id", default="rfdetr-seg-nano")
    parser.add_argument("--confidence", type=float, default=0.4)
    parser.add_argument(
        "--backend", choices=("trt", "onnx", "torch"), default="trt",
    )
    args = parser.parse_args()

    pipeline = InferencePipeline.init_with_workflow(
        video_reference=args.video_reference,
        workflow_specification=build_workflow(args.model_id, args.confidence),
        on_prediction=sink,
    )
    pipeline.start()
    pipeline.join()

    elapsed = perf_counter() - START_TIME if START_TIME else 0.0
    fps = FRAME_COUNT / elapsed if elapsed > 0 else 0.0
    print(f"frames={FRAME_COUNT} elapsed={elapsed:.2f}s fps={fps:.2f}")


if __name__ == "__main__":
    main()

Commands (1 warmup + 4 measured runs per config; take the mean of the 4):

# baseline (CUDA graphs OFF)
for i in 1 2 3 4 5; do
  ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=false \
    python bench_rfdetr_seg.py \
      --video_reference vehicles_312px.mp4 \
      --model_id rfdetr-seg-nano \
      --confidence 0.4 \
      --backend trt
done

# this PR (CUDA graphs ON)
for i in 1 2 3 4 5; do
  ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true \
    python bench_rfdetr_seg.py \
      --video_reference vehicles_312px.mp4 \
      --model_id rfdetr-seg-nano \
      --confidence 0.4 \
      --backend trt
done

Each run prints a final frames=… elapsed=…s fps=… line; drop the first run per config as warmup and average the remaining four.
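
The averaging step can be scripted along these lines. This is a small sketch, not part of the PR: the function name `mean_fps` is hypothetical, and it assumes the stdout of all five runs for one config has been collected into a list of lines.

```python
import re
from statistics import mean

# Matches only the final summary line; "[progress] ..." lines are ignored.
FINAL_LINE = re.compile(r"^frames=(\d+) elapsed=([\d.]+)s fps=([\d.]+)$")


def mean_fps(log_lines, warmup_runs=1):
    """Average the fps of the final summary lines, skipping the warmup runs."""
    fps_values = [
        float(match.group(3))
        for line in log_lines
        if (match := FINAL_LINE.match(line.strip()))
    ]
    measured = fps_values[warmup_runs:]
    if not measured:
        raise ValueError("no measured runs found")
    return mean(measured)
```

Calling `mean_fps(lines)` once per config yields the two numbers that go into the benchmark table.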

Test plan

Run inference_models TRT integration suite on a CUDA host
Sanity-check existing single-stream callers (they should still function via implicit ordering + .cpu() syncs)
Spot-check CUDA-graph cache eviction still releases the captured context

Generated with Claude Code

Drops the per-replay DtoD copy and output clone on the TRT CUDA-graph replay path. Callers opt in by setting ``_trt_reuse_as_input_buffer`` on their preallocated input tensor (graph is captured against that buffer and reused in-place). Output buffers are returned directly; consumers chain on ``_trt_produce_event`` and record into ``consumer_done_event`` so the next replay waits on the prior consumer instead of the host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aseembits93 wants to merge 1 commit into main from perf/trt-cuda-graph-zerocopy
