perf(rfdetr-seg): add torch.compile vs Triton fullpost ablation benchmark by aseembits93 · Pull Request #32 · aseembits93/inference

aseembits93 · 2026-05-03T21:25:21Z

Summary

Adds a benchmark harness that answers: can torch.compile match the perf of the Triton fullpost kernel (triton_rfdetr_fullpost / W2) introduced in #31?

Five variants, all producing equivalent outputs on the fullpost-eligible path (batch=1, STRETCH_TO, class remapping, no static crop):

Variant	T4 per-iter	Speedup	Kernel launches/iter
eager	3.85 ms	1.00x	60
compiled (naive)	3.75 ms	1.03x	33
compiled_fixed	3.57 ms	1.08x	14
compiled_hybrid	1.72 ms	2.25x	14
triton fullpost	1.56 ms	2.47x	3.4
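
The Speedup column is the eager time divided by each variant's time. A minimal sketch of that arithmetic, using only the rounded per-iteration timings from the table above (the helper names `speedups` and `relative_gap` are illustrative, not part of the benchmark script). Because the inputs are already rounded to two decimals, a recomputed value can differ from the table in the last digit.

```python
# Per-iteration T4 timings in milliseconds, copied from the table above.
timings_ms = {
    "eager": 3.85,
    "compiled (naive)": 3.75,
    "compiled_fixed": 3.57,
    "compiled_hybrid": 1.72,
    "triton fullpost": 1.56,
}


def speedups(timings, baseline="eager"):
    """Return baseline_time / variant_time for every variant."""
    base = timings[baseline]
    return {name: base / t for name, t in timings.items()}


def relative_gap(timings, slower, faster):
    """Fraction by which `slower` exceeds `faster` in per-iteration time."""
    return timings[slower] / timings[faster] - 1.0


if __name__ == "__main__":
    for name, s in speedups(timings_ms).items():
        print(f"{name:18s} {s:.2f}x")
    gap = relative_gap(timings_ms, "compiled_hybrid", "triton fullpost")
    print(f"hybrid vs triton gap: {gap:.1%}")
```

The same `relative_gap` ratio is what the "within ~10% of Triton" figure in the findings refers to.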

Key findings:

Naive torch.compile is ~1x (1.03x) because Dynamo graph-breaks on every boolean-mask index in the filter chain: the underlying aten.nonzero has a data-dependent output shape. Even with capture_dynamic_output_shape_ops=True, the graph is partitioned into 7 cudagraph regions.
The hybrid variant (compile filter+bbox; gather and upsample the survivors in eager mode) gets within ~10% of Triton. Two tricks make this work: compile only the shape-static prefix, and swap TVF.resize for F.interpolate. TVF defaults to antialiased resizing (kernel upsample_gen2d_aa_out_frame), which is ~2x slower than plain bilinear at this size.
The remaining ~10% gap comes from the atomic-counter over-launch pattern: the filter kernel reserves compact slots via tl.atomic_add, and the mask kernel over-launches for all 300 queries and early-exits by reading the counter on-GPU. This has no torch.compile equivalent.
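
The hybrid split described above can be sketched as follows. This is an illustration under stated assumptions, not the code in the benchmark script: the function names, the input layout (per-query `scores`, `cxcywh` boxes, low-resolution mask logits) and the threshold are all hypothetical. The point it shows is the boundary: everything in the prefix keeps the fixed leading dimension Q, while the data-dependent gather and the plain bilinear `F.interpolate` run after it.

```python
import torch
import torch.nn.functional as F


def filter_and_boxes(scores, boxes_cxcywh, threshold):
    """Shape-static prefix: every output keeps the leading dimension Q."""
    keep = scores > threshold                      # (Q,) bool, no indexing yet
    cx, cy, w, h = boxes_cxcywh.unbind(-1)
    boxes_xyxy = torch.stack(
        (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h), dim=-1
    )                                              # (Q, 4)
    return keep, boxes_xyxy


def hybrid_postprocess(scores, boxes_cxcywh, masks, out_size,
                       threshold=0.5, prefix=filter_and_boxes):
    """Run `prefix` (optionally compiled), then gather and upsample eagerly."""
    keep, boxes_xyxy = prefix(scores, boxes_cxcywh, threshold)
    # Data-dependent part: the survivor count K is only known at runtime.
    idx = torch.nonzero(keep).squeeze(1)           # (K,)
    survivors = masks[idx].unsqueeze(1)            # (K, 1, h, w)
    # Plain bilinear upsample; antialias is left at its default of False.
    upsampled = F.interpolate(survivors, size=out_size, mode="bilinear",
                              align_corners=False)
    return scores[idx], boxes_xyxy[idx], upsampled.squeeze(1) > 0
```

To try the compiled form, one would pass `prefix=torch.compile(filter_and_boxes)`; the default argument keeps the sketch runnable without a compiler toolchain.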

See development/benchmark_scripts/README_rfdetr_postproc_ablation.md for the full writeup.
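
For readers unfamiliar with the atomic-counter over-launch pattern named in the key findings, here is a CPU-only emulation of its control flow in plain Python. It is not the Triton kernel from #31: `AtomicCounter`, `filter_pass` and `mask_pass` are hypothetical names, and the serial loops stand in for parallel GPU programs. Presumably the benefit on the GPU is that the host never needs to read the survivor count to size the second launch.

```python
class AtomicCounter:
    """Stand-in for a one-element device counter updated with atomic adds."""

    def __init__(self):
        self.value = 0

    def atomic_add(self, amount):
        # Like tl.atomic_add, return the value held before the addition.
        old = self.value
        self.value += amount
        return old


def filter_pass(scores, threshold, counter, slots):
    """One 'program' per query: each survivor reserves a compact slot."""
    for query, score in enumerate(scores):
        if score > threshold:
            slot = counter.atomic_add(1)
            slots[slot] = query


def mask_pass(num_queries, counter, slots, process):
    """Over-launch one 'program' per query; those past the counter exit."""
    early_exits = 0
    for pid in range(num_queries):
        if pid >= counter.value:       # counter is read in place, no host sync
            early_exits += 1
            continue
        process(pid, slots[pid])
    return early_exits


if __name__ == "__main__":
    scores = [0.10, 0.95, 0.30, 0.88, 0.05, 0.91]
    counter = AtomicCounter()
    slots = [-1] * len(scores)
    filter_pass(scores, 0.5, counter, slots)
    processed = []
    exits = mask_pass(len(scores), counter, slots,
                      lambda pid, query: processed.append(query))
    print(counter.value, slots, processed, exits)
```

On a real GPU the order in which survivors claim slots is not deterministic, so the compact list is a permutation of the surviving queries rather than a sorted list as in this serial emulation.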

Usage

# all variants, 200 iters / 50 warmup
python development/benchmark_scripts/benchmark_rfdetr_postproc_ablation.py

# parity check
python development/benchmark_scripts/benchmark_rfdetr_postproc_ablation.py --parity-check

# nsys profiling (NVTX ranges per variant)
nsys profile -t cuda,nvtx -o report.qdstrm \
  python development/benchmark_scripts/benchmark_rfdetr_postproc_ablation.py \
  --mode triton --iters 50 --warmup 20 --nsys

Test plan

Parity check: eager and triton each produce 25 survivors with matching conf ranges (0.8594..0.9750)
All five variants run cleanly on T4 (torch 2.10, triton 3.6)
The Triton path degrades gracefully (it is skipped) if triton_fullpostproc is unavailable

Base branch

Targets #31 (perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copy), since the benchmark imports triton_rfdetr_fullpost, which only exists on that branch.

🤖 Generated with Claude Code

…mark

Benchmarks five post-process variants on the fullpost-eligible path to evaluate whether torch.compile can match the Triton fullpost kernel:

- eager: baseline torch ops (what the non-Triton path runs)
- compiled: torch.compile(eager) with dynamic=True
- compiled_fixed: fixed-shape torch.compile (upsample all Q masks)
- compiled_hybrid: compile filter+bbox; gather+upsample survivors eager
- triton: triton_rfdetr_fullpost

On T4 with ~25 survivors / 720x1280: compiled_hybrid reaches 2.25x vs 2.47x for Triton. Naive torch.compile is ~1x due to graph breaks on the boolean-mask indexing chain. Remaining gap comes from the atomic-counter over-launch pattern, which has no torch.compile equivalent.

Includes an NVTX-annotated mode for nsys profiling and a parity check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aseembits93 wants to merge 1 commit into perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copy from perf/rfdetr-postproc-torch-compile-ablation