perf(rfdetr-seg): add torch.compile vs Triton fullpost ablation benchmark #32

Open
aseembits93 wants to merge 1 commit into perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copy from perf/rfdetr-postproc-torch-compile-ablation

@aseembits93

Owner

Summary

Adds a benchmark harness that answers: can torch.compile match the perf of the Triton fullpost kernel (triton_rfdetr_fullpost / W2) introduced in #31?

Five variants, all producing equivalent outputs on the fullpost-eligible path (batch=1, STRETCH_TO, class remapping, no static crop):

| Variant | T4 per-iter | Speedup | Kernel launches/iter |
|---|---|---|---|
| eager | 3.85 ms | 1.00x | 60 |
| compiled (naive) | 3.75 ms | 1.03x | 33 |
| compiled_fixed | 3.57 ms | 1.08x | 14 |
| compiled_hybrid | 1.72 ms | 2.25x | 14 |
| triton fullpost | 1.56 ms | 2.47x | 3.4 |
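
As a quick sanity check on the table, the speedup column can be recomputed from the per-iteration timings. This is only a sketch using the rounded values shown above; recomputing from rounded timings gives about 2.24x for compiled_hybrid rather than 2.25x, presumably because the reported figure was derived from unrounded measurements.

```python
# Per-iteration timings in milliseconds, copied from the table above.
timings_ms = {
    "eager": 3.85,
    "compiled (naive)": 3.75,
    "compiled_fixed": 3.57,
    "compiled_hybrid": 1.72,
    "triton fullpost": 1.56,
}

# Speedup is defined relative to the eager baseline.
baseline = timings_ms["eager"]
speedups = {name: baseline / t for name, t in timings_ms.items()}

# Gap between the hybrid variant and Triton, relative to Triton's time.
hybrid_gap = timings_ms["compiled_hybrid"] / timings_ms["triton fullpost"] - 1.0

for name, s in speedups.items():
    print(f"{name:18s} {s:.2f}x")
print(f"hybrid vs triton gap: {hybrid_gap:.1%}")
```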

Key findings:

  • Naive torch.compile is ~1x (1.03x) because Dynamo graph-breaks on every boolean-mask index in the filter chain (aten.nonzero has a data-dependent output shape). Even with capture_dynamic_output_shape_ops=True, the graph is partitioned into 7 cudagraph regions.
  • The hybrid variant (compile filter+bbox; gather and upsample survivors in eager mode) gets within ~10% of Triton (1.72 ms vs 1.56 ms). Two tricks: compile only the shape-static prefix, and swap TVF.resize for F.interpolate. TVF.resize defaults to antialiased resampling (upsample_gen2d_aa_out_frame), which is ~2x slower than plain bilinear at this size.
  • The remaining ~10% gap comes from the Triton kernel's atomic-counter over-launch pattern: the filter kernel reserves compact slots via tl.atomic_add, and the mask kernel over-launches for all 300 queries and early-exits by reading the counter on-GPU. torch.compile has no equivalent.
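
The atomic-counter over-launch pattern in the last finding can be illustrated with a sequential, pure-Python emulation. This is a sketch of the idea only, not the Triton kernel: `filter_phase`, `mask_phase`, the score threshold, and the toy scores are all hypothetical names and values chosen for illustration, and a plain integer stands in for the device-side counter that the real kernel updates with tl.atomic_add.

```python
NUM_QUERIES = 300  # query count mentioned in the PR description


def filter_phase(scores, threshold):
    """Emulate the filter kernel: each passing query reserves a compact slot."""
    counter = 0                 # stands in for a device-side atomic counter
    slots = [-1] * len(scores)  # compact slot index -> source query index
    for q, score in enumerate(scores):
        if score > threshold:
            slot = counter      # the old value an atomic add would return
            counter += 1
            slots[slot] = q     # on a GPU the slot order is not deterministic
    return counter, slots


def mask_phase(counter, slots, process):
    """Emulate the mask kernel: one program per query, early exit past the counter."""
    out = []
    for pid in range(len(slots)):  # over-launch: the grid covers every query
        if pid >= counter:         # counter is read on-device, so no host sync
            continue               # early exit for programs with no survivor
        out.append(process(slots[pid]))
    return out


# Toy input (hypothetical): every 12th query passes, giving 25 survivors.
scores = [0.9 if q % 12 == 0 else 0.1 for q in range(NUM_QUERIES)]
counter, slots = filter_phase(scores, threshold=0.5)
survivors = mask_phase(counter, slots, process=lambda q: q)
```

The point of the pattern is that the host never needs to know the survivor count before launching the second kernel, which is what a data-dependent boolean-mask index would otherwise force.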

See development/benchmark_scripts/README_rfdetr_postproc_ablation.md for the full writeup.

Usage

# all variants, 200 iters / 50 warmup
python development/benchmark_scripts/benchmark_rfdetr_postproc_ablation.py

# parity check
python development/benchmark_scripts/benchmark_rfdetr_postproc_ablation.py --parity-check

# nsys profiling (NVTX ranges per variant)
nsys profile -t cuda,nvtx -o report.qdstrm \
  python development/benchmark_scripts/benchmark_rfdetr_postproc_ablation.py \
  --mode triton --iters 50 --warmup 20 --nsys
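
The iteration and warmup counts above suggest a timing loop along the following lines. This is a minimal sketch, not the script's actual implementation: `benchmark` and its `sync` parameter are hypothetical. For GPU work one would typically pass `sync=torch.cuda.synchronize` so that queued kernels finish before the clock is read.

```python
import time


def benchmark(fn, iters=200, warmup=50, sync=None):
    """Return the mean per-iteration wall time of fn in milliseconds."""
    # Warmup iterations absorb one-time costs (compilation, autotuning, caches).
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain pending asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # make sure all timed work has actually completed
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iters


# CPU-only usage example with a trivial workload.
calls = []
per_iter_ms = benchmark(lambda: calls.append(1), iters=5, warmup=2)
```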

Test plan

  • Parity check: eager / triton produce 25 survivors each with matching conf ranges (0.8594..0.9750)
  • All five variants run clean on T4 (torch 2.10, triton 3.6)
  • Triton path degrades gracefully (the variant is skipped) if triton_fullpostproc is unavailable
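
The parity criterion in the first item (same survivor count, matching confidence range) could be expressed roughly as below. The function name `parity_ok`, the tolerance, and the sample data are assumptions for illustration; the PR does not show the exact checks that --parity-check performs.

```python
def parity_ok(conf_a, conf_b, tol=1e-3):
    """Compare two variants by survivor count and confidence range."""
    if len(conf_a) != len(conf_b):
        return False  # different number of survivors
    if not conf_a:
        return True   # both empty: trivially equivalent
    return (
        abs(min(conf_a) - min(conf_b)) <= tol
        and abs(max(conf_a) - max(conf_b)) <= tol
    )


# Hypothetical data: 25 confidences spanning the range quoted above.
lo, hi = 0.8594, 0.9750
eager_conf = [lo + (hi - lo) * i / 24 for i in range(25)]
triton_conf = list(reversed(eager_conf))  # survivor order may differ

same = parity_ok(eager_conf, triton_conf)
different = parity_ok(eager_conf, triton_conf[:-1])
```

Comparing only counts and ranges is order-insensitive, which suits a kernel whose compact slot order is not deterministic; a stricter check would sort both lists and compare element-wise.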

Base branch

Targets #31 (perf/optimize-rfdetr-seg-plus-is-seg-dataclasses-copy) because the benchmark imports triton_rfdetr_fullpost, which only exists on that branch.

🤖 Generated with Claude Code

…mark

Benchmarks five post-process variants on the fullpost-eligible path to
evaluate whether torch.compile can match the Triton fullpost kernel:

  - eager:           baseline torch ops (what the non-Triton path runs)
  - compiled:        torch.compile(eager) with dynamic=True
  - compiled_fixed:  fixed-shape torch.compile (upsample all Q masks)
  - compiled_hybrid: compile filter+bbox; gather+upsample survivors eager
  - triton:          triton_rfdetr_fullpost

On T4 with ~25 survivors / 720x1280: compiled_hybrid reaches 2.25x vs
2.47x for Triton. Naive torch.compile is ~1x due to graph breaks on the
boolean-mask indexing chain. Remaining gap comes from the atomic-counter
over-launch pattern, which has no torch.compile equivalent.

Includes an NVTX-annotated mode for nsys profiling and a parity check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>