perf: Triton pre-processing kernel by aseembits93 · Pull Request #2328 · roboflow/inference

aseembits93 · 2026-05-12T05:52:01Z

What does this PR do?

This PR introduces a single Triton CUDA kernel that performs all image pre-processing in one launch, eliminating kernel-launch overhead and reducing CPU<->GPU memory transfers to the bare minimum. The kernel supports all options except dataset_version_resize (a two-stage resize: cv2 dataset-version resize → PIL stretch) and contrast (three distinct algorithms: histogram equalization, CLAHE, and contrast stretching, each of which needs its own prepass kernel).

Type of Change

New feature (non-breaking change that adds functionality)

Testing

I have tested this change locally
I have added/updated tests for this change

Test details: (Make sure triton is installed in your environment)

Performance gains on TensorRT video input. Run twice with

USE_TRITON_FOR_PREPROCESSING="false" python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4

USE_TRITON_FOR_PREPROCESSING="true" python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4

vehicles_312px.mp4 (538 frames, src 312×176): T4 GPU

| | fps | ms/frame |
|---|---|---|
| PIL reference (env=false) | 76.25 | 13.11 |
| Triton fast path (env=true) | 99.83 | 10.02 |
| Δ | +31% | −3.09 ms |

vehicles_1080p.mp4 (538 frames, src 1920×1080 — preproc has real resize work): T4 GPU

| | fps | elapsed |
|---|---|---|
| PIL reference (env=false) | 14.05 | 38.29 s |
| Triton fast path (env=true) | 21.34 | 25.21 s |
| Δ | +52% | −13.1 s |

Correctness guarantees on COCO val2017. Make sure the coco/ directory is in the current working directory, then run python temp/detection_parity_full.py

| | Triton fast path (env=true) | PIL reference (env=false) |
|---|---|---|
| Triton kernel calls | 5000 / 5000 | 0 |
| Detections | 26,721 | 26,721 |
| Matched at IoU>0.5 | 26,721 (100%) | — |
| Mean box IoU | 1.000000 | — |
| Mean \|Δscore\| | 0.000e+00 | — |
| Class-id disagreements | 0 | — |
| Pixel-identical masks | 26,721 / 26,721 | — |
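For readers unfamiliar with the matching criterion used in the table above, here is a minimal sketch of box IoU for axis-aligned boxes. The function name `box_iou` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions for illustration; the parity script itself is not shown in this PR and may compute the metric differently.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A detection pair counts as "matched" when its IoU exceeds the threshold.
IOU_THRESHOLD = 0.5
matched = box_iou((0, 0, 10, 10), (0, 0, 10, 10)) > IOU_THRESHOLD
```

A mean box IoU of exactly 1.0 across all pairs, as reported above, means every matched pair had identical coordinates.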

Preproc microbench (isolated pre_process()) T4 GPU

USE_TRITON_FOR_PREPROCESSING="false"  python temp/preproc_microbench.py
USE_TRITON_FOR_PREPROCESSING="true" python temp/preproc_microbench.py

| src -> 312² | PIL latency mean / p50 / p95 / p99 | PIL fps mean / p50 / p95 / p99 | Triton latency mean / p50 / p95 / p99 | Triton fps mean / p50 / p95 / p99 | Mean latency speedup |
|---|---|---|---|---|---|
| 312x312 | 1.940 / 1.901 / 2.360 / 2.394 ms | 517.2 / 526.0 / 536.2 / 538.2 fps | 0.322 / 0.312 / 0.368 / 0.683 ms | 3165.3 / 3202.1 / 3447.4 / 3460.8 fps | ~6.0x |
| 720x1280 | 14.040 / 14.027 / 14.667 / 15.149 ms | 71.3 / 71.3 / 74.3 / 75.1 fps | 1.619 / 1.837 / 2.067 / 2.090 ms | 639.5 / 544.3 / 779.3 / 784.7 fps | ~8.7x |
| 1080x1920 | 27.880 / 27.961 / 28.264 / 28.317 ms | 35.9 / 35.8 / 36.6 / 37.0 fps | 2.223 / 2.184 / 2.436 / 2.606 ms | 450.7 / 457.9 / 463.5 / 465.4 fps | ~12.5x |
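The "Mean latency speedup" column is the ratio of the two mean latencies. The short check below reproduces it from the numbers in the microbenchmark table; the dictionary and function names are illustrative only.

```python
# Mean pre_process() latencies in milliseconds (PIL, Triton), copied from the table above.
mean_latency_ms = {
    "312x312": (1.940, 0.322),
    "720x1280": (14.040, 1.619),
    "1080x1920": (27.880, 2.223),
}


def speedup(pil_ms, triton_ms):
    """How many times faster the Triton path is, by mean latency."""
    return pil_ms / triton_ms


speedups = {src: round(speedup(p, t), 1) for src, (p, t) in mean_latency_ms.items()}
for src, s in speedups.items():
    print(f"{src}: ~{s}x")
```

The speedup grows with source resolution, which is consistent with the resize work dominating the PIL path for larger inputs.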

Side-by-Side visual comparison

https://drive.google.com/file/d/1aXWk0hgTsMsfUDwqF7wxqK9YrgMUPH09/view?usp=sharing

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code where necessary, particularly in hard-to-understand areas
My changes generate no new warnings or errors
I have updated the documentation accordingly (if applicable)

Additional Context

Replace the per-frame PIL-bilinear-antialias + to_tensor + normalize chain in the RF-DETR TRT instance-segmentation model with a single Triton kernel that resizes, swaps BGR↔RGB, scales by 1/255, and applies ImageNet normalization, writing straight into the preallocated TRT input buffer.

Byte-exact port of PIL's separable bilinear-antialias resize (PRECISION_BITS=22, int32 fixed-point, uint8 quantization between the horizontal and vertical passes). The horizontal uint8 intermediate lives in registers.

Correctness
- Preproc max abs error vs PIL: 4.77e-7 (fp32 ULP on the final /255+normalize step; the uint8 resize result is byte-identical).
- Full coco/val2017 detection parity (rfdetr-seg-nano, conf=0.4): 26,721 / 26,721 matched at IoU>0.5, mean box IoU 1.0000, |Δscore| 0, 0 class-id disagreements, all matched masks pixel-identical.

Performance (vehicles_312px.mp4, 538 frames)
- Baseline (PIL path): 76.25 fps
- Triton fast path: 99.83 fps (+31%)
- Preproc microbench (1080p → 312²): 27.0 ms → 2.8 ms per frame (~10×)

Scope
- Gated on: single-image numpy uint8 HWC input, stretch/letterbox/center-crop/letterbox-reflect resize modes (all collapse to a single PIL stretch when dataset_version_resize_dimensions is None, verified via synthetic-package test), no static_crop/grayscale/contrast, 3-channel, scaling_factor in {None, 255}, normalization set.
- Falls back to the existing PIL-based pre_process_network_input when any precondition fails.

Also adds the benchmark driver development/stream_interface/rfdetr_nano_seg_trt_workflow.py used to measure the above numbers.
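To make the fixed-point scheme mentioned above concrete, here is a minimal pure-Python sketch of one 1-D pass of an antialiased bilinear resize with 22-bit integer weights and uint8 output. It is modeled on how Pillow's 8-bit resampler is generally understood to work; it is not the PR's Triton kernel, has not been verified byte-for-byte against Pillow, and the helper names (`triangle`, `fixed_point_coeffs`, `resample_row`) are hypothetical.

```python
PRECISION_BITS = 22  # value quoted in the PR description


def triangle(x):
    """Bilinear (triangle) filter with support 1.0."""
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0


def fixed_point_coeffs(in_size, out_size):
    """For each output pixel: (first input index, integer weights)."""
    scale = in_size / out_size
    filterscale = max(scale, 1.0)  # widen the filter only when downscaling
    support = 1.0 * filterscale
    table = []
    for xx in range(out_size):
        center = (xx + 0.5) * scale
        xmin = max(int(center - support + 0.5), 0)
        xmax = min(int(center + support + 0.5), in_size)
        weights = [triangle((x - center + 0.5) / filterscale) for x in range(xmin, xmax)]
        total = sum(weights) or 1.0
        # Normalize to sum 1, then quantize; triangle weights are never negative.
        quantized = [int(0.5 + w / total * (1 << PRECISION_BITS)) for w in weights]
        table.append((xmin, quantized))
    return table


def resample_row(row, out_size):
    """Resample one row of uint8 values; the result is again uint8."""
    out = []
    for xmin, weights in fixed_point_coeffs(len(row), out_size):
        acc = 1 << (PRECISION_BITS - 1)  # rounding bias before the shift
        for i, w in enumerate(weights):
            acc += row[xmin + i] * w
        out.append(min(255, max(0, acc >> PRECISION_BITS)))
    return out
```

In a separable resize, this pass would run over rows first, and then the same routine would run over the columns of the resulting uint8 image. The quantization to uint8 between the two passes is what the description refers to as the horizontal uint8 intermediate.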

…hed input

Move the Triton fast-path gate from RFDetrForInstanceSegmentationTRT into pre_process_network_input so all six RFDetr classes (seg×{TRT,ONNX,Torch} and detect×{TRT,ONNX,Torch}) can hit it, and widen the predicate to accept torch uint8 HWC tensors on any device plus batched inputs (list[ndarray], list[Tensor], 4D ndarray/Tensor; the outer function already unbinds those to lists before the per-item check).

Color-swap parity fix: the PIL path does `image[:, :, ::-1]` whenever `input_color_mode != network_input.color_mode`, which is True for an unspecified caller (None). The old fast path treated None as BGR and skipped the swap when the network was also BGR. That was byte-identical to PIL for packaged seg models but diverged from PIL on og-rfdetr-base (ColorMode.BGR network with None caller). Align the kernel swap condition with PIL's.

Integration coverage (144 tests, CUDA 13):
- baseline: 4 tests hit fast path, 6 / 160 pre_process calls
- widened: 35 tests hit fast path, 43 / 166 pre_process calls
- pass rate unchanged: 144 / 144

The remaining ~100 tests miss on predicate categories that require kernel extensions (static_crop, contrast, dataset_version_resize) and are tracked as follow-up work.
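The color-swap fix described above comes down to how an unspecified (None) caller color mode is compared against the network's mode. The sketch below contrasts the two rules; the `ColorMode` enum here is a stand-in defined for the example, and both function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class ColorMode(Enum):
    # Stand-in for the project's own ColorMode type.
    RGB = "rgb"
    BGR = "bgr"


def needs_swap(input_color_mode: Optional[ColorMode], network_color_mode: ColorMode) -> bool:
    """PIL-path rule: swap whenever the modes differ; None differs from every mode."""
    return input_color_mode != network_color_mode


def needs_swap_old(input_color_mode: Optional[ColorMode], network_color_mode: ColorMode) -> bool:
    """Earlier fast-path rule: an unspecified caller was treated as BGR."""
    effective = ColorMode.BGR if input_color_mode is None else input_color_mode
    return effective != network_color_mode


# The two rules disagree only for a None caller on a BGR network.
disagreements = [
    (caller, net)
    for caller in (None, ColorMode.RGB, ColorMode.BGR)
    for net in (ColorMode.RGB, ColorMode.BGR)
    if needs_swap(caller, net) != needs_swap_old(caller, net)
]
```

Enumerating every combination this way shows why packaged models with an RGB network were unaffected while a BGR network with a None caller diverged.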

Apply static_crop as a load-time offset in the kernel (+ crop-dims-based resample tables), matching apply_static_crop_to_numpy_image's pixel-coordinate percentage math. Extends fast-path coverage from 35 → 55 of 144 rfdetr integration tests, pass rate unchanged (144/144).
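The idea of a crop as a load-time offset can be illustrated without any GPU code: reading the full image at `(y0 + y, x0 + x)` gives the same values as materializing the crop and reading it at `(y, x)`. The sketch below uses nested lists and a plain gather as a stand-in for the kernel's resample taps. The function names are hypothetical, and the conversion from crop percentages to the pixel offsets `x0, y0` is deliberately left out because its exact rounding is not given in this PR text.

```python
def crop_then_sample(image, x0, y0, crop_w, crop_h, ys, xs):
    """Reference: materialize the crop first, then read the requested taps."""
    cropped = [row[x0:x0 + crop_w] for row in image[y0:y0 + crop_h]]
    return [[cropped[y][x] for x in xs] for y in ys]


def sample_with_offset(image, x0, y0, ys, xs):
    """Fused variant: no intermediate crop, the offset is added at load time."""
    return [[image[y0 + y][x0 + x] for x in xs] for y in ys]


# 6x8 test image whose pixel value encodes its own coordinates.
image = [[10 * y + x for x in range(8)] for y in range(6)]
taps_y, taps_x = [0, 2, 3], [1, 4]  # tap coordinates inside the crop
reference = crop_then_sample(image, 2, 1, 5, 4, taps_y, taps_x)
fused = sample_with_offset(image, 2, 1, taps_y, taps_x)
```

Because the taps are computed relative to the crop dimensions, the resample tables have to be built from the crop size rather than the full image size, which matches the "crop-dims-based resample tables" note above.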

torchvision.io.read_image returns CHW uint8 — 72 test calls in the integration suite arrive in that layout. Mirror _tensor_to_hwc_uint8's CHW heuristic (first dim in {1,3,4} and last dim not in {1,3,4}) and permute to HWC before the kernel. Integration coverage 55 → 113 tests hitting fast path, zero regressions.
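The layout heuristic quoted above can be written as a small predicate on the tensor shape. This is a sketch of the stated rule only (first dim in {1,3,4} and last dim not in {1,3,4}); the function names are hypothetical and the real `_tensor_to_hwc_uint8` may apply further checks.

```python
CHANNEL_COUNTS = {1, 3, 4}


def looks_like_chw(shape):
    """True when a 3-D shape is more plausibly CHW than HWC."""
    return (
        len(shape) == 3
        and shape[0] in CHANNEL_COUNTS
        and shape[-1] not in CHANNEL_COUNTS
    )


def hwc_shape(shape):
    """Shape after the permute-to-HWC step (unchanged if already HWC)."""
    if looks_like_chw(shape):
        channels, height, width = shape
        return (height, width, channels)
    return tuple(shape)
```

Note that the rule is ambiguous by design for shapes such as `(3, 5, 3)`: the last dimension already looks like a channel count, so the input is left as HWC.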

INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED (default true) acts as a kill switch. Setting it to false short-circuits _fast_path_eligible so every call falls back to the PIL reference path, which is useful for A/B benchmarking and as an escape hatch if the fused kernel is ever implicated in a regression. Verified on rfdetr-seg-nano: the kernel fires on every eligible call when env=true (5000/5000 on full coco/val2017) and never fires when env=false (0/5000), with byte-identical predictions in both states.

e2e on vehicles_312px.mp4 (538 frames, rfdetr-seg-nano TRT):
- env=true: 99.3 fps
- env=false: 76.2 fps
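A minimal sketch of an environment-variable gate of this kind is shown below. The helper names (`env_flag`, `fast_path_eligible`), the demo variable name, and the set of strings accepted as true are assumptions for illustration, not the project's actual parsing rules.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}  # assumed spelling rules


def env_flag(name, default):
    """Read a boolean flag from the environment, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def fast_path_eligible(preconditions_met, enabled):
    """The flag short-circuits the gate: when disabled, nothing is eligible."""
    return enabled and preconditions_met


# Demo with a throwaway variable name so the real setting is not touched.
os.environ["DEMO_TRITON_PREPROC_ENABLED"] = "false"
enabled = env_flag("DEMO_TRITON_PREPROC_ENABLED", default=True)
use_fast_path = fast_path_eligible(preconditions_met=True, enabled=enabled)
```

If the flag is read once at module import, changing the variable afterwards has no effect in the same process, which is why an A/B comparison needs a fresh process per setting.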

All scripts are driven by INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED (true → Triton fast path, false → PIL reference). Each run prints a Triton kernel invocation count, so the console shows which path handled each image.

Scripts:
- parity_triton_vs_pil.py: kernel-vs-PIL fp32 ULP check (20 imgs, direct kernel call; bypasses the model stack)
- detection_parity_full.py: 5000-img end-to-end parity driver. Spawns one subprocess per env value (so the module-level env read is re-done), pickles per-image detections + Triton call count, then compares.
- parity_env_var.py: same idea at 100 imgs, a quick sanity run.
- coco_map.py: bbox + segm mAP via pycocotools; run twice with env=true/false to confirm mAP matches to 4 decimals.
- preproc_microbench.py: isolated pre_process() timing at 312² / 720×1280 / 1080×1920.
- _fastpath_trace.py: shared instrumentation helper. Patches _fast_path_eligible + _fast_path_preprocess + triton_preprocess_rfdetr_stretch + the two PIL fallbacks and prints per-surface call counts at exit. Used by `python run_with_trace.py <script>` for independent kill-switch audits.

Co-authored-by: Lee Clement <lee@roboflow.com>

dkosowski87

Really nice idea 👍 left a couple of comments relating to our setup

dkosowski87 · 2026-05-12T11:02:39Z

+import torch
+
+try:
+    import triton


Right now in inference_models/uv.lock we can see that triton is installed because it is a dependency of torch. We explicitly import triton here, so we have a direct dependency. My gut feeling, though, is that pinning the dependency separately may create some headaches, given that we have more than one extra with a separate torch setup. So perhaps leaving it as is, with the triton dependency bundled with torch, is a better approach. WDYT @PawelPeczek-Roboflow ?

How about providing an explicit definition of triton with broad boundaries?

triton now in pyproject.toml

CLAassistant · 2026-05-13T21:39:35Z

All committers have signed the CLA.

dkosowski87

One last comment. The PR looks good. 👍
I'll do a test on our side and if everything goes ok, I'll approve and merge.

dkosowski87

Sorry for not mentioning this earlier, but I see a need for adding some tests around this change. At minimum:

Numerical parity - we have the _reference_pipeline and _build_network_input helpers, we can monkeypatch using USE_TRITON_FOR_PREPROCESSING=True and add proper asserts. This would be @pytest.mark.gpu_only. 2-3 pre-processing scenarios, different config.
Integration check - mock to check if triton_path_preprocess path is correctly being resolved given the possible configuration

aseembits93 · 2026-06-03T01:45:32Z

Opening another PR with an alternative kernel which is even faster for larger images.

aseembits93 added 9 commits May 9, 2026 00:30

rename flag and move to env.py

06a02b2

almost ready

338feeb

removing temp benchmark and sanity check files

63522cc

aseembits93 requested review from PawelPeczek-Roboflow, dkosowski87, grzegorz-roboflow, hansent, probicheaux, rafel-roboflow and yeldarby as code owners May 12, 2026 05:52

aseembits93 added 2 commits May 11, 2026 22:55

Merge branch 'main' into perf/rfdetr-seg-triton-widen-scope

ecc2345

Merge branch 'main' into perf/rfdetr-seg-triton-widen-scope

442516e

leeclemnet reviewed May 12, 2026

Comment thread inference_models/inference_models/models/rfdetr/pre_processing.py Outdated

aseembits93 and others added 2 commits May 12, 2026 10:10

Update inference_models/inference_models/models/rfdetr/pre_processing.py

8c10378

Co-authored-by: Lee Clement <lee@roboflow.com>

Merge branch 'main' into perf/rfdetr-seg-triton-widen-scope

ca5c5e4

dkosowski87 reviewed May 12, 2026

aseembits93 added 2 commits May 12, 2026 12:20

Merge branch 'main' into perf/rfdetr-seg-triton-widen-scope

5653b96

move env var to inference_models

978d80e

aseembits93 added 2 commits May 13, 2026 21:46

add seed for reproducibility

f0e7717

remove static crop var

6b4f745

aseembits93 force-pushed the perf/rfdetr-seg-triton-widen-scope branch from 6c974b7 to 6b4f745 Compare May 13, 2026 21:47

aseembits93 added 2 commits May 13, 2026 14:48

Merge branch 'main' into perf/rfdetr-seg-triton-widen-scope

84459cf

Use model errors in Triton preprocess

03e3aa3

aseembits93 added 4 commits May 13, 2026 22:07

Warn when Triton preprocessing is unavailable

1a80fd9

Bound RF-DETR resample table cache

8ec83eb

update changelog

30b4129

Merge branch 'main' into perf/rfdetr-seg-triton-widen-scope

f23789f

dkosowski87 reviewed May 19, 2026

Comment thread inference_models/docs/changelog.md Outdated

aseembits93 added 4 commits May 19, 2026 11:08

remove testing scripts, move changelog update to new version

5761e0f

remove minimal workflow script

424319c

make style make check_code_quality pass

7e4609f

Merge branch 'main' into perf/rfdetr-seg-triton-widen-scope

b859a26

dkosowski87 reviewed May 21, 2026

aseembits93 added 7 commits May 22, 2026 00:39

add correctness and integration test

6e83901

Merge branch 'main' into perf/rfdetr-seg-triton-widen-scope

8b13056

typo

643ad84

Merge branch 'main' into perf/rfdetr-seg-triton-widen-scope

3d7cc82

tighter bounds on correctness

6136c92

Merge branch 'main' into perf/rfdetr-seg-triton-widen-scope

ee72821

default is opt-in for triton preproc

e3806f5

aseembits93 mentioned this pull request Jun 2, 2026

Optimize RF-DETR Triton preprocessing aseembits93/inference#45

Open

10 tasks

aseembits93 closed this Jun 3, 2026

aseembits93 wants to merge 34 commits into roboflow:main from aseembits93:perf/rfdetr-seg-triton-widen-scope

6 participants
