perf: Triton pre-processing kernel #2328
Replace the per-frame PIL-bilinear-antialias + to_tensor + normalize chain
in the RF-DETR TRT instance-segmentation model with a single Triton
kernel that resizes, swaps BGR↔RGB, scales by 1/255, and applies
ImageNet normalization — writing straight into the preallocated TRT
input buffer.
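For reference, the per-pixel arithmetic that the kernel fuses after the resize can be sketched in plain Python. This is only an illustration, not the kernel code: the helper name `normalize_pixel` is hypothetical, and the mean/std constants are the commonly used ImageNet statistics, assumed here rather than quoted from the PR.

```python
# Commonly used ImageNet statistics in RGB order (assumed, not quoted from the PR).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(pixel, swap_channels=True):
    """Map one uint8 3-channel pixel to normalized floats.

    Hypothetical helper: optionally reverses the channel order (BGR <-> RGB),
    scales by 1/255, then applies (x - mean) / std per channel.
    """
    if swap_channels:
        pixel = pixel[::-1]
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)
    )
```

Doing these steps inside the same kernel as the resize is what lets the result be written straight into the network input buffer without intermediate tensors.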
Byte-exact port of PIL's separable bilinear-antialias resize
(PRECISION_BITS=22, int32 fixed-point, uint8 quantization between the
horizontal and vertical passes). The horizontal uint8 intermediate
lives in registers.
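The fixed-point scheme described above can be sketched for a single row in plain Python. This is a simplified 1-D illustration modeled on PIL's 8-bit resampling path (triangle filter, coefficients rounded to integers with 22 fractional bits, result quantized back to uint8); the function names are hypothetical and this is neither the Triton kernel nor PIL's C code.

```python
PRECISION_BITS = 32 - 8 - 2  # 22 fractional bits for 8-bit data


def triangle(x):
    """Bilinear (triangle) filter with support 1.0."""
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0


def fixed_point_coeffs(in_size, out_size):
    """Return (start index, integer weights) for every output position."""
    scale = in_size / out_size
    filterscale = max(scale, 1.0)  # widen the filter only when downscaling
    support = 1.0 * filterscale
    inv_scale = 1.0 / filterscale
    coeffs = []
    for xx in range(out_size):
        center = (xx + 0.5) * scale
        xmin = max(int(center - support + 0.5), 0)
        xmax = min(int(center + support + 0.5), in_size)
        weights = [triangle((x - center + 0.5) * inv_scale) for x in range(xmin, xmax)]
        total = sum(weights)
        if total != 0.0:
            weights = [w / total for w in weights]
        # Triangle weights are non-negative, so rounding is floor(w * 2^22 + 0.5).
        fixed = [int(0.5 + w * (1 << PRECISION_BITS)) for w in weights]
        coeffs.append((xmin, fixed))
    return coeffs


def resample_row(row, out_size):
    """Resize one row of uint8 values and quantize the result back to uint8."""
    out = []
    for xmin, fixed in fixed_point_coeffs(len(row), out_size):
        acc = 1 << (PRECISION_BITS - 1)  # rounding bias equal to 0.5
        for offset, weight in enumerate(fixed):
            acc += row[xmin + offset] * weight
        out.append(min(max(acc >> PRECISION_BITS, 0), 255))
    return out
```

A vertical pass would apply the same routine to the columns of the uint8 output of the horizontal pass, which is where the quantization between the two passes comes from.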
Correctness
- Preproc max abs error vs PIL: 4.77e-7 (fp32 ULP on the final
/255+normalize step; the uint8 resize result is byte-identical).
- Full coco/val2017 detection parity (rfdetr-seg-nano, conf=0.4):
26,721 / 26,721 matched at IoU>0.5, mean box IoU 1.0000,
|Δscore| 0, 0 class-id disagreements, all matched masks
pixel-identical.
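A detection-parity comparison of this kind could be sketched as follows. The greedy matching strategy, the function names, and the detection dict layout (`box`, `score`, `class_id`) are assumptions made for illustration; the PR does not spell out how its parity driver pairs detections.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(reference, candidate, iou_threshold=0.5):
    """Greedily pair each reference detection with its best unused candidate.

    Only pairs with IoU strictly above the threshold are reported.
    """
    unused = list(range(len(candidate)))
    matches = []
    for ref in reference:
        best_idx, best_iou = None, iou_threshold
        for idx in unused:
            iou = box_iou(ref["box"], candidate[idx]["box"])
            if iou > best_iou:
                best_idx, best_iou = idx, iou
        if best_idx is None:
            continue
        unused.remove(best_idx)
        cand = candidate[best_idx]
        matches.append(
            {
                "iou": best_iou,
                "score_delta": abs(ref["score"] - cand["score"]),
                "class_agrees": ref["class_id"] == cand["class_id"],
            }
        )
    return matches
```

Under this sketch, full parity means every reference detection is matched, every IoU is 1.0, every score delta is 0, and every class agrees.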
Performance (vehicles_312px.mp4, 538 frames)
- Baseline (PIL path): 76.25 fps
- Triton fast path: 99.83 fps (+31%)
- Preproc microbench (1080p → 312²): 27.0 ms → 2.8 ms per frame (~10×)
Scope
- Gated on: single-image numpy uint8 HWC input, stretch/letterbox/
center-crop/letterbox-reflect resize modes (all collapse to a single
PIL stretch when dataset_version_resize_dimensions is None, verified
via synthetic-package test), no static_crop/grayscale/contrast,
3-channel, scaling_factor in {None, 255}, normalization set.
- Falls back to the existing PIL-based pre_process_network_input
when any precondition fails.
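The initial gate listed above can be sketched as a single predicate. Everything here is a hypothetical stand-in: `PreprocConfig`, its field names, and `fast_path_eligible` are not the real configuration objects, and the image is duck-typed through `shape` and `dtype` so that the sketch does not depend on numpy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Resize modes that collapse to a plain stretch when no dataset-version resize is set.
SUPPORTED_RESIZE_MODES = {"stretch", "letterbox", "center-crop", "letterbox-reflect"}


@dataclass
class PreprocConfig:
    """Hypothetical, simplified view of the pre-processing configuration."""

    resize_mode: str
    dataset_version_resize_dimensions: Optional[Tuple[int, int]] = None
    static_crop: bool = False
    grayscale: bool = False
    contrast: bool = False
    scaling_factor: Optional[int] = None
    has_normalization: bool = True


def fast_path_eligible(image, config: PreprocConfig) -> bool:
    """Return True only when every precondition of the fused kernel holds."""
    shape = getattr(image, "shape", None)
    dtype = getattr(image, "dtype", None)
    if shape is None or len(shape) != 3 or shape[2] != 3:
        return False  # not a single 3-channel HWC image
    if str(dtype) != "uint8":
        return False
    return (
        config.resize_mode in SUPPORTED_RESIZE_MODES
        and config.dataset_version_resize_dimensions is None
        and not (config.static_crop or config.grayscale or config.contrast)
        and config.scaling_factor in (None, 255)
        and config.has_normalization
    )
```

Any `False` result would route the call to the existing PIL-based path, so the predicate can stay conservative without affecting correctness.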
Also adds the benchmark driver
development/stream_interface/rfdetr_nano_seg_trt_workflow.py used to
measure the above numbers.
…hed input
Move the Triton fast-path gate from RFDetrForInstanceSegmentationTRT into
pre_process_network_input so all six RFDetr classes (seg×{TRT,ONNX,Torch}
and detect×{TRT,ONNX,Torch}) can hit it, and widen the predicate to
accept torch uint8 HWC tensors on any device plus batched inputs
(list[ndarray], list[Tensor], 4D ndarray/Tensor — the outer function
already unbinds those to lists before the per-item check).
Color-swap parity fix: the PIL path does `image[:, :, ::-1]` whenever
`input_color_mode != network_input.color_mode`, which is True for an
unspecified caller (None). The old fast-path treated None as BGR and
skipped the swap when the network was also BGR — byte-identical to PIL
for packaged seg models but diverged from PIL on og-rfdetr-base
(ColorMode.BGR network with None caller). Align the kernel swap
condition with PIL's.
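The difference between the two swap conditions can be made concrete with a small sketch. The `ColorMode` enum below is a stand-in defined only for this example (its member values are assumptions), and both function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class ColorMode(Enum):
    """Stand-in enum for illustration; not the library's definition."""

    RGB = "rgb"
    BGR = "bgr"


def pil_path_swaps(input_color_mode: Optional[ColorMode], network_color_mode: ColorMode) -> bool:
    """PIL-path rule: swap whenever the two modes differ, and None always differs."""
    return input_color_mode != network_color_mode


def old_fast_path_swaps(input_color_mode: Optional[ColorMode], network_color_mode: ColorMode) -> bool:
    """Earlier fast-path rule: an unspecified caller was treated as BGR."""
    effective = ColorMode.BGR if input_color_mode is None else input_color_mode
    return effective != network_color_mode
```

The two rules agree for an RGB network with an unspecified caller and disagree only for a BGR network with an unspecified caller, which is the case the fix addresses.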
Integration coverage (144 tests, CUDA 13):
baseline: 4 tests hit fast path, 6 / 160 pre_process calls
widened : 35 tests hit fast path, 43 / 166 pre_process calls
pass rate unchanged: 144 / 144.
Remaining ~100 tests miss on predicate categories that require kernel
extensions (static_crop, contrast, dataset_version_resize) and are
tracked as follow-up work.
Apply static_crop as a load-time offset in the kernel (+ crop-dims-based resample tables), matching apply_static_crop_to_numpy_image's pixel-coordinate percentage math. Extends fast-path coverage from 35 → 55 of 144 rfdetr integration tests, pass rate unchanged (144/144).
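The idea of treating the crop as a load-time offset can be sketched as below. The percentage-to-pixel conversion shown here (truncation of `dimension * percent / 100`) is an assumption made for illustration and may not match `apply_static_crop_to_numpy_image` exactly; both function names are hypothetical.

```python
def static_crop_window(height, width, x_min_pct, y_min_pct, x_max_pct, y_max_pct):
    """Convert percentage crop bounds into (x0, y0, crop_width, crop_height).

    Assumption: pixel coordinates are obtained by truncating
    dimension * percent / 100; the real helper may round differently.
    """
    x0 = int(width * x_min_pct / 100)
    y0 = int(height * y_min_pct / 100)
    x1 = int(width * x_max_pct / 100)
    y1 = int(height * y_max_pct / 100)
    return x0, y0, x1 - x0, y1 - y0


def source_index(crop_offset, index_in_crop):
    """Translate an index computed against the cropped extent back to the full image."""
    return crop_offset + index_in_crop
```

With this layout the resample tables are built from `crop_width` and `crop_height`, and every load simply adds `x0` or `y0`, so no cropped copy of the image has to be materialized.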
torchvision.io.read_image returns CHW uint8 — 72 test calls in the
integration suite arrive in that layout. Mirror _tensor_to_hwc_uint8's
CHW heuristic (first dim in {1,3,4} and last dim not in {1,3,4}) and
permute to HWC before the kernel. Integration coverage 55 → 113 tests
hitting fast path, zero regressions.
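The layout heuristic described above can be written down directly on shape tuples. This is a sketch of the stated rule only; the function names are hypothetical and the real `_tensor_to_hwc_uint8` may contain additional checks.

```python
CHANNEL_COUNTS = {1, 3, 4}


def looks_like_chw(shape):
    """True when a 3-D shape is most plausibly channels-first.

    Rule from the description: first dim in {1, 3, 4} and last dim not in {1, 3, 4}.
    """
    return (
        len(shape) == 3
        and shape[0] in CHANNEL_COUNTS
        and shape[-1] not in CHANNEL_COUNTS
    )


def hwc_axis_order(shape):
    """Axis permutation that brings the input to HWC (identity if it already is)."""
    return (1, 2, 0) if looks_like_chw(shape) else (0, 1, 2)
```

An ambiguous shape such as `(3, 5, 3)` is left untouched by this rule, since its last dimension is also a valid channel count.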
INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED (default true). Setting
it to false short-circuits _fast_path_eligible so every call falls back
to the PIL reference path. This is useful for A/B benchmarking and as an
escape hatch if the fused kernel is ever implicated in a regression.
Verified on rfdetr-seg-nano: the kernel fires on every eligible call
when env=true (5000/5000 on full coco/val2017) and never fires when
env=false (0/5000), with byte-identical predictions in both states.
e2e on vehicles_312px.mp4 (538 frames, rfdetr-seg-nano TRT):
env=true : 99.3 fps
env=false: 76.2 fps
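A module-level kill switch of this kind might look like the sketch below. Only the environment variable name and its default come from the description; the helper names and the set of accepted truthy spellings are assumptions.

```python
import os

ENV_NAME = "INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED"


def read_flag(environ=None, default=True):
    """Parse the kill switch; an unset variable keeps the default (enabled)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_NAME)
    if raw is None:
        return default
    # Accepted spellings are an assumption, not taken from the PR.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Read once at import time, so changing the variable requires a fresh process.
TRITON_PREPROC_ENABLED = read_flag()


def gate(other_checks_pass, enabled=None):
    """Short-circuit eligibility when the switch is off."""
    enabled = TRITON_PREPROC_ENABLED if enabled is None else enabled
    return enabled and other_checks_pass
```

Because the value is captured at import, comparing both settings in one script means spawning a separate process per setting.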
All scripts are driven by INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED
(true → Triton fast path, false → PIL reference). Each run prints a
Triton kernel invocation count so it is visible from the console which
path handled each image.
Scripts:
- parity_triton_vs_pil.py — kernel-vs-PIL fp32 ULP check (20 imgs, direct
kernel call; bypasses the model stack)
- detection_parity_full.py — 5000-img end-to-end parity driver. Spawns
one subprocess per env value (so the module-
level env read is re-done), pickles per-image
detections + Triton call count, then compares.
- parity_env_var.py — same idea at 100 imgs, a quick sanity run.
- coco_map.py — bbox + segm mAP via pycocotools; run twice
with env=true/false to confirm mAP matches
to 4 decimals.
- preproc_microbench.py — isolated pre_process() timing at
312² / 720×1280 / 1080×1920.
- _fastpath_trace.py — shared instrumentation helper. Patches
_fast_path_eligible + _fast_path_preprocess +
triton_preprocess_rfdetr_stretch + the two
PIL fallbacks and prints per-surface call
counts at exit. Used by `python run_with_trace.py
<script>` for independent kill-switch audits.
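The call-counting instrumentation could be approximated with a small wrapper-based helper like the one below. This is a sketch of the general patching technique, not the contents of `_fastpath_trace.py`; the names `count_calls`, `report`, and `CALL_COUNTS` are hypothetical.

```python
import functools
from collections import Counter

CALL_COUNTS = Counter()


def count_calls(namespace, name):
    """Replace namespace.<name> with a wrapper that counts invocations.

    Returns the original callable so the patch can be undone.
    """
    original = getattr(namespace, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        CALL_COUNTS[name] += 1
        return original(*args, **kwargs)

    setattr(namespace, name, wrapper)
    return original


def report():
    """Format per-surface call counts, e.g. for printing from an atexit hook."""
    return ", ".join(f"{name}={count}" for name, count in sorted(CALL_COUNTS.items()))
```

Registering `report` through `atexit.register(lambda: print(report()))` after patching would print the counts when the traced script exits.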
Co-authored-by: Lee Clement <lee@roboflow.com>
dkosowski87
left a comment
Really nice idea 👍 left a couple of comments relating to our setup
    import torch

    try:
        import triton
Right now in inference_models/uv.lock we can see that triton is installed because it is a dependency of torch. We explicitly import triton here, so we have a direct dependency. My gut feeling, though, is that pinning the dependency separately may create some headaches, given that we have more than one extra with a separate torch setup. So perhaps leaving it as is, with the triton dependency bundled with torch, is a better approach. WDYT @PawelPeczek-Roboflow ?
How about providing an explicit definition of triton with broad boundaries?
triton now in pyproject.toml
6c974b7 to 6b4f745
dkosowski87
left a comment
One last comment. PR looks good. 👍
I'll do a test on our side and if everything goes ok, I'll approve and merge.
dkosowski87
left a comment
Sorry for not mentioning this earlier, but I see a need for adding some tests around this change. At minimum:
- Numerical parity - we have the `_reference_pipeline` and `_build_network_input` helpers, we can monkeypatch using `USE_TRITON_FOR_PREPROCESSING=True` and add proper asserts. This would be `@pytest.mark.gpu_only`. 2-3 pre-processing scenarios, different config.
- Integration check - mock to check if `triton_path_preprocess` path is correctly being resolved given the possible configuration
Opening another PR with an alternative kernel which is even faster for larger images.
What does this PR do?
Introduces a single Triton CUDA kernel that performs all image pre-processing, eliminating kernel launch overheads and reducing CPU<->GPU memory transfers to the bare minimum. The kernel supports all options EXCEPT `dataset_version_resize` (two-stage resize: cv2 dataset-version resize → PIL stretch) and `contrast` (three distinct algorithms: histogram eq / CLAHE / contrast stretching; each needs its own prepass kernel).
Type of Change
Testing
Test details: (Make sure triton is installed in your environment)
vehicles_312px.mp4 (538 frames, src 312×176): T4 GPU
vehicles_1080p.mp4 (538 frames, src 1920×1080 — preproc has real resize work): T4 GPU
coco/ in the current working directory. Run `python temp/detection_parity_full.py`
pre_process()) T4 GPU
https://drive.google.com/file/d/1aXWk0hgTsMsfUDwqF7wxqK9YrgMUPH09/view?usp=sharing
Checklist
Additional Context