rfdetr-seg: single-kernel Triton full post-processing#2378
Closed
aseembits93 wants to merge 27 commits into
…ULLPOSTPROC
Adds two fused Triton kernels that replace the post-TRT chain for the
common rfdetr-seg-nano path (batch=1, no static crop, STRETCH_TO resize,
class remapping present):
- _rfdetr_fullpost_filter_kernel: sigmoid/argmax/class remap/conf threshold
+ cxcywh->xyxy + letterbox denormalize + clip + banker's rounding;
atomic_add into a counter to reserve a compact output slot.
- _rfdetr_fullpost_mask_kernel_compact: bilinear upsample masks to orig
size + threshold + uint8 emit, with GPU-side early exit against the
counter so no CPU sync is needed between the two kernels.
Dispatch is gated on ``RFDETR_TRITON_FULLPOSTPROC=true``; callers keep the
torch reference path when Triton is unavailable or the eligibility checks
fail. The adapter reads a 4-byte counter via a pinned DtoH to learn
``n_survivors`` and then async-DtoHs the compact combined/mask slices.
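The box-decoding steps listed for the filter kernel can be sketched in plain Python. This is an illustrative host-side sketch, not the Triton kernel: the function name `decode_box`, the clip range `[0, orig_w]` / `[0, orig_h]`, and the assumption that boxes arrive as normalized cxcywh are all hypothetical choices made for the example.

```python
def decode_box(cx, cy, w, h, orig_w, orig_h):
    """Normalized cxcywh -> integer xyxy in original-image pixels (sketch)."""
    # cxcywh -> xyxy, still in normalized [0, 1] coordinates.
    x1, y1 = cx - w / 2, cy - h / 2
    x2, y2 = cx + w / 2, cy + h / 2
    # With a stretch resize, denormalizing is an independent scale per axis.
    x1, x2 = x1 * orig_w, x2 * orig_w
    y1, y2 = y1 * orig_h, y2 * orig_h
    # Clip to the image bounds (assumed range: [0, orig_w] and [0, orig_h]).
    x1, x2 = (min(max(v, 0.0), float(orig_w)) for v in (x1, x2))
    y1, y2 = (min(max(v, 0.0), float(orig_h)) for v in (y1, y2))
    # Python 3's round() rounds halves to the nearest even integer,
    # which is the banker's rounding behaviour named in the commit message.
    return tuple(round(v) for v in (x1, y1, x2, y2))


box = decode_box(0.5, 0.5, 0.5, 0.5, orig_w=100, orig_h=50)
print(box)  # (25, 12, 75, 38): 12.5 rounds down to 12, 37.5 rounds up to 38
```

The half-way cases are where banker's rounding differs from round-half-up, so a parity check against a reference is most sensitive at coordinates such as 12.5 and 37.5.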
The non-RLE fullpost path was returning the full num_queries-row scratch buffer to InstanceDetections, exposing uninitialized rows past the survivor counter and leaving the conf column as int32 bits. Wait on done_event, slice combined and mask_bin to [:n_survivors], and reinterpret the conf column with .view(torch.float32) — mirroring the RLE variant. Adds temp/detection_parity_full.py to study the fused path against the torch reference across coco/val2017.
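The two fixes described above (slicing to the survivor count and reinterpreting the confidence bits) can be illustrated without a GPU. The sketch below uses `struct` to mimic what `.view(torch.float32)` does on an int32 column; the row layout `[x1, y1, x2, y2, conf_bits, class_id]` and the helper names are assumptions for illustration, not the PR's actual buffer layout.

```python
import struct


def float_to_int32_bits(x):
    """Store a float32 value as its raw int32 bit pattern."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def int32_bits_to_float(bits):
    """Reinterpret int32 bits as float32 (a bit cast, not a numeric cast)."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def read_survivors(scratch, n_survivors):
    """Keep only the rows written by the kernel and decode the conf column."""
    rows = scratch[:n_survivors]  # rows past the counter are uninitialized
    return [(tuple(r[:4]), int32_bits_to_float(r[4]), r[5]) for r in rows]


# Hypothetical scratch buffer: 2 valid rows followed by garbage.
scratch = [
    [10, 20, 30, 40, float_to_int32_bits(0.5), 3],
    [5, 6, 70, 80, float_to_int32_bits(0.75), 1],
    [-1, -1, -1, -1, 123456789, -1],  # stands in for an uninitialized row
]
print(read_survivors(scratch, n_survivors=2))
```

A numeric cast of the bit pattern for 0.5 would yield 1056964608.0 rather than 0.5, which is why the column has to be reinterpreted instead of converted.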
The filter kernel was per-query argmax, but the torch reference does top-k-flat over the (Q*C) sigmoid grid (num_select == num_queries) — a single query can contribute multiple detections, one per class that survives remap + threshold. Per-query argmax silently dropped the secondary classes. Reshape the kernel grid to (num_queries, num_classes_total) so each (q, c) pair is processed independently: load one logit, remap, sigmoid, threshold, emit. Cap output at num_queries to mirror the reference's top-K cap; host clamps n_survivors to combined.shape[0] since the atomic counter increments before the slot guard. Validated on coco/val2017 (1500 images): det counts match exactly 8037/8037 vs 7995/8037 before, and on the same-logits parity script all 1663 matched detections agree to fp32 epsilon when the matcher is class-aware.
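The difference between the two selection schemes can be shown on a tiny logit grid. This is a sequential Python sketch under simplifying assumptions: class remapping is omitted, the threshold comparison is assumed to be strict, and the plain integer `counter` stands in for the kernel's atomic counter. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def filter_per_query_argmax(logits, threshold):
    """Old behaviour: at most one detection per query."""
    out = []
    for q, row in enumerate(logits):
        c = max(range(len(row)), key=row.__getitem__)
        score = sigmoid(row[c])
        if score > threshold:
            out.append((q, c, score))
    return out


def filter_per_pair(logits, threshold, cap):
    """New behaviour: every (query, class) pair is tested independently."""
    counter = 0
    out = []
    for q, row in enumerate(logits):
        for c, logit in enumerate(row):
            score = sigmoid(logit)
            if score > threshold:
                slot = counter
                counter += 1  # the counter advances before the slot guard
                if slot < cap:
                    out.append((q, c, score))
    # Host-side clamp, since the counter may exceed the number of slots.
    return out, min(counter, cap)


logits = [[2.0, 1.5, -3.0], [-2.0, -1.0, -4.0]]
print(len(filter_per_query_argmax(logits, 0.4)))     # 1
print(filter_per_pair(logits, 0.4, cap=len(logits))[1])  # 2
```

Query 0 has two classes above the threshold, so the per-pair variant reports two detections where argmax reports one. When more pairs survive than there are slots, this sketch keeps them in iteration order, whereas a top-k reference keeps the highest scores, so the two are only comparable when the cap is not hit.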
The custom Triton mask kernel implemented bilinear+threshold with `antialias=False` semantics, but the reference path (`align_instance_segmentation_results`) calls `functional.resize(BILINEAR)`, which defaults to `antialias=True`. That produces a different fp32 reduction order on boundary pixels and flips pixels in ~0.3 % of masks (26/8037 across 1500 coco/val2017 images at conf=0.4). Drop the mask kernel and replace it with a host-side gather + `F.interpolate(bilinear, antialias=True, align_corners=False)` + `> 0` that matches the reference bit-for-bit. The same eligibility window guarantees no static crop and `size_after_pre_processing == inference_size`, so the new path can skip the canvas/static-crop branches. Validated on coco/val2017 (1500 images): 8037/8037 pixel-identical masks (was 8011/8037). FPS is unchanged within noise (the filter kernel dominates the postproc cost; the mask path was always cuDNN-bound).
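A "pixel-identical masks" figure like the one quoted above can be computed with a simple comparison. The sketch below is a hypothetical stand-in for such a parity check, with masks represented as nested lists of 0/1; it is not the script used in the PR.

```python
def mask_parity(ref_masks, test_masks):
    """Return (identical_masks, total_masks, flipped_pixels) for paired masks."""
    if len(ref_masks) != len(test_masks):
        raise ValueError("mask lists must be paired one-to-one")
    identical = 0
    flipped = 0
    for ref, test in zip(ref_masks, test_masks):
        # Count pixels whose binary value differs between the two masks.
        diff = sum(
            1
            for ref_row, test_row in zip(ref, test)
            for a, b in zip(ref_row, test_row)
            if a != b
        )
        flipped += diff
        if diff == 0:
            identical += 1
    return identical, len(ref_masks), flipped


ref = [[[0, 1], [1, 1]], [[1, 0], [0, 0]]]
test = [[[0, 1], [1, 1]], [[1, 1], [0, 0]]]
print(mask_parity(ref, test))  # (1, 2, 1)
```

Reporting both the number of identical masks and the number of flipped pixels keeps the two quantities from being confused when summarizing results.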
Opening another PR with an alternative kernel which is even faster for larger images.
What does this PR do?
This PR opens the full RF-DETR segmentation Triton work behind
RFDETR_TRITON_POSTPROC=true:
kernel: flat top-k over the (Q x C) sigmoid grid, class remap / thresholding,
box decode / scaling / banker's rounding, full mask resize /
threshold, packed workflow masks, and optional in-kernel COCO RLE
serialization.
precondition fails.
Type of Change
Testing
Test details:
vehicles_312px.mp4 (538 frames, src 312x176):
vehicles_720p.mp4 (538 frames, src 1280x720):
vehicles_1080p.mp4 (538 frames, src 1920x1080):
./inference_models folder)
Explicit RLE subprocess parity on 50 coco/val2017 images with
mask_format='rle', response_mask_format='rle' ->
278 / 278 pixel-identical decoded masks, 0 count-mismatch images,
max |Δscore| = 1.192e-07.
Unit tests
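For readers unfamiliar with the format behind the RLE parity check, here is a minimal sketch of uncompressed COCO-style run-length encoding and decoding. It assumes the usual COCO convention (column-major pixel order, counts alternating between runs of 0 and runs of 1, starting with 0) and does not cover the compressed string form; the function names are hypothetical and this is not the in-kernel serializer.

```python
def rle_encode(mask):
    """Encode a binary mask (list of rows) as uncompressed COCO-style RLE."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    counts = []
    prev, run = 0, 0  # the first run always counts zeros, possibly length 0
    for x in range(w):          # column-major traversal
        for y in range(h):
            v = 1 if mask[y][x] else 0
            if v != prev:
                counts.append(run)
                prev, run = v, 0
            run += 1
    counts.append(run)
    return {"size": [h, w], "counts": counts}


def rle_decode(rle):
    """Invert rle_encode back into a list of rows."""
    h, w = rle["size"]
    flat, v = [], 0
    for n in rle["counts"]:
        flat.extend([v] * n)
        v = 1 - v
    # flat is column-major, so pixel (y, x) lives at index x * h + y.
    return [[flat[x * h + y] for x in range(w)] for y in range(h)]


mask = [[0, 1], [1, 1]]
rle = rle_encode(mask)
print(rle["counts"])            # [1, 3]
print(rle_decode(rle) == mask)  # True
```

Comparing decoded masks rather than encoded payloads, as the parity check does, keeps the comparison independent of how the runs are serialized.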
Checklist
hard-to-understand areas
Additional Context