perf(rfdetr-seg): skip mask→poly→mask round-trip on workflow path by aseembits93 · Pull Request #25 · aseembits93/inference

aseembits93 · 2026-04-30T03:00:09Z

Summary

When the instance-segmentation adapter is invoked from a workflow, the result goes through a pure-overhead encoding round-trip:

adapter.postprocess:
    GPU masks → masks2poly (cv2.findContours, N times)
              → List[Point(x,y)] pydantic validation per vertex
              → InstanceSegmentationPrediction (validated)
v3 block.run_locally:
    predictions → model_dump(by_alias=True) per response
                → sv.Detections.from_inference → polygon_to_mask (rasterize AGAIN)

Nothing between the adapter output and the sv.Detections sink observes the polygon form, yet we pay polygon extraction + pydantic validation + polygon→mask rasterization per frame.

This change short-circuits the round-trip when request.source == "workflow-execution":

The adapter builds sv.Detections directly from the GPU-derived numpy arrays and attaches it via response.__dict__["_sv_detections_fast"]. Pydantic v2 ignores extra __dict__ keys in model_dump / jsonable_encoder, so HTTP callers are unaffected.
The v3 block detects the marker and routes through a new _post_process_result_fast, which attaches detection_id/parent_id/image_dimensions/inference_id directly onto the pre-built sv.Detections and skips model_dump + convert_inference_detections_batch_to_sv_detections entirely.

Falls back to the existing polygon path whenever the marker is absent (HTTP responses, RLE responses via response_mask_format=rle, non-tensor masks, mixed-source batches).
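The routing rule above can be sketched with plain Python stand-ins. This is only an illustration under assumptions: `FakeResponse`, `attach_fast`, and `pick_path` are hypothetical names, a dataclass stands in for the real pydantic response model, and the all-or-nothing handling of mixed batches is inferred from the fallback list above rather than copied from the PR's code.

```python
from dataclasses import dataclass, field
from typing import Any, List

# Marker key named in this PR; everything else here is hypothetical.
FAST_ATTR = "_sv_detections_fast"


@dataclass
class FakeResponse:
    """Stand-in for the real response model (which is a pydantic model)."""
    predictions: List[Any] = field(default_factory=list)


def attach_fast(response: FakeResponse, detections: Any) -> None:
    # Write into __dict__ directly so the marker is not a declared field.
    response.__dict__[FAST_ATTR] = detections


def pick_path(responses: List[FakeResponse]) -> str:
    # Take the fast path only if every response in the batch has the marker;
    # an empty or mixed batch falls back to the polygon path.
    if responses and all(FAST_ATTR in r.__dict__ for r in responses):
        return "fast"
    return "slow"


workflow_resp = FakeResponse()
attach_fast(workflow_resp, detections=object())
http_resp = FakeResponse()

print(pick_path([workflow_resp]))             # all marked
print(pick_path([workflow_resp, http_resp]))  # mixed batch
```

Keeping the decision in one predicate means the slow path remains the default whenever anything about the batch is unexpected.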

Benchmark

rfdetr-seg-nano (TRT) + Triton preproc + full-Triton postproc + CUDA graphs, vehicles_312px.mp4 (538 frames) via InferencePipeline, T4 GPU:

Configuration (FPS)	Run 1	Run 2	Run 3	Run 4	mean
baseline (main)	151.96	151.29	151.39	152.54	151.80
this change	165.63	164.09	163.21	165.09	164.51

+12.7 FPS, ~+8.4%. Same flags, same 538-frame window.
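As a quick check of the arithmetic, the means and the relative gain can be recomputed from the per-run numbers in the table. The helper name `speedup` is illustrative only.

```python
from statistics import mean

# Per-run FPS copied from the benchmark table above.
baseline = [151.96, 151.29, 151.39, 152.54]
optimized = [165.63, 164.09, 163.21, 165.09]


def speedup(base_runs, new_runs):
    """Return (base mean, new mean, absolute gain, relative gain in %)."""
    base, new = mean(base_runs), mean(new_runs)
    return base, new, new - base, 100.0 * (new - base) / base


base, new, delta, pct = speedup(baseline, optimized)
print(f"gain: +{delta:.1f} FPS ({pct:.1f}%)")
```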

Test plan

pytest tests/workflows/unit_tests/core_steps/models/roboflow/instance_segmentation/test_v3.py — 23/23 pass
Manual benchmark: 4 baseline runs vs 4 optimized runs on the same video with the same env flags (RFDETR_USE_TRITON_PREPROC=true RFDETR_TRITON_FULLPOSTPROC=true ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true)
Exercise the HTTP path to confirm serialized payloads are byte-identical to main (no leakage of the _sv_detections_fast private attr through pydantic serialization)
Exercise RLE response path (response_mask_format=rle) to confirm it still goes through the original polygon/RLE branch

🤖 Generated with Claude Code

`InferenceModelsInstanceSegmentationAdapter.postprocess` used to convert every detection's mask to a polygon via `masks2poly` (cv2 findContours), wrap each vertex in a `Point` pydantic model, and build a validated `InstanceSegmentationPrediction`. The v3 workflow block then called `model_dump` and `sv.Detections.from_inference`, which rasterized those polygons back into masks via `polygon_to_mask`.

When the caller is a workflow (`request.source == "workflow-execution"`), none of that encoding is observable: the v3 block consumes an `sv.Detections` with masks. This change:

* Has the adapter build `sv.Detections` directly from the numpy xyxy/confidence/class_id/mask buffers and attach it via `response.__dict__["_sv_detections_fast"]` (pydantic v2 ignores extra `__dict__` keys in dump/serialize, so HTTP payloads are unaffected). The polygon+pydantic path is preserved for all other callers, including RLE responses.
* Teaches the v3 block to detect the attached `sv.Detections` and route through a new `_post_process_result_fast`, skipping `model_dump` + `convert_inference_detections_batch_to_sv_detections` entirely.

Benchmark on a T4 with rfdetr-seg-nano TRT + Triton preproc + full-Triton postproc + CUDA graphs, streaming vehicles_312px.mp4 (538 frames) via `InferencePipeline`:

* baseline (4 runs): mean 151.80 FPS
* this change (4 runs): mean 164.51 FPS
* **+12.7 FPS, ~+8.4%**

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous fast path handed raw GPU masks straight to `sv.Detections`, skipping the `masks2poly` → `polygon_to_mask` round-trip that the slow path ran. That round-trip has two behavioral side-effects the fast path was inadvertently dropping:

1. Largest-component-only: `findContours(RETR_EXTERNAL)` + picking the contour with the most vertices drops disconnected mask fragments.
2. Hole-filling: `RETR_EXTERNAL` ignores inner contours, so `fillPoly(largest_contour)` fills any holes inside the shape.

In addition, `filter_out_invalid_polygons` and the `>= 3` vertex check in `supervision.process_roboflow_result` drop detections whose largest contour has fewer than 3 points.

This change reproduces the slow-path mask semantics inside `_build_workflow_fastpath_response` by running the same `findContours(RETR_EXTERNAL, CHAIN_APPROX_SIMPLE)` + `fillPoly` per mask, and dropping detections whose largest contour has fewer than 3 vertices. It also factors the shared attr name into `SV_DETECTIONS_FAST_ATTR` in `inference/core/entities/responses/inference.py`.

Verified bit-exact mask equality vs the slow path on synthetic masks with disconnected fragments and interior holes.

Benchmark on a T4 with the full Triton preproc + full-postproc + CUDA-graphs stack, streaming vehicles_312px.mp4 (538 frames) via `InferencePipeline`:

* baseline (no fast path): mean 152.33 FPS
* fast path WITHOUT denoising (wrong): mean 164.51 FPS (+12.2, +8.0%)
* **fast path WITH denoising (this change): mean 163.43 FPS (+11.1, +7.3%)**

Denoising costs ~1 FPS (~0.7%) because the fast path now runs the same `findContours` + `fillPoly` as the slow path. It still eliminates pydantic validation for Point/InstanceSegmentationPrediction, `model_dump`, and the second rasterization inside `sv.Detections.from_inference` → `polygon_to_mask`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude added 2 commits April 30, 2026 02:59

aseembits93 wants to merge 2 commits into main from perf/rfdetr-seg-workflow-fastpath
