perf(rfdetr-seg): skip mask→poly→mask round-trip on workflow path #25
Open
aseembits93 wants to merge 2 commits into
`InferenceModelsInstanceSegmentationAdapter.postprocess` used to convert every detection's mask to a polygon via `masks2poly` (cv2 `findContours`), wrap each vertex in a `Point` pydantic model, and build a validated `InstanceSegmentationPrediction`. The v3 workflow block then called `model_dump` and `sv.Detections.from_inference`, which rasterized those polygons back into masks via `polygon_to_mask`.

When the caller is a workflow (`request.source == "workflow-execution"`), none of that encoding is observable: the v3 block consumes an `sv.Detections` with masks.

This change:

* Has the adapter build `sv.Detections` directly from the numpy xyxy/confidence/class_id/mask buffers and attach it via `response.__dict__["_sv_detections_fast"]` (pydantic v2 ignores extra `__dict__` keys in dump/serialize, so HTTP payloads are unaffected). The polygon+pydantic path is preserved for all other callers, including RLE responses.
* Teaches the v3 block to detect the attached `sv.Detections` and route through a new `_post_process_result_fast`, skipping `model_dump` + `convert_inference_detections_batch_to_sv_detections` entirely.

Benchmark on a T4 with rfdetr-seg-nano TRT + Triton preproc + full-Triton postproc + CUDA graphs, streaming vehicles_312px.mp4 (538 frames) via `InferencePipeline`:

* baseline (4 runs): mean 151.80 FPS
* this change (4 runs): mean 164.51 FPS
* **+12.7 FPS, ~+8.4%**

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
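The attach-and-detect mechanism described in this commit can be sketched roughly as follows. This is a minimal illustration assuming pydantic v2 is installed; `Response`, `attach_fast_payload`, `get_fast_payload`, and `post_process` are hypothetical stand-ins rather than the names used in the actual code, and a plain `object()` stands in for `sv.Detections`.

```python
from typing import Any, List, Optional

from pydantic import BaseModel, Field

FAST_ATTR = "_sv_detections_fast"  # marker key named in the commit message


class Response(BaseModel):
    """Hypothetical stand-in for the real response model."""

    predictions: List[dict] = Field(default_factory=list)


def attach_fast_payload(response: Response, payload: Any) -> None:
    # Write straight into the instance __dict__: the key is not a declared
    # field, so pydantic's serializers skip it.
    response.__dict__[FAST_ATTR] = payload


def get_fast_payload(response: Response) -> Optional[Any]:
    # Consumer side: None means "marker absent, use the polygon path".
    return response.__dict__.get(FAST_ATTR)


def post_process(response: Response) -> str:
    if get_fast_payload(response) is not None:
        return "fast"      # would reuse the pre-built detections as-is
    return "fallback"      # would run model_dump + polygon->mask conversion


workflow_resp = Response()
attach_fast_payload(workflow_resp, object())  # object() stands in for sv.Detections
http_resp = Response()

print(workflow_resp.model_dump())   # only declared fields appear
print(post_process(workflow_resp), post_process(http_resp))
```

Keeping the marker out of the declared fields means the response schema is untouched; the trade-off is that the payload is invisible to anything that copies the model through its fields, which is why the consumer needs a fallback for the marker-absent case.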
The previous fast path handed raw GPU masks straight to `sv.Detections`, skipping the `masks2poly` → `polygon_to_mask` round-trip that the slow path ran. That round-trip has two behavioral side-effects the fast path was inadvertently dropping:

1. Largest-component-only: `findContours(RETR_EXTERNAL)` + picking the contour with the most vertices drops disconnected mask fragments.
2. Hole-filling: `RETR_EXTERNAL` ignores inner contours, so `fillPoly(largest_contour)` fills any holes inside the shape.

Plus `filter_out_invalid_polygons` + the `>= 3` vertex check in `supervision.process_roboflow_result` drop detections whose largest contour has fewer than 3 points.

This change reproduces the slow-path mask semantics inside `_build_workflow_fastpath_response` by running the same `findContours(RETR_EXTERNAL, CHAIN_APPROX_SIMPLE)` + `fillPoly` per mask, and dropping detections whose largest contour has fewer than 3 vertices. It also factors the shared attr name into `SV_DETECTIONS_FAST_ATTR` in `inference/core/entities/responses/inference.py`.

Verified bit-exact mask equality vs the slow path on synthetic masks with disconnected fragments and interior holes.

Benchmark on a T4 with the full Triton preproc + full-postproc + CUDA-graphs stack, streaming vehicles_312px.mp4 (538 frames) via `InferencePipeline`:

* baseline (no fast path): mean 152.33 FPS
* fast path WITHOUT denoising (wrong): mean 164.51 FPS (+12.2, +8.0%)
* **fast path WITH denoising (this change): mean 163.43 FPS (+11.1, +7.3%)**

Denoising costs ~1 FPS (~0.7%) because both paths run the same `findContours + fillPoly`; the fast path still eliminates pydantic validation for Point/InstanceSegmentationPrediction, `model_dump`, and the second rasterization inside `sv.Detections.from_inference` → `polygon_to_mask`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
When the instance-segmentation adapter is invoked from a workflow, the result goes through a pure-overhead encoding round-trip: mask → polygon (`masks2poly`) → validated pydantic `InstanceSegmentationPrediction` → `model_dump` → `sv.Detections.from_inference` → mask (`polygon_to_mask`).

Nothing between the adapter output and the `sv.Detections` sink observes the polygon form, yet we pay polygon extraction + pydantic validation + polygon→mask rasterization per frame.

This change short-circuits the round-trip when `request.source == "workflow-execution"`:

* The adapter builds `sv.Detections` directly from the GPU-derived numpy arrays and attaches it via `response.__dict__["_sv_detections_fast"]`. Pydantic v2 ignores extra `__dict__` keys in `model_dump`/`jsonable_encoder`, so HTTP callers are unaffected.
* The v3 block detects the attached `sv.Detections` and routes through `_post_process_result_fast`, which attaches `detection_id`/`parent_id`/`image_dimensions`/`inference_id` directly onto the pre-built `sv.Detections` and skips `model_dump` + `convert_inference_detections_batch_to_sv_detections` entirely.

Falls back to the existing polygon path whenever the marker is absent (HTTP responses, RLE responses via `response_mask_format=rle`, non-tensor masks, mixed-source batches).

Benchmark
rfdetr-seg-nano (TRT) + Triton preproc + full-Triton postproc + CUDA graphs, `vehicles_312px.mp4` (538 frames) via `InferencePipeline`, T4 GPU:

* baseline (4 runs): mean 151.80 FPS
* this change (4 runs): mean 164.51 FPS

+12.7 FPS, ~+8.4%. Same flags, same 538-frame window.
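As a quick sanity check on the reported gain, the two mean FPS figures can be converted to per-frame latency. The sketch below is plain arithmetic on the numbers reported for this benchmark (151.80 and 164.51 FPS); `fps_to_ms` is a helper introduced only for illustration.

```python
def fps_to_ms(fps: float) -> float:
    """Mean per-frame latency in milliseconds for a given throughput."""
    return 1000.0 / fps


baseline_fps = 151.80   # reported baseline mean
fast_fps = 164.51       # reported mean with this change

gain_fps = fast_fps - baseline_fps
gain_pct = 100.0 * gain_fps / baseline_fps
saved_ms = fps_to_ms(baseline_fps) - fps_to_ms(fast_fps)

print(f"+{gain_fps:.1f} FPS, +{gain_pct:.1f}%, {saved_ms:.2f} ms saved per frame")
# prints "+12.7 FPS, +8.4%, 0.51 ms saved per frame"
```

Read this way, the skipped round-trip accounts for roughly half a millisecond per frame on this setup.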
Test plan
* `pytest tests/workflows/unit_tests/core_steps/models/roboflow/instance_segmentation/test_v3.py`: 23/23 pass
* […] (`RFDETR_USE_TRITON_PREPROC=true RFDETR_TRITON_FULLPOSTPROC=true ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true`)
* […] `_sv_detections_fast` private attr through pydantic serialization)
* […] `response_mask_format=rle`) to confirm it still goes through the original polygon/RLE branch

🤖 Generated with Claude Code