
perf(rfdetr-seg): skip mask→poly→mask round-trip on workflow path #25

Open
aseembits93 wants to merge 2 commits into main from perf/rfdetr-seg-workflow-fastpath


@aseembits93 (Owner)

Summary

When the instance-segmentation adapter is invoked from a workflow, the result goes through a pure-overhead encoding round-trip:

adapter.postprocess:
    GPU masks → masks2poly (cv2.findContours, N times)
              → List[Point(x,y)] pydantic validation per vertex
              → InstanceSegmentationPrediction (validated)
v3 block.run_locally:
    predictions → model_dump(by_alias=True) per response
                → sv.Detections.from_inference → polygon_to_mask (rasterize AGAIN)

Nothing between the adapter output and the sv.Detections sink observes the polygon form, yet every frame pays for polygon extraction, per-vertex pydantic validation, and polygon→mask rasterization.

This change short-circuits the round-trip when request.source == "workflow-execution":

  • The adapter builds sv.Detections directly from the GPU-derived numpy arrays and attaches it via response.__dict__["_sv_detections_fast"]. Pydantic v2 ignores extra __dict__ keys in model_dump / jsonable_encoder, so HTTP callers are unaffected.
  • The v3 block detects the marker and routes through a new _post_process_result_fast, which attaches detection_id/parent_id/image_dimensions/inference_id directly onto the pre-built sv.Detections and skips model_dump + convert_inference_detections_batch_to_sv_detections entirely.
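A minimal sketch of the marker mechanism, assuming pydantic v2 with the default `extra="ignore"` model config. `FakePrediction`, `attach_fast_payload`, and `get_fast_payload` are hypothetical names for illustration (the real adapter attaches an `sv.Detections`, stubbed here with a plain dict); only the attribute name comes from this PR. Writing into `__dict__` skips validation, and the serializer emits declared fields only, so the key stays out of `model_dump` and JSON output.

```python
from typing import List

from pydantic import BaseModel

# Attribute name used by this PR for the pre-built detections.
SV_DETECTIONS_FAST_ATTR = "_sv_detections_fast"


class FakePrediction(BaseModel):
    """Hypothetical stand-in for a validated prediction response."""

    inference_id: str
    class_ids: List[int]


def attach_fast_payload(response: BaseModel, payload: object) -> None:
    # Write straight into the instance __dict__: no validation, no new field.
    response.__dict__[SV_DETECTIONS_FAST_ATTR] = payload


def get_fast_payload(response: BaseModel):
    # None means the adapter did not take the fast path for this response.
    return response.__dict__.get(SV_DETECTIONS_FAST_ATTR)


response = FakePrediction(inference_id="abc", class_ids=[1, 2])
attach_fast_payload(response, {"xyxy": [[0, 0, 10, 10]]})
print(get_fast_payload(response))
print(response.model_dump())  # declared fields only
```

Reading through `__dict__.get` returns `None` instead of raising when the marker is missing, which is the signal to stay on the polygon path.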

Whenever the marker is absent (HTTP responses, RLE responses via response_mask_format=rle, non-tensor masks, mixed-source batches), the existing polygon path is used unchanged.
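One way to read the fallback rule, as a dependency-free sketch: take the fast path only when every response in a batch carries the marker, and otherwise keep the whole batch on the polygon path. `route_batch` is a hypothetical helper and `SimpleNamespace` stands in for the pydantic response objects; this is not the v3 block's actual code.

```python
from types import SimpleNamespace
from typing import Any, Sequence

# Attribute name used by this PR for the pre-built detections.
SV_DETECTIONS_FAST_ATTR = "_sv_detections_fast"


def route_batch(responses: Sequence[Any]) -> str:
    """Pick the post-processing path for one batch of responses (hypothetical)."""
    has_marker = [SV_DETECTIONS_FAST_ATTR in vars(r) for r in responses]
    if has_marker and all(has_marker):
        return "fast"  # every response carries pre-built detections
    # At least one item lacks the marker (HTTP, RLE, non-tensor masks,
    # mixed sources) or the batch is empty: keep the existing polygon path.
    return "polygon"


workflow_resp = SimpleNamespace(**{SV_DETECTIONS_FAST_ATTR: "prebuilt detections"})
http_resp = SimpleNamespace(predictions=[])
print(route_batch([workflow_resp, workflow_resp]))  # fast
print(route_batch([workflow_resp, http_resp]))  # polygon
```

Deciding per batch rather than per item keeps the sketch simple; a mixed batch just loses the speedup instead of needing two code paths merged afterwards.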

Benchmark

rfdetr-seg-nano (TRT) + Triton preproc + full-Triton postproc + CUDA graphs, vehicles_312px.mp4 (538 frames) via InferencePipeline, T4 GPU (throughput in FPS):

                   Run 1    Run 2    Run 3    Run 4    Mean
baseline (main)    151.96   151.29   151.39   152.54   151.80
this change        165.63   164.09   163.21   165.09   164.51

+12.7 FPS, ~+8.4%. Same flags, same 538-frame window.
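The summary figures can be re-derived from the per-run numbers with the standard library; the values below are copied from the benchmark table.

```python
from statistics import mean

# FPS per run, copied from the benchmark table.
baseline_fps = [151.96, 151.29, 151.39, 152.54]
fast_path_fps = [165.63, 164.09, 163.21, 165.09]


def summarize(baseline, candidate):
    """Return (baseline mean, candidate mean, absolute gain, relative gain in %)."""
    base, cand = mean(baseline), mean(candidate)
    gain = cand - base
    return base, cand, gain, 100.0 * gain / base


base, cand, gain, pct = summarize(baseline_fps, fast_path_fps)
print(f"baseline {base:.2f} FPS, fast path {cand:.2f} FPS, +{gain:.1f} FPS (+{pct:.1f}%)")
```

The exact means are 151.795 and 164.505, so the printed two-decimal values can differ from the table by 0.01 depending on float rounding; the gain (+12.7 FPS, +8.4%) is unaffected.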

Test plan

  • pytest tests/workflows/unit_tests/core_steps/models/roboflow/instance_segmentation/test_v3.py — 23/23 pass
  • Manual benchmark: 4 baseline runs vs 4 optimized runs on the same video with the same env flags (RFDETR_USE_TRITON_PREPROC=true RFDETR_TRITON_FULLPOSTPROC=true ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND=true)
  • Exercise the HTTP path to confirm serialized payloads are byte-identical to main (no leakage of the _sv_detections_fast private attr through pydantic serialization)
  • Exercise RLE response path (response_mask_format=rle) to confirm it still goes through the original polygon/RLE branch

🤖 Generated with Claude Code

claude added 2 commits April 30, 2026 02:59
`InferenceModelsInstanceSegmentationAdapter.postprocess` used to
convert every detection's mask to a polygon via `masks2poly` (cv2
findContours), wrap each vertex in a `Point` pydantic model, and
build a validated `InstanceSegmentationPrediction`. The v3 workflow
block then called `model_dump` and
`sv.Detections.from_inference`, which rasterized those polygons
back into masks via `polygon_to_mask`.

When the caller is a workflow (`request.source == "workflow-execution"`),
none of that encoding is observable — the v3 block consumes an
`sv.Detections` with masks. This change:

* Has the adapter build `sv.Detections` directly from the
  numpy xyxy/confidence/class_id/mask buffers and attach it via
  `response.__dict__["_sv_detections_fast"]` (pydantic v2 ignores
  extra __dict__ keys in dump/serialize, so HTTP payloads are
  unaffected). The polygon+pydantic path is preserved for all
  other callers, including RLE responses.
* Teaches the v3 block to detect the attached `sv.Detections` and
  route through a new `_post_process_result_fast`, skipping
  `model_dump` + `convert_inference_detections_batch_to_sv_detections`
  entirely.

Benchmark on a T4 with rfdetr-seg-nano TRT + Triton preproc +
full-Triton postproc + CUDA graphs, streaming vehicles_312px.mp4
(538 frames) via `InferencePipeline`:

* baseline (4 runs): mean 151.80 FPS
* this change (4 runs): mean 164.51 FPS
* **+12.7 FPS, ~+8.4%**

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous fast path handed raw GPU-derived masks straight to `sv.Detections`,
skipping the `masks2poly` → `polygon_to_mask` round-trip that the slow
path ran. That round-trip has two behavioral side-effects the fast
path was inadvertently dropping:

1. Largest-component-only: `findContours(RETR_EXTERNAL)` + picking the
   contour with the most vertices (which is not necessarily the largest
   by area) drops disconnected mask fragments.
2. Hole-filling: `RETR_EXTERNAL` ignores inner contours, so
   `fillPoly(largest_contour)` fills any holes inside the shape.

In addition, `filter_out_invalid_polygons` and the `>= 3` vertex check in
`supervision.process_roboflow_result` drop detections whose largest
contour has fewer than 3 points.

This change reproduces the slow-path mask semantics inside
`_build_workflow_fastpath_response` by running the same
`findContours(RETR_EXTERNAL, CHAIN_APPROX_SIMPLE)` + `fillPoly`
per mask, and dropping detections whose largest contour has fewer
than 3 vertices. It also factors the shared attr name into
`SV_DETECTIONS_FAST_ATTR` in `inference/core/entities/responses/inference.py`.

Verified bit-exact mask equality vs the slow path on synthetic masks
with disconnected fragments and interior holes.

Benchmark on a T4 with the full Triton preproc + full-postproc +
CUDA-graphs stack, streaming vehicles_312px.mp4 (538 frames) via
InferencePipeline:

* baseline (no fast path):           mean 152.33 FPS
* fast path WITHOUT denoising (wrong): mean 164.51 FPS (+12.2, +8.0%)
* **fast path WITH denoising (this change): mean 163.43 FPS (+11.1, +7.3%)**

Denoising costs ~1 FPS (~0.7%) because the fast path now runs the
same `findContours + fillPoly` pass as the slow path; it still
eliminates pydantic validation for Point/InstanceSegmentationPrediction,
`model_dump`, and the second rasterization inside
`sv.Detections.from_inference` → `polygon_to_mask`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>