
fix(tracing): make SGP processor stateless to stop dropping span closes #354

Merged

smoreinis merged 2 commits into next from fix/sgp-tracing-stateless-span-end on May 11, 2026

Conversation

smoreinis (Contributor) commented May 11, 2026

Summary

  • In multi-replica Temporal deployments, START_SPAN and END_SPAN are dispatched as separate Temporal activities (adk/_modules/tracing.py:184 and :228). Temporal activity routing has no stickiness across the two, so they can land on different worker pods.
  • SGPAsyncTracingProcessor stored started SGPSpan objects in a per-process self._spans dict. When on_spans_end ran on a pod that never saw the START, the lookup returned None, a warning was logged, and the close upsert was silently skipped. Spans appeared stuck on "Running" in the SGP UI forever.
  • The bug is also present on the sync SGPSyncTracingProcessor path; reported by TASMU (Government of Qatar), and confirmed on agentex-sdk 0.9.6 and 0.11.0.
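The failure mode above can be reproduced with a minimal sketch. This is not the real processor code; class and field names are simplified stand-ins for illustration only:

```python
# Sketch of the pre-fix failure mode (hypothetical, simplified names).
# Each worker pod runs its own processor instance, so self._spans is
# per-process state that is invisible to other pods.

class StatefulProcessor:
    def __init__(self):
        self._spans = {}  # span_id -> started span, local to this pod

    def on_spans_start(self, spans):
        for span in spans:
            self._spans[span.id] = {"id": span.id, "status": "Running"}

    def on_spans_end(self, spans):
        closed = []
        for span in spans:
            started = self._spans.pop(span.id, None)
            if started is None:
                # END landed on a pod that never saw START: before the
                # fix, the close upsert was silently skipped here.
                continue
            started["status"] = "Completed"
            closed.append(started)
        return closed


class Span:  # stand-in for the agentex Span
    def __init__(self, id):
        self.id = id


# Two pods, as in a multi-replica Temporal deployment:
pod_a, pod_b = StatefulProcessor(), StatefulProcessor()
span = Span("abc-123")

pod_a.on_spans_start([span])       # START routed to pod A
print(pod_b.on_spans_end([span]))  # END routed to pod B -> prints [] (close dropped)
# Note: pod_a._spans still holds the entry forever -- the memory leak
# that the cache-with-fallback variant would also have had.
```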

Fix

Drop the _spans cache entirely on both sync and async processors. Both on_spans_start and on_spans_end build a fresh SGPSpan from the incoming agentex Span. SGP's upsert_batch is idempotent on span_id, so re-sending the start fields on close is safe.

This mirrors the design of AgentexAsyncTracingProcessor, which is already stateless.
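A minimal sketch of the stateless shape (field and class names are hypothetical simplifications, not the actual SDK API): both lifecycle hooks rebuild the outgoing span from the incoming agentex Span, so no per-process cache exists to miss.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class SgpSpanDto:
    """Simplified stand-in for SGPSpan."""
    span_id: str
    name: str
    start_time: Optional[str] = None
    end_time: Optional[str] = None


def build_sgp_span(span) -> SgpSpanDto:
    # Fresh build on every call; safe because upsert_batch is assumed
    # idempotent on span_id, so re-sending start fields on close is harmless.
    return SgpSpanDto(span_id=span.id, name=span.name, start_time=span.start_time)


class StatelessProcessor:
    def __init__(self, client):
        self.client = client  # stand-in for the SGP client

    def on_spans_start(self, spans):
        self.client.upsert_batch([build_sgp_span(s) for s in spans])

    def on_spans_end(self, spans):
        batch = []
        for s in spans:
            sgp_span = build_sgp_span(s)  # works even with no prior START on this pod
            sgp_span.end_time = s.end_time
            batch.append(sgp_span)
        self.client.upsert_batch(batch)


class FakeClient:
    def __init__(self):
        self.batches = []

    def upsert_batch(self, batch):
        self.batches.append(batch)


# A "fresh pod" that never saw on_spans_start still sends the close:
span = SimpleNamespace(id="abc-123", name="root", start_time="t0", end_time="t1")
fresh_pod = StatelessProcessor(FakeClient())
fresh_pod.on_spans_end([span])
```

The one extra field reconstruction per `on_spans_end` is the entire cost of going stateless; the HTTP round-trip dominates either way.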

Why "fully stateless" instead of the originally-proposed cache-with-fallback

The Slack-proposed fix (keep the dict, fall back to reconstruction on cache miss) would have fixed the silent close drop but left a memory leak on the START pod: in multi-replica deployments, the dict entry on pod A is never reclaimed when END lands on pod B. Going fully stateless fixes both with one change, at the cost of one extra create_span(...) call per on_spans_end — negligible vs. the HTTP round-trip.

Throughput impact on SGP

  • Single-replica deployments (incl. all sync/async-ACP non-Temporal agents): no change. Two upsert_batch calls per span as before.
  • Multi-replica Temporal deployments: same 2 calls per span. The closes that were previously silently dropped are now actually sent — i.e., load returns to the original design intent. Not a degradation.
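The "same 2 calls per span" accounting relies on the upsert merging both writes into one record. A toy model of idempotent upsert-by-span_id (an assumption about SGP semantics as described in this PR, not SGP's actual implementation):

```python
# Toy idempotent upsert: rows are keyed on span_id, so the END write
# merges into the START write's record instead of duplicating it.
store = {}


def upsert_batch(batch):
    for row in batch:
        store.setdefault(row["span_id"], {}).update(row)


# START call: sends start-time fields.
upsert_batch([{"span_id": "abc", "name": "root", "start_time": "t0"}])

# END call: re-sends the start fields plus end_time (safe under idempotency).
upsert_batch([{"span_id": "abc", "name": "root", "start_time": "t0",
               "end_time": "t1"}])

# One record, fully populated -- two calls per span, exactly as before the fix.
```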

Tests

  • Replaced test_span_end_for_unknown_span_is_noop (which codified the bug as expected behavior) with test_span_end_without_prior_start_still_{flushes,upserts} — asserts the close IS sent on a pod that never saw the start.
  • Replaced _spans-dict assertions with test_processor_holds_no_per_span_state and lifecycle-call-count assertions.
  • All other batching, propagation, and memory-leak test coverage is retained (renamed and refactored).
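The cross-pod scenario test can be sketched as follows. This is a simplified stand-in, not the actual suite: the processor stub and assertions here are hypothetical, while the real tests live in tests/lib/core/tracing/processors/test_sgp_tracing_processor.py:

```python
import unittest
from unittest.mock import MagicMock


class StatelessProcessor:
    """Hypothetical simplified stand-in for the processor under test:
    on_spans_end rebuilds and upserts regardless of any prior start."""

    def __init__(self, client):
        self.client = client

    def on_spans_end(self, spans):
        batch = [{"span_id": s.id, "end_time": s.end_time} for s in spans]
        self.client.upsert_batch(batch)


class TestCrossPodClose(unittest.TestCase):
    def test_span_end_without_prior_start_still_upserts(self):
        client = MagicMock()
        # A brand-new processor plays the pod that never saw START.
        processor = StatelessProcessor(client)
        span = MagicMock(id="abc-123", end_time="2026-05-11T18:00:00Z")

        processor.on_spans_end([span])

        # The close IS sent, not silently skipped.
        client.upsert_batch.assert_called_once()
        (batch,), _ = client.upsert_batch.call_args
        self.assertEqual(batch[0]["span_id"], "abc-123")
        self.assertIsNotNone(batch[0]["end_time"])
```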

Test plan

  • rye run pytest tests/lib/core/tracing — 32 passed, 2 skipped
  • rye run ruff check on changed files — clean
  • rye run pyright on changed files — clean
  • Verify in TASMU's Council of Ministers env that root spans no longer get stuck on "Running"
  • Confirm no regression in the dev SGP dashboard with a Temporal agent at 1 replica

Greptile Summary

  • Removes the per-process _spans dict from both SGPSyncTracingProcessor and SGPAsyncTracingProcessor, fixing a bug where span closes were silently dropped in multi-replica Temporal deployments when the END activity lands on a different pod than START.
  • Introduces a module-level _build_sgp_span helper that reconstructs a fresh SGPSpan on every call (both start and end), relying on SGP's idempotent upsert_batch on span_id to safely re-send start-time fields on close.
  • Tests are updated to replace the stale no-op assertions (which codified the bug) with cross-pod scenario tests verifying that on_span_end without a prior on_span_start still correctly flushes and upserts.

Confidence Score: 5/5

This PR is safe to merge; the fix is well-scoped, the stateless design is already used elsewhere in the codebase, and the test suite covers the new behavior thoroughly.

No P0 or P1 issues found. The root cause (stale in-process cache incompatible with multi-replica routing) is correctly diagnosed and fixed with minimal blast radius. SGP's idempotent upsert semantics make the double-write safe. Throughput analysis in the PR description confirms no regression for single-replica deployments.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py | Drops the _spans dict from both processors, extracts _build_sgp_span as a stateless module-level helper, and simplifies shutdown() to pass (async) / flush_queue() (sync); the logic is clean and the stateless design mirrors the existing AgentexAsyncTracingProcessor pattern. |
| tests/lib/core/tracing/processors/test_sgp_tracing_processor.py | Replaces _spans-dict assertions (which encoded the bug as expected behavior) with stateless behavior tests; cross-pod scenario tests correctly initialise the mock's start_time/end_time to None so the is-not-None assertions are meaningful. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant TA as Temporal Activity START
    participant TB as Temporal Activity END
    participant P as SGP Processor stateless
    participant SGP as SGP Backend

    Note over TA,TB: Multi-replica Temporal deployment

    TA->>P: on_spans_start([span])
    P->>P: _build_sgp_span sets start_time
    P->>SGP: upsert_batch span Running

    Note over TB: END routed to different worker pod

    TB->>P: on_spans_end([span])
    P->>P: _build_sgp_span fresh build with start_time
    P->>P: sgp_span.end_time set
    P->>SGP: upsert_batch span Completed

    Note over SGP: span_id is idempotent key
```

Reviews (2): last reviewed commit "fix(tracing): address greptile review fi..."

In multi-replica Temporal deployments, START_SPAN and END_SPAN run as
separate Temporal activities and can land on different worker pods.
The SGP processor previously stored started SGPSpans in a per-process
dict; on END the lookup would miss on the foreign pod and the close
upsert was silently dropped, leaving spans stuck on "Running" in the
SGP UI.

Drop the _spans cache entirely. Both on_spans_start and on_spans_end
build a fresh SGPSpan from the agentex Span fields; SGP's upsert_batch
is idempotent on span_id, so re-sending start fields on close is safe.

Side benefit: also eliminates a START-pod memory leak that the
Slack-proposed fall-back variant would have left in place (dict entries
on pod A are never reclaimed when END lands on pod B).

Reported by TASMU (Government of Qatar). Confirmed on agentex-sdk 0.9.6
and 0.11.0.
smoreinis requested a review from a team on May 11, 2026 18:17
- Extract _build_sgp_span and _add_source_to_span to module-level helpers
  so the sync and async processors share a single implementation, removing
  byte-for-byte duplication that risked drifting between the two classes.
- Strengthen test_sgp_span_input_and_output_propagated_on_end to actually
  assert the updated input/output reach create_span on the end call,
  matching what the test name and docstring promise.
smoreinis merged commit 5e9f28d into next on May 11, 2026
36 checks passed
smoreinis deleted the fix/sgp-tracing-stateless-span-end branch on May 11, 2026 21:23
stainless-app (bot) mentioned this pull request on May 11, 2026