
fix(tracing): make SGP processor stateless to stop dropping span closes #354

Merged

smoreinis merged 2 commits into next from fix/sgp-tracing-stateless-span-end on May 11, 2026

Conversation

smoreinis (Contributor) commented May 11, 2026

Summary

  • In multi-replica Temporal deployments, START_SPAN and END_SPAN are dispatched as separate Temporal activities (adk/_modules/tracing.py:184 and :228). Temporal activity routing has no stickiness across the two, so they can land on different worker pods.
  • SGPAsyncTracingProcessor stored started SGPSpan objects in a per-process self._spans dict. When on_spans_end ran on a pod that never saw the START, the lookup returned None, a warning was logged, and the close upsert was silently skipped. Spans appeared stuck on "Running" in the SGP UI forever.
  • The bug is also present on the sync SGPSyncTracingProcessor path; reported by TASMU (Government of Qatar), and confirmed on agentex-sdk 0.9.6 and 0.11.0.
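The failure mode above can be reproduced with a minimal sketch. This is not the real processor code; class and field names are simplified stand-ins for illustration only:

```python
# Sketch of the pre-fix failure mode (hypothetical, simplified names).
# Each worker pod runs its own processor instance, so self._spans is
# per-process state that is invisible to other pods.

class StatefulProcessor:
    def __init__(self):
        self._spans = {}  # span_id -> started span, local to this pod

    def on_spans_start(self, spans):
        for span in spans:
            self._spans[span.id] = {"id": span.id, "status": "Running"}

    def on_spans_end(self, spans):
        closed = []
        for span in spans:
            started = self._spans.pop(span.id, None)
            if started is None:
                # END landed on a pod that never saw START: before the
                # fix, the close upsert was silently skipped here.
                continue
            started["status"] = "Completed"
            closed.append(started)
        return closed


class Span:  # stand-in for the agentex Span
    def __init__(self, id):
        self.id = id


# Two pods, as in a multi-replica Temporal deployment:
pod_a, pod_b = StatefulProcessor(), StatefulProcessor()
span = Span("abc-123")

pod_a.on_spans_start([span])       # START routed to pod A
print(pod_b.on_spans_end([span]))  # END routed to pod B -> prints [] (close dropped)
# Note: pod_a._spans still holds the entry forever -- the memory leak
# that the cache-with-fallback variant would also have had.
```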

Fix

Drop the _spans cache entirely on both sync and async processors. Both on_spans_start and on_spans_end build a fresh SGPSpan from the incoming agentex Span. SGP's upsert_batch is idempotent on span_id, so re-sending the start fields on close is safe.

This mirrors the design of AgentexAsyncTracingProcessor, which is already stateless.
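A minimal sketch of the stateless shape (field and class names are hypothetical simplifications, not the actual SDK API): both lifecycle hooks rebuild the outgoing span from the incoming agentex Span, so no per-process cache exists to miss.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class SgpSpanDto:
    """Simplified stand-in for SGPSpan."""
    span_id: str
    name: str
    start_time: Optional[str] = None
    end_time: Optional[str] = None


def build_sgp_span(span) -> SgpSpanDto:
    # Fresh build on every call; safe because upsert_batch is assumed
    # idempotent on span_id, so re-sending start fields on close is harmless.
    return SgpSpanDto(span_id=span.id, name=span.name, start_time=span.start_time)


class StatelessProcessor:
    def __init__(self, client):
        self.client = client  # stand-in for the SGP client

    def on_spans_start(self, spans):
        self.client.upsert_batch([build_sgp_span(s) for s in spans])

    def on_spans_end(self, spans):
        batch = []
        for s in spans:
            sgp_span = build_sgp_span(s)  # works even with no prior START on this pod
            sgp_span.end_time = s.end_time
            batch.append(sgp_span)
        self.client.upsert_batch(batch)


class FakeClient:
    def __init__(self):
        self.batches = []

    def upsert_batch(self, batch):
        self.batches.append(batch)


# A "fresh pod" that never saw on_spans_start still sends the close:
span = SimpleNamespace(id="abc-123", name="root", start_time="t0", end_time="t1")
fresh_pod = StatelessProcessor(FakeClient())
fresh_pod.on_spans_end([span])
```

The one extra field reconstruction per `on_spans_end` is the entire cost of going stateless; the HTTP round-trip dominates either way.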

Why "fully stateless" instead of the originally-proposed cache-with-fallback

The Slack-proposed fix (keep the dict, fall back to reconstruction on cache miss) would have fixed the silent close drop but left a memory leak on the START pod: in multi-replica deployments, the dict entry on pod A is never reclaimed when END lands on pod B. Going fully stateless fixes both with one change, at the cost of one extra create_span(...) call per on_spans_end — negligible vs. the HTTP round-trip.

Throughput impact on SGP

  • Single-replica deployments (incl. all sync/async-ACP non-Temporal agents): no change. Two upsert_batch calls per span as before.
  • Multi-replica Temporal deployments: same 2 calls per span. The closes that were previously silently dropped are now actually sent — i.e., load returns to the original design intent. Not a degradation.
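The "same 2 calls per span" accounting relies on the upsert merging both writes into one record. A toy model of idempotent upsert-by-span_id (an assumption about SGP semantics as described in this PR, not SGP's actual implementation):

```python
# Toy idempotent upsert: rows are keyed on span_id, so the END write
# merges into the START write's record instead of duplicating it.
store = {}


def upsert_batch(batch):
    for row in batch:
        store.setdefault(row["span_id"], {}).update(row)


# START call: sends start-time fields.
upsert_batch([{"span_id": "abc", "name": "root", "start_time": "t0"}])

# END call: re-sends the start fields plus end_time (safe under idempotency).
upsert_batch([{"span_id": "abc", "name": "root", "start_time": "t0",
               "end_time": "t1"}])

# One record, fully populated -- two calls per span, exactly as before the fix.
```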

Tests

  • Replaced test_span_end_for_unknown_span_is_noop (which codified the bug as expected behavior) with test_span_end_without_prior_start_still_{flushes,upserts} — asserts the close IS sent on a pod that never saw the start.
  • Replaced _spans-dict assertions with test_processor_holds_no_per_span_state and lifecycle-call-count assertions.
  • All other batching, propagation, and memory-leak test coverage is retained (renamed and refactored).
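The cross-pod scenario test can be sketched as follows. This is a simplified stand-in, not the actual suite: the processor stub and assertions here are hypothetical, while the real tests live in tests/lib/core/tracing/processors/test_sgp_tracing_processor.py:

```python
import unittest
from unittest.mock import MagicMock


class StatelessProcessor:
    """Hypothetical simplified stand-in for the processor under test:
    on_spans_end rebuilds and upserts regardless of any prior start."""

    def __init__(self, client):
        self.client = client

    def on_spans_end(self, spans):
        batch = [{"span_id": s.id, "end_time": s.end_time} for s in spans]
        self.client.upsert_batch(batch)


class TestCrossPodClose(unittest.TestCase):
    def test_span_end_without_prior_start_still_upserts(self):
        client = MagicMock()
        # A brand-new processor plays the pod that never saw START.
        processor = StatelessProcessor(client)
        span = MagicMock(id="abc-123", end_time="2026-05-11T18:00:00Z")

        processor.on_spans_end([span])

        # The close IS sent, not silently skipped.
        client.upsert_batch.assert_called_once()
        (batch,), _ = client.upsert_batch.call_args
        self.assertEqual(batch[0]["span_id"], "abc-123")
        self.assertIsNotNone(batch[0]["end_time"])
```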

Test plan

  • rye run pytest tests/lib/core/tracing — 32 passed, 2 skipped
  • rye run ruff check on changed files — clean
  • rye run pyright on changed files — clean
  • Verify in TASMU's Council of Ministers env that root spans no longer get stuck on "Running"
  • Confirm no regression in the dev SGP dashboard with a Temporal agent at 1 replica

Greptile Summary

  • Removes the per-process _spans dict from both SGPSyncTracingProcessor and SGPAsyncTracingProcessor, fixing a bug where span closes were silently dropped in multi-replica Temporal deployments when the END activity lands on a different pod than START.
  • Introduces a module-level _build_sgp_span helper that reconstructs a fresh SGPSpan on every call (both start and end), relying on SGP's idempotent upsert_batch on span_id to safely re-send start-time fields on close.
  • Tests are updated to replace the stale no-op assertions (which codified the bug) with cross-pod scenario tests verifying that on_span_end without a prior on_span_start still correctly flushes and upserts.

Confidence Score: 5/5

This PR is safe to merge; the fix is well-scoped, the stateless design is already used elsewhere in the codebase, and the test suite covers the new behavior thoroughly.

No P0 or P1 issues found. The root cause (stale in-process cache incompatible with multi-replica routing) is correctly diagnosed and fixed with minimal blast radius. SGP's idempotent upsert semantics make the double-write safe. Throughput analysis in the PR description confirms no regression for single-replica deployments.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py | Drops the _spans dict from both processors, extracts _build_sgp_span as a stateless module-level helper, and simplifies shutdown() to pass (async) / flush_queue() (sync); the logic is clean and the stateless design mirrors the existing AgentexAsyncTracingProcessor pattern. |
| tests/lib/core/tracing/processors/test_sgp_tracing_processor.py | Replaces _spans-dict assertions (which encoded the bug as expected behavior) with stateless behavior tests; cross-pod scenario tests correctly initialise the mock's start_time/end_time to None so the is-not-None assertions are meaningful. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant TA as Temporal Activity START
    participant TB as Temporal Activity END
    participant P as SGP Processor stateless
    participant SGP as SGP Backend

    Note over TA,TB: Multi-replica Temporal deployment

    TA->>P: on_spans_start([span])
    P->>P: _build_sgp_span sets start_time
    P->>SGP: upsert_batch span Running

    Note over TB: END routed to different worker pod

    TB->>P: on_spans_end([span])
    P->>P: _build_sgp_span fresh build with start_time
    P->>P: sgp_span.end_time set
    P->>SGP: upsert_batch span Completed

    Note over SGP: span_id is idempotent key
```

Reviews (2): last reviewed commit "fix(tracing): address greptile review fi..."

In multi-replica Temporal deployments, START_SPAN and END_SPAN run as
separate Temporal activities and can land on different worker pods.
The SGP processor previously stored started SGPSpans in a per-process
dict; on END the lookup would miss on the foreign pod and the close
upsert was silently dropped, leaving spans stuck on "Running" in the
SGP UI.

Drop the _spans cache entirely. Both on_spans_start and on_spans_end
build a fresh SGPSpan from the agentex Span fields; SGP's upsert_batch
is idempotent on span_id, so re-sending start fields on close is safe.

Side benefit: also eliminates a START-pod memory leak that the
Slack-proposed fall-back variant would have left in place (dict entries
on pod A are never reclaimed when END lands on pod B).

Reported by TASMU (Government of Qatar). Confirmed on agentex-sdk 0.9.6
and 0.11.0.
smoreinis requested a review from a team on May 11, 2026 18:17
- Extract _build_sgp_span and _add_source_to_span to module-level helpers
  so the sync and async processors share a single implementation, removing
  byte-for-byte duplication that risked drifting between the two classes.
- Strengthen test_sgp_span_input_and_output_propagated_on_end to actually
  assert the updated input/output reach create_span on the end call,
  matching what the test name and docstring promise.
smoreinis merged commit 5e9f28d into next on May 11, 2026
36 checks passed
smoreinis deleted the fix/sgp-tracing-stateless-span-end branch on May 11, 2026 21:23
stainless-app (bot) mentioned this pull request on May 11, 2026