fix(tracing): make SGP processor stateless to stop dropping span closes#354
Merged
Conversation
In multi-replica Temporal deployments, START_SPAN and END_SPAN run as separate Temporal activities and can land on different worker pods. The SGP processor previously stored started SGPSpans in a per-process dict; on END the lookup would miss on the foreign pod and the close upsert was silently dropped, leaving spans stuck on "Running" in the SGP UI. Drop the _spans cache entirely. Both on_spans_start and on_spans_end build a fresh SGPSpan from the agentex Span fields; SGP's upsert_batch is idempotent on span_id, so re-sending start fields on close is safe. Side benefit: also eliminates a START-pod memory leak that the Slack-proposed fall-back variant would have left in place (dict entries on pod A are never reclaimed when END lands on pod B). Reported by TASMU (Government of Qatar). Confirmed on agentex-sdk 0.9.6 and 0.11.0.
- Extract _build_sgp_span and _add_source_to_span to module-level helpers so the sync and async processors share a single implementation, removing byte-for-byte duplication that risked drifting between the two classes. - Strengthen test_sgp_span_input_and_output_propagated_on_end to actually assert the updated input/output reach create_span on the end call, matching what the test name and docstring promise.
declan-scale
approved these changes
May 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
START_SPANandEND_SPANare dispatched as separate Temporal activities (adk/_modules/tracing.py:184and:228). Temporal activity routing has no stickiness across the two, so they can land on different worker pods.SGPAsyncTracingProcessorstored startedSGPSpanobjects in a per-processself._spansdict. Whenon_spans_endran on a pod that never saw theSTART, the lookup returnedNone, a warning was logged, and the close upsert was silently skipped. Spans appeared stuck on "Running" in the SGP UI forever.SGPSyncTracingProcessorpath; reported by TASMU (Government of Qatar), confirmed onagentex-sdk0.9.6 and 0.11.0.Fix
Drop the
_spanscache entirely on both sync and async processors. Bothon_spans_startandon_spans_endbuild a freshSGPSpanfrom the incoming agentexSpan. SGP'supsert_batchis idempotent onspan_id, so re-sending the start fields on close is safe.This mirrors the design of
AgentexAsyncTracingProcessor, which is already stateless.Why "fully stateless" instead of the originally-proposed cache-with-fallback
The Slack-proposed fix (keep the dict, fall back to reconstruction on cache miss) would have fixed the silent close drop but left a memory leak on the START pod: in multi-replica deployments, the dict entry on pod A is never reclaimed when
ENDlands on pod B. Going fully stateless fixes both with one change, at the cost of one extracreate_span(...)call peron_spans_end— negligible vs. the HTTP round-trip.Throughput impact on SGP
upsert_batchcalls per span as before.Tests
test_span_end_for_unknown_span_is_noop(which codified the bug as expected behavior) withtest_span_end_without_prior_start_still_{flushes,upserts}— asserts the close IS sent on a pod that never saw the start._spans-dict assertions withtest_processor_holds_no_per_span_stateand lifecycle-call-count assertions.MemoryLeaktest coverage retained (renamed and refactored).Test plan
rye run pytest tests/lib/core/tracing— 32 passed, 2 skippedrye run ruff checkon changed files — cleanrye run pyrighton changed files — cleanGreptile Summary
_spansdict from bothSGPSyncTracingProcessorandSGPAsyncTracingProcessor, fixing a bug where span closes were silently dropped in multi-replica Temporal deployments when theENDactivity lands on a different pod thanSTART._build_sgp_spanhelper that reconstructs a freshSGPSpanon every call (both start and end), relying on SGP's idempotentupsert_batchonspan_idto safely re-send start-time fields on close.on_span_endwithout a prioron_span_startstill correctly flushes and upserts.Confidence Score: 5/5
This PR is safe to merge; the fix is well-scoped, stateless design is already used elsewhere in the codebase, and the test suite covers the new behavior thoroughly.
No P0 or P1 issues found. The root cause (stale in-process cache incompatible with multi-replica routing) is correctly diagnosed and fixed with minimal blast radius. SGP's idempotent upsert semantics make the double-write safe. Throughput analysis in the PR description confirms no regression for single-replica deployments.
No files require special attention.
Important Files Changed
_spansdict from both processors, extracts_build_sgp_spanas a stateless module-level helper, and simplifiesshutdown()topass(async) /flush_queue()(sync); logic is clean and the stateless design mirrors the existingAgentexAsyncTracingProcessorpattern._spans-dict assertions (which encoded the bug as expected behavior) with stateless behavior tests; cross-pod scenario tests correctly initialise the mock'sstart_time/end_timetoNoneso the is-not-None assertions are meaningful.Sequence Diagram
sequenceDiagram participant TA as Temporal Activity START participant TB as Temporal Activity END participant P as SGP Processor stateless participant SGP as SGP Backend Note over TA,TB: Multi-replica Temporal deployment TA->>P: on_spans_start([span]) P->>P: _build_sgp_span sets start_time P->>SGP: upsert_batch span Running Note over TB: END routed to different worker pod TB->>P: on_spans_end([span]) P->>P: _build_sgp_span fresh build with start_time P->>P: sgp_span.end_time set P->>SGP: upsert_batch span Completed Note over SGP: span_id is idempotent keyReviews (2): Last reviewed commit: "fix(tracing): address greptile review fi..." | Re-trigger Greptile