
fix: abort stuck SDK tool state on force-complete; add reflect loop worker timeout#526

Merged
PureWeen merged 12 commits into main from shneuvil/fix-stuck-worker-dispatch-and-orchestrator-keepalive on Apr 6, 2026
Conversation


PureWeen (Owner) commented Apr 6, 2026

Summary

Two bugs were observed when a worker got stuck on a bash tool call during a Reflect loop run: the orchestrator sat idle for 33+ minutes until the server's session idle-timeout killed it, leaving the loop hanging with no recovery.

Bug 1 — ForceCompleteProcessingAsync doesn't clear SDK pending tool state

When the pre-dispatch 150s timeout fires on a worker with an active tool call, ForceCompleteProcessingAsync resets PolyPilot's tracking state (ActiveToolCallCount, IsProcessing, etc.) but never calls AbortAsync on the CLI session. The CLI is still blocked waiting for tool results that will never arrive. The next SendAsync succeeds at the transport level but the CLI silently drops it — the message is queued behind the stuck tool and never processed, so SendPromptAndWaitAsync hangs indefinitely.

Fix: capture hadActiveTool before clearing ActiveToolCallCount, then call session.AbortAsync() after the UI-thread cleanup when true. This mirrors the RESUME-QUIESCE abort in CopilotService.Persistence.cs (line 496). AbortAsync failure is non-fatal and logged.
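A minimal sketch of the fix, using hypothetical stand-in types (`CliSession`, `Worker`, `ForceComplete` are illustrations, not PolyPilot's real classes). Only `hadActiveTool`, `ActiveToolCallCount`, `IsProcessing`, and `AbortAsync` come from the PR text; the 15-second bound on the abort comes from review finding #1 below.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the CLI session; the real type differs.
class CliSession
{
    public bool Aborted;
    public Task AbortAsync(CancellationToken ct) { Aborted = true; return Task.CompletedTask; }
}

// Hypothetical stand-in for per-worker tracking state.
class Worker
{
    public int ActiveToolCallCount;
    public bool IsProcessing;
    public CliSession Session = new();
}

static class ForceComplete
{
    public static async Task RunAsync(Worker w)
    {
        // Capture BEFORE clearing tracking state, so we still know whether
        // the CLI session is blocked on an unfinished tool call.
        bool hadActiveTool = w.ActiveToolCallCount > 0;

        w.ActiveToolCallCount = 0;
        w.IsProcessing = false;
        // ... UI-thread cleanup would happen here ...

        if (hadActiveTool)
        {
            try
            {
                // Bounded abort: don't let AbortAsync itself hang on the
                // same stuck session (15s CTS, per review finding #1).
                using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(15));
                await w.Session.AbortAsync(cts.Token);
            }
            catch (Exception ex)
            {
                // Non-fatal: log and continue; tracking state is already reset.
                Console.Error.WriteLine($"AbortAsync failed (non-fatal): {ex.Message}");
            }
        }
    }
}

class Program
{
    static async Task Main()
    {
        var w = new Worker { ActiveToolCallCount = 1, IsProcessing = true };
        await ForceComplete.RunAsync(w);
        Console.WriteLine(w.Session.Aborted); // True: stuck tool state was aborted
    }
}
```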

Bug 2 — Reflect loop has no collection timeout on worker wait

SendViaOrchestratorAsync (non-reflect path) wraps Task.WhenAll(workerTasks) with OrchestratorCollectionTimeout (15 min) and force-completes stuck workers. SendViaOrchestratorReflectAsync had no equivalent — a single stuck worker blocked the entire reflect loop indefinitely. This starved the orchestrator's CLI session of activity until the server killed it (~35-min idle timeout), and left the loop hanging with no recovery path for hours.

Fix: apply the same bounded-wait + force-complete pattern to the reflect loop's worker dispatch. Exact mirror of the non-reflect path.
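The bounded-wait shape can be sketched as follows. This is an illustration, not the real `SendViaOrchestratorReflectAsync`: `Task.WhenAll` and the collection-timeout concept come from the PR, while `WaitBoundedAsync` and its return convention are invented here for clarity.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class BoundedWait
{
    // Wait for all worker tasks, but never longer than the collection
    // timeout; return the tasks that are still stuck so the caller can
    // force-complete those workers and proceed with partial results.
    public static async Task<Task[]> WaitBoundedAsync(Task[] workerTasks, TimeSpan timeout)
    {
        var all = Task.WhenAll(workerTasks);
        var winner = await Task.WhenAny(all, Task.Delay(timeout));
        if (winner != all)
        {
            // Timed out: some workers are still running.
            return workerTasks.Where(t => !t.IsCompleted).ToArray();
        }
        return Array.Empty<Task>();
    }
}

class Program
{
    static async Task Main()
    {
        var fast = Task.Delay(10);
        var stuck = new TaskCompletionSource().Task; // never completes
        var stillStuck = await BoundedWait.WaitBoundedAsync(
            new[] { fast, stuck }, TimeSpan.FromMilliseconds(200));
        Console.WriteLine(stillStuck.Length); // 1 — the stuck worker, to be force-completed
    }
}
```

The key point is that the timeout bounds the wait without cancelling healthy workers: finished tasks are collected normally, and only the stragglers are handed to the force-complete path.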

Test plan

  • Worker gets stuck mid-tool during a Reflect loop → loop force-completes after OrchestratorCollectionTimeout (15 min) and continues to the synthesis phase with partial results
  • Pre-dispatch 150s timeout fires on a worker with an active tool call → AbortAsync is called; next dispatch to that worker succeeds (not silently dropped)
  • AbortAsync failure is logged and non-fatal; dispatch continues normally

🤖 Generated with Claude Code

fix: abort stuck SDK tool state on force-complete; add reflect loop worker timeout

Two bugs observed when worker-2 got stuck on a bash tool call (wc -l that
hung) during a Reflect loop run, causing the orchestrator to be idle for
33+ minutes until the server's session idle-timeout killed it:

**Bug 1 — ForceCompleteProcessingAsync doesn't clear SDK pending tool state**
When the pre-dispatch 150s timeout fires on a worker with an active tool
call, ForceCompleteProcessingAsync resets PolyPilot's tracking state but
never calls AbortAsync on the CLI session. The CLI is still blocked waiting
for tool results that will never arrive. The next SendAsync succeeds at the
transport level but the CLI silently drops it — the message is queued behind
the stuck tool and never processed.

Fix: capture hadActiveTool before clearing ActiveToolCallCount, then call
session.AbortAsync() after the UI-thread cleanup when true. This is the same
approach the RESUME-QUIESCE path in CopilotService.Persistence.cs uses.
AbortAsync failure is non-fatal and logged.

**Bug 2 — Reflect loop has no collection timeout on worker wait**
SendViaOrchestratorAsync (non-reflect) wraps Task.WhenAll(workerTasks) with
OrchestratorCollectionTimeout (15 min) and force-completes stuck workers.
SendViaOrchestratorReflectAsync had no equivalent — a single stuck worker
blocked the entire loop indefinitely, starving the orchestrator's CLI session
of activity until the server killed it (~35 min idle timeout), leaving the
reflect loop hanging with no recovery for hours.

Fix: apply the same bounded-wait + force-complete pattern to the reflect
loop's worker dispatch. Mirrors the non-reflect path exactly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PureWeen commented Apr 6, 2026

🔍 Multi-Model Code Review — PR #526 (Re-review #5 — Updated)

PR: fix: abort stuck SDK tool state on force-complete; add reflect loop worker timeout
Branch: shneuvil/fix-stuck-worker-dispatch-and-orchestrator-keepalive
Tests: ✅ 3294 passed, 0 failed (local dotnet test)
CI: ⚠️ No checks reported on the branch (no CI configured for this branch)


Status of All Findings

  1. Unbounded AbortAsync in ForceCompleteProcessingAsync: ✅ FIXED — bounded with 15s CTS
  2. MergeSessionEntries dropping cross-group sessions without recovery proof: ✅ FIXED — RecoveredFromSessionId guard
  3. Worker names lost as "unknown" during timeout collection: ✅ FIXED — indexed loop preserves assignments[i].WorkerName
  4. Replay dedup too broad (cross-turn suppression): ✅ FIXED — now same-turn only via LastFlushedResponseSegment
  5. ClearFlushedReplayDedup cross-thread race on ARM64: ✅ FIXED — d662403f6 converts both fields to volatile, orders writes safely

Latest Fix: d662403f6

The sole remaining finding from re-review #5 — unanimously flagged by all 3 reviewers — was that LastFlushedResponseSegment and FlushedReplayDedupArmed were plain auto-properties written from the SDK background thread (ToolExecutionStartEvent, AssistantTurnStartEvent) but read on the UI thread. On ARM64 (Apple Silicon), the weak memory model could cause stale reads leading to silent content loss.

Fix applied:

  • Converted both fields from auto-properties to volatile fields (matching the existing HasUsedToolsThisTurn pattern on the adjacent line)
  • Reordered ClearFlushedReplayDedup to clear armed before segment — concurrent readers check armed first, so seeing false short-circuits safely even if segment has not been cleared yet
  • All 3294 tests pass after the change

Test Coverage

Comprehensive coverage across the PR:

  • Same-turn replay dedup (flush + complete, multi-paragraph)
  • Cross-turn identical replies (must persist)
  • Same-turn after tool boundary (must persist after ClearFlushedReplayDedup)
  • Streaming duplicate suppression, draft preservation (source-level)
  • RecoveredFromSessionId merge logic (4 edge cases)
  • Bounded abort + worker name preservation (structural)
  • Shell zombie timeout (updated + new tests)

Recommendation

Approve — All findings from all review rounds have been addressed. The PR is ready to merge.

PureWeen and others added 11 commits April 6, 2026 09:46
…ious) duplicates

When LoadOrganization's scattered team reconstruction moves sessions to a
new multi-agent group, the sessions get recreated with new SessionIds.
On the next save+restore, MergeSessionEntries saw a name collision between
the old persisted entry (old GroupId) and the new active entry (new GroupId)
and renamed the old one to '(previous)' — creating confusing phantom workers.

Fix: when a name collision occurs and the persisted entry has a DIFFERENT
GroupId than the active entry, silently drop the persisted entry. The session
was moved between groups and its history was already migrated via
CopyEventsToNewSession. The '(previous)' rename is preserved for same-group
collisions (reconnect, lazy-resume) where the old entry may have unique history.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
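The collision rule this commit describes can be sketched with simplified records; `Entry` and `ResolveCollision` are hypothetical names, and only the GroupId comparison and the "(previous)" rename come from the commit message.

```csharp
using System;

// Simplified stand-in for a persisted/active session entry.
record Entry(string Name, string GroupId);

static class Merge
{
    // Returns the persisted entry to keep (possibly renamed), or null to drop it.
    public static Entry? ResolveCollision(Entry persisted, Entry active)
        => persisted.GroupId != active.GroupId
            // Cross-group collision: the session was moved and its history
            // already migrated, so drop the stale persisted entry silently.
            ? null
            // Same-group collision (reconnect, lazy-resume): the old entry
            // may hold unique history, so keep it under a "(previous)" name.
            : persisted with { Name = persisted.Name + " (previous)" };
}

class Program
{
    static void Main()
    {
        var crossGroup = Merge.ResolveCollision(new("worker-2", "g1"), new("worker-2", "g2"));
        Console.WriteLine(crossGroup is null); // True — no phantom "(previous)" worker

        var sameGroup = Merge.ResolveCollision(new("worker-2", "g1"), new("worker-2", "g1"));
        Console.WriteLine(sameGroup!.Name);    // worker-2 (previous)
    }
}
```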
Stale/orphaned shells reported by the CLI in session.idle backgroundTasks
were never expired, causing sessions to stay stuck in IDLE-DEFER until
the user manually hit Stop. This happened when the CLI reported shells
as 'active' that were actually dead or detached.

Apply the same SubagentZombieTimeoutMinutes (20 min) expiry to shells
that already applies to agents. After the threshold, both are treated
as zombies and the session is allowed to complete.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
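The expiry check itself reduces to a simple threshold comparison; this sketch uses a hypothetical `IsZombie` helper, with only the 20-minute `SubagentZombieTimeoutMinutes` default taken from the commit message.

```csharp
using System;

static class ZombieExpiry
{
    // Same expiry rule for shells as for agents: a background task that has
    // been "active" longer than the zombie threshold is treated as dead, so
    // the session can complete instead of staying stuck in IDLE-DEFER.
    public static bool IsZombie(DateTime startedUtc, DateTime nowUtc, int timeoutMinutes = 20)
        => (nowUtc - startedUtc) >= TimeSpan.FromMinutes(timeoutMinutes);
}

class Program
{
    static void Main()
    {
        var now = DateTime.UtcNow;
        Console.WriteLine(ZombieExpiry.IsZombie(now.AddMinutes(-25), now)); // True  — expired shell
        Console.WriteLine(ZombieExpiry.IsZombie(now.AddMinutes(-5), now));  // False — still live
    }
}
```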
FlushCurrentResponse (called on TurnEnd and IDLE-DEFER) already has a dedup
guard, but CompleteResponse did not. When IDLE-DEFER flushes text to History
and then the session receives the same content again (SDK replay after
reconnect, IDLE-DEFER re-arm), CompleteResponse would add an identical
second message. Now it checks against the last assistant message before
adding, matching the same pattern used in FlushCurrentResponse.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
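The guard being mirrored from FlushCurrentResponse amounts to a compare-against-last-assistant-message check before appending. This is a minimal sketch with an invented history representation, not the real CompleteResponse.

```csharp
using System;
using System.Collections.Generic;

static class Dedup
{
    // Only append if the text differs from the most recent assistant
    // message already in History (replayed content is dropped).
    public static void CompleteResponse(List<(string Role, string Text)> history, string text)
    {
        for (int i = history.Count - 1; i >= 0; i--)
        {
            if (history[i].Role == "assistant")
            {
                if (history[i].Text == text) return; // replayed duplicate — drop
                break;
            }
        }
        history.Add(("assistant", text));
    }
}

class Program
{
    static void Main()
    {
        var history = new List<(string Role, string Text)>();
        Dedup.CompleteResponse(history, "done.");
        Dedup.CompleteResponse(history, "done."); // SDK replay after IDLE-DEFER flush
        Console.WriteLine(history.Count); // 1 — no duplicate message
    }
}
```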
FlushCurrentResponse was resetting HasReceivedDeltasThisTurn=false after
each mid-turn flush. This allowed the next AssistantMessageEvent (which
contains the full turn content) to slip through the delta guard and
re-append content already flushed to History — producing visible duplicate
messages in the chat.

The flag should only be reset at turn boundaries (AssistantTurnStartEvent),
not on every flush. Removing the mid-turn reset prevents the full-message
event from re-adding already-flushed content.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
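A toy model of the delta guard shows why the mid-turn reset was the bug; `TurnState` and its method names are invented, while `HasReceivedDeltasThisTurn`, the event names, and the flush semantics come from the commit message.

```csharp
using System;
using System.Collections.Generic;

// Minimal model: HasReceivedDeltasThisTurn must only reset at turn
// boundaries, not on mid-turn flushes, or the full-message
// AssistantMessageEvent re-appends already-flushed text.
class TurnState
{
    public bool HasReceivedDeltasThisTurn;
    public List<string> History = new();

    public void OnTurnStart() => HasReceivedDeltasThisTurn = false; // reset HERE only
    public void OnDelta() => HasReceivedDeltasThisTurn = true;

    public void FlushMidTurn(string text)
    {
        History.Add(text);
        // BUG (removed by the fix): HasReceivedDeltasThisTurn = false;
    }

    public void OnFullMessage(string text)
    {
        if (HasReceivedDeltasThisTurn) return; // deltas already streamed this content
        History.Add(text);
    }
}

class Program
{
    static void Main()
    {
        var s = new TurnState();
        s.OnTurnStart();
        s.OnDelta();
        s.FlushMidTurn("hello");  // IDLE-DEFER flush mid-turn
        s.OnFullMessage("hello"); // full-turn event would duplicate without the fix
        Console.WriteLine(s.History.Count); // 1 — no duplicate
    }
}
```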
…nd runtime evidence

Implementer Step 3: explicitly require writing/updating tests for new
behavior (not just running existing ones), and verify observable runtime
output directly rather than trusting tests alone.

Challenger Step 4: remove "when possible" hedge on runtime validation,
add explicit prohibition on approving UI features based on unit tests
alone, and require workers to state when they cannot verify something
at runtime rather than silently omitting the gap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Scope assistant replay dedup to the current turn so legitimate repeated replies still persist, and give background shells a longer zombie window with regression coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ClearFlushedReplayDedup is called on the SDK background thread
(ToolExecutionStartEvent, AssistantTurnStartEvent) while
WasResponseAlreadyFlushedThisTurn reads the same fields on the UI
thread. On ARM64 (Apple Silicon), the weak memory model can cause
stale reads that silently drop legitimate assistant content.

Convert LastFlushedResponseSegment and FlushedReplayDedupArmed from
plain auto-properties to volatile fields, matching the existing
HasUsedToolsThisTurn pattern. Clear armed before segment in
ClearFlushedReplayDedup so concurrent readers short-circuit safely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
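The pattern from this commit can be sketched as follows; `ReplayDedupState` and `Arm` are illustrative stand-ins, while the field names, the volatile conversion, and the clear-armed-before-segment ordering come from the commit message.

```csharp
using System;

// Volatile fields make SDK-thread writes visible to UI-thread reads on
// weak memory models (ARM64); clearing `armed` before `segment` means a
// reader that sees armed == false short-circuits without ever touching a
// half-cleared segment.
class ReplayDedupState
{
    private volatile string? _lastFlushedResponseSegment;
    private volatile bool _flushedReplayDedupArmed;

    public void Arm(string segment)
    {
        _lastFlushedResponseSegment = segment;
        _flushedReplayDedupArmed = true; // arm last, after the segment is set
    }

    // Called on the SDK background thread (tool start / turn start).
    public void ClearFlushedReplayDedup()
    {
        _flushedReplayDedupArmed = false;   // disarm first
        _lastFlushedResponseSegment = null; // then clear the segment
    }

    // Called on the UI thread.
    public bool WasResponseAlreadyFlushedThisTurn(string text)
        => _flushedReplayDedupArmed && _lastFlushedResponseSegment == text;
}

class Program
{
    static void Main()
    {
        var s = new ReplayDedupState();
        s.Arm("partial reply");
        Console.WriteLine(s.WasResponseAlreadyFlushedThisTurn("partial reply")); // True
        s.ClearFlushedReplayDedup();
        Console.WriteLine(s.WasResponseAlreadyFlushedThisTurn("partial reply")); // False
    }
}
```

Note that `volatile` alone guarantees visibility and ordering of these two individual writes, not atomicity of the pair; the ordering is what makes a torn observation harmless here.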
@PureWeen PureWeen merged commit 4af7581 into main Apr 6, 2026
@PureWeen PureWeen deleted the shneuvil/fix-stuck-worker-dispatch-and-orchestrator-keepalive branch April 6, 2026 20:08
PureWeen added a commit that referenced this pull request Apr 6, 2026
PR #526 extracted SubagentDeferStartedAtTicks stamping into
RefreshDeferredBackgroundTaskTracking() and changed the case pattern
to use a named variable. Update the source-scanning tests:
- Remove trailing colon from case pattern search (handler uses
  pattern variable now)
- Accept RefreshDeferredBackgroundTaskTracking call as valid
  stamping mechanism (uses Interlocked.Exchange internally)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>