…orker timeout

Two bugs observed when worker-2 got stuck on a bash tool call (a `wc -l` that hung) during a Reflect loop run, causing the orchestrator to be idle for 33+ minutes until the server's session idle-timeout killed it:

**Bug 1 — ForceCompleteProcessingAsync doesn't clear SDK pending tool state**

When the pre-dispatch 150s timeout fires on a worker with an active tool call, ForceCompleteProcessingAsync resets PolyPilot's tracking state but never calls AbortAsync on the CLI session. The CLI is still blocked waiting for tool results that will never arrive. The next SendAsync succeeds at the transport level, but the CLI silently drops it — the message is queued behind the stuck tool and never processed.

Fix: capture hadActiveTool before clearing ActiveToolCallCount, then call session.AbortAsync() after the UI-thread cleanup when true. This is the same approach the RESUME-QUIESCE path in CopilotService.Persistence.cs uses. AbortAsync failure is non-fatal and logged.

**Bug 2 — Reflect loop has no collection timeout on worker wait**

SendViaOrchestratorAsync (non-reflect) wraps Task.WhenAll(workerTasks) with OrchestratorCollectionTimeout (15 min) and force-completes stuck workers. SendViaOrchestratorReflectAsync had no equivalent — a single stuck worker blocked the entire loop indefinitely, starving the orchestrator's CLI session of activity until the server killed it (~35 min idle timeout) and leaving the reflect loop hanging with no recovery for hours.

Fix: apply the same bounded-wait + force-complete pattern to the reflect loop's worker dispatch. Mirrors the non-reflect path exactly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
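The force-complete sequence described above can be sketched as follows. This is a Python asyncio illustration of the control flow, not the project's C#: `WorkerState`, `session.abort()`, and the snake_case field names are stand-ins for `ActiveToolCallCount`, `AbortAsync()`, and the real tracking state.

```python
import asyncio

class WorkerState:
    """Minimal stand-in for PolyPilot's per-worker tracking state."""
    def __init__(self, session):
        self.session = session
        self.active_tool_call_count = 0   # stand-in for ActiveToolCallCount
        self.is_processing = False

async def force_complete_processing(worker):
    # Capture whether a tool call was in flight BEFORE clearing the
    # tracking state; once the counter is reset, the knowledge that the
    # CLI session is still blocked on a tool is lost.
    had_active_tool = worker.active_tool_call_count > 0

    # Reset PolyPilot-side tracking (the part the old code already did).
    worker.active_tool_call_count = 0
    worker.is_processing = False

    # New step: also abort the CLI session so it stops waiting for tool
    # results that will never arrive. Failure is non-fatal and logged.
    if had_active_tool:
        try:
            await worker.session.abort()   # stand-in for AbortAsync()
        except Exception as ex:
            print(f"AbortAsync failed (non-fatal): {ex}")
```

Without the `had_active_tool` capture, clearing the counter first would leave no way to decide whether the abort is needed.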
🔍 Multi-Model Code Review — PR #526 (Re-review #5 — Updated)

PR: fix: abort stuck SDK tool state on force-complete; add reflect loop worker timeout

Status of All Findings
…ious) duplicates

When LoadOrganization's scattered-team reconstruction moves sessions to a new multi-agent group, the sessions get recreated with new SessionIds. On the next save+restore, MergeSessionEntries saw a name collision between the old persisted entry (old GroupId) and the new active entry (new GroupId) and renamed the old one to '(previous)' — creating confusing phantom workers.

Fix: when a name collision occurs and the persisted entry has a DIFFERENT GroupId than the active entry, silently drop the persisted entry. The session was moved between groups and its history was already migrated via CopyEventsToNewSession. The '(previous)' rename is preserved for same-group collisions (reconnect, lazy-resume), where the old entry may have unique history.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
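The collision rule can be sketched like this. A hedged Python illustration of the merge logic, not the C# implementation: the dict shape and `merge_session_entries` name are invented for the example; only the name/GroupId decision mirrors the commit.

```python
def merge_session_entries(persisted, active):
    """Merge persisted session entries into the active list.

    Collision rule from the fix:
      * different GroupId  -> drop the persisted entry (session was
        moved between groups; history already migrated)
      * same GroupId       -> keep it, renamed '(previous)', since the
        old entry may hold unique history (reconnect, lazy-resume)
    """
    active_by_name = {e["name"]: e for e in active}
    merged = list(active)
    for entry in persisted:
        match = active_by_name.get(entry["name"])
        if match is None:
            merged.append(entry)                 # no collision: keep as-is
        elif match["group_id"] != entry["group_id"]:
            continue                             # cross-group: drop silently
        else:
            renamed = dict(entry)
            renamed["name"] += " (previous)"     # same-group: preserve history
            merged.append(renamed)
    return merged
```

The asymmetry is the point of the fix: only same-group collisions can carry history that is not already in the active entry, so only they earn the '(previous)' rename.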
Stale/orphaned shells reported by the CLI in session.idle backgroundTasks were never expired, causing sessions to stay stuck in IDLE-DEFER until the user manually hit Stop. This happened when the CLI reported shells as 'active' that were actually dead or detached.

Fix: apply to shells the same SubagentZombieTimeoutMinutes (20 min) expiry that already applies to agents. After the threshold, both are treated as zombies and the session is allowed to complete.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
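A sketch of the shared expiry check, in Python rather than the project's C#; `SUBAGENT_ZOMBIE_TIMEOUT`, `is_zombie`, and `can_complete_session` are illustrative names for the mechanism the commit describes.

```python
from datetime import datetime, timedelta

# Mirrors SubagentZombieTimeoutMinutes; now applied to shells too.
SUBAGENT_ZOMBIE_TIMEOUT = timedelta(minutes=20)

def is_zombie(started_at, now):
    """A background task (agent OR shell) still reported as active is
    treated as a zombie once it exceeds the shared timeout."""
    return now - started_at > SUBAGENT_ZOMBIE_TIMEOUT

def can_complete_session(task_start_times, now):
    # The session may leave IDLE-DEFER and complete once every
    # remaining "active" background task has aged past the threshold.
    return all(is_zombie(t, now) for t in task_start_times)
```

With this in place a dead shell can delay completion by at most the zombie window, instead of pinning the session in IDLE-DEFER until a manual Stop.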
FlushCurrentResponse (called on TurnEnd and IDLE-DEFER) already has a dedup guard, but CompleteResponse did not. When IDLE-DEFER flushes text to History and then the session receives the same content again (SDK replay after reconnect, IDLE-DEFER re-arm), CompleteResponse would add an identical second message. Now it checks against the last assistant message before adding, matching the same pattern used in FlushCurrentResponse. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
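The guard added to CompleteResponse can be sketched as below. This is a Python stand-in for the C# path: the `history` list of role/text dicts and the `complete_response` name are assumptions; only the "compare against the last assistant message before appending" rule comes from the commit.

```python
def complete_response(history, new_text):
    """Append assistant text to History unless it duplicates the last
    assistant message (the same dedup guard FlushCurrentResponse
    already applies)."""
    last_assistant = next(
        (m for m in reversed(history) if m["role"] == "assistant"), None)
    if last_assistant is not None and last_assistant["text"] == new_text:
        # Already flushed (e.g. SDK replay after reconnect, or an
        # IDLE-DEFER re-arm delivering the same content twice).
        return
    history.append({"role": "assistant", "text": new_text})
```

The check is deliberately limited to the most recent assistant message, so a genuinely repeated reply later in the conversation still persists.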
FlushCurrentResponse was resetting HasReceivedDeltasThisTurn=false after each mid-turn flush. This allowed the next AssistantMessageEvent (which contains the full turn content) to slip through the delta guard and re-append content already flushed to History — producing visible duplicate messages in the chat. The flag should only be reset at turn boundaries (AssistantTurnStartEvent), not on every flush. Removing the mid-turn reset prevents the full-message event from re-adding already-flushed content. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
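The turn-boundary rule for the delta guard can be illustrated as follows; a hedged Python sketch in which `TurnState` and its method names are invented, while the flag semantics (reset only on AssistantTurnStart, never on a mid-turn flush) follow the commit.

```python
class TurnState:
    def __init__(self):
        self.has_received_deltas_this_turn = False

    def on_turn_start(self):
        # The ONLY place the flag resets: a real turn boundary
        # (AssistantTurnStartEvent in the real code).
        self.has_received_deltas_this_turn = False

    def on_delta(self, history, text):
        self.has_received_deltas_this_turn = True
        history.append(text)

    def on_full_assistant_message(self, history, text):
        # The full-message event repeats the whole turn's content.
        # If deltas were already streamed this turn, appending it
        # again would duplicate what is already in History.
        if self.has_received_deltas_this_turn:
            return
        history.append(text)

    def flush_mid_turn(self):
        # Deliberately does NOT reset the flag. The old bug cleared it
        # here, letting the full-message event slip past the guard.
        pass
```

Resetting the flag inside `flush_mid_turn` is exactly the removed bug: the next full-message event would then pass the guard and re-append already-flushed content.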
…nd runtime evidence Implementer Step 3: explicitly require writing/updating tests for new behavior (not just running existing ones), and verify observable runtime output directly rather than trusting tests alone. Challenger Step 4: remove "when possible" hedge on runtime validation, add explicit prohibition on approving UI features based on unit tests alone, and require workers to state when they cannot verify something at runtime rather than silently omitting the gap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scope assistant replay dedup to the current turn so legitimate repeated replies still persist, and give background shells a longer zombie window with regression coverage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ClearFlushedReplayDedup is called on the SDK background thread (ToolExecutionStartEvent, AssistantTurnStartEvent) while WasResponseAlreadyFlushedThisTurn reads the same fields on the UI thread. On ARM64 (Apple Silicon), the weak memory model can cause stale reads that silently drop legitimate assistant content. Convert LastFlushedResponseSegment and FlushedReplayDedupArmed from plain auto-properties to volatile fields, matching the existing HasUsedToolsThisTurn pattern. Clear armed before segment in ClearFlushedReplayDedup so concurrent readers short-circuit safely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
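The write/read ordering invariant can be sketched in Python. Python's GIL gives sequentially consistent semantics, so this only illustrates the invariant the C# volatile fields enforce on ARM64, not the fix itself; `ReplayDedupState` and its method names are stand-ins for the real fields.

```python
class ReplayDedupState:
    def __init__(self):
        self.last_flushed_segment = None   # C#: volatile field
        self.armed = False                 # C#: volatile field

    def arm(self, segment):
        # Writer: publish the segment BEFORE arming, so any reader that
        # observes armed == True also observes a valid segment.
        self.last_flushed_segment = segment
        self.armed = True

    def clear(self):
        # Writer: clear armed FIRST so concurrent readers short-circuit
        # before they could observe a half-cleared segment.
        self.armed = False
        self.last_flushed_segment = None

    def was_already_flushed(self, text):
        # Reader (UI thread): check armed first, then the segment.
        if not self.armed:
            return False
        return self.last_flushed_segment == text
```

With plain (non-volatile) fields on a weak memory model, a reader could see `armed == True` with a stale segment, which is the stale-read path the commit closes.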
PR #526 extracted SubagentDeferStartedAtTicks stamping into RefreshDeferredBackgroundTaskTracking() and changed the case pattern to use a named variable. Update the source-scanning tests:

- Remove trailing colon from case pattern search (handler uses pattern variable now)
- Accept RefreshDeferredBackgroundTaskTracking call as valid stamping mechanism (uses Interlocked.Exchange internally)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Two bugs observed when a worker got stuck on a bash tool call during a Reflect loop run, causing the orchestrator to sit idle for 33+ minutes until the server's session idle-timeout killed it and left the loop hanging with no recovery.
**Bug 1 — `ForceCompleteProcessingAsync` doesn't clear SDK pending tool state**

When the pre-dispatch 150s timeout fires on a worker with an active tool call, `ForceCompleteProcessingAsync` resets PolyPilot's tracking state (`ActiveToolCallCount`, `IsProcessing`, etc.) but never calls `AbortAsync` on the CLI session. The CLI is still blocked waiting for tool results that will never arrive. The next `SendAsync` succeeds at the transport level but the CLI silently drops it — the message is queued behind the stuck tool and never processed, so `SendPromptAndWaitAsync` hangs indefinitely.

Fix: capture `hadActiveTool` before clearing `ActiveToolCallCount`, then call `session.AbortAsync()` after the UI-thread cleanup when true. This mirrors the RESUME-QUIESCE abort in `CopilotService.Persistence.cs` (line 496). `AbortAsync` failure is non-fatal and logged.

**Bug 2 — Reflect loop has no collection timeout on worker wait**
`SendViaOrchestratorAsync` (non-reflect path) wraps `Task.WhenAll(workerTasks)` with `OrchestratorCollectionTimeout` (15 min) and force-completes stuck workers. `SendViaOrchestratorReflectAsync` had no equivalent — a single stuck worker blocked the entire reflect loop indefinitely. This starved the orchestrator's CLI session of activity until the server killed it (~35-min idle timeout), and left the loop hanging with no recovery path for hours.

Fix: apply the same bounded-wait + force-complete pattern to the reflect loop's worker dispatch. Exact mirror of the non-reflect path.
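The bounded-wait + force-complete pattern shared by both paths can be sketched with asyncio. A Python illustration, not the C#: `collect_workers` and the `force_complete` callback are invented names, and `task.cancel()` stands in for `ForceCompleteProcessingAsync`; only the timeout-then-salvage-partial-results shape mirrors the PR.

```python
import asyncio

# Value from the PR description; illustrative here.
ORCHESTRATOR_COLLECTION_TIMEOUT = 15 * 60  # seconds

async def collect_workers(worker_tasks, force_complete,
                          timeout=ORCHESTRATOR_COLLECTION_TIMEOUT):
    """Bounded wait over worker tasks: the asyncio analogue of wrapping
    Task.WhenAll(workerTasks) with a timeout and force-completing any
    worker still pending when the timeout expires."""
    done, pending = await asyncio.wait(worker_tasks, timeout=timeout)
    for task in pending:
        task.cancel()                  # stand-in for ForceCompleteProcessingAsync
        await force_complete(task)
    # Continue to the synthesis phase with whatever arrived in time.
    return [t.result() for t in done
            if not t.cancelled() and t.exception() is None]
```

Because the wait is bounded, one stuck worker can delay the loop by at most the collection timeout instead of starving the orchestrator's session until the server-side idle kill.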
Test plan
- Reflect loop worker wait is bounded by `OrchestratorCollectionTimeout` (15 min) and continues to the synthesis phase with partial results
- `AbortAsync` is called on force-complete; the next dispatch to that worker succeeds (not silently dropped)
- `AbortAsync` failure is logged and non-fatal; dispatch continues normally

🤖 Generated with Claude Code