
fix: abort stuck SDK tool state on force-complete; add reflect loop worker timeout#526

Merged
PureWeen merged 12 commits into main from shneuvil/fix-stuck-worker-dispatch-and-orchestrator-keepalive on Apr 6, 2026
Conversation


PureWeen (Owner) commented Apr 6, 2026

Summary

Two bugs were observed when a worker got stuck on a bash tool call during a Reflect loop run: the orchestrator sat idle for 33+ minutes until the server's session idle-timeout killed it, leaving the loop hanging with no recovery.

Bug 1 — ForceCompleteProcessingAsync doesn't clear SDK pending tool state

When the pre-dispatch 150s timeout fires on a worker with an active tool call, ForceCompleteProcessingAsync resets PolyPilot's tracking state (ActiveToolCallCount, IsProcessing, etc.) but never calls AbortAsync on the CLI session. The CLI is still blocked waiting for tool results that will never arrive. The next SendAsync succeeds at the transport level but the CLI silently drops it — the message is queued behind the stuck tool and never processed, so SendPromptAndWaitAsync hangs indefinitely.

Fix: capture hadActiveTool before clearing ActiveToolCallCount, then call session.AbortAsync() after the UI-thread cleanup when true. This mirrors the RESUME-QUIESCE abort in CopilotService.Persistence.cs (line 496). AbortAsync failure is non-fatal and logged.
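A minimal sketch of the fix, using hypothetical stand-in types (`CliSession`, `Worker`, `ForceComplete` are illustrations, not PolyPilot's real classes). Only `hadActiveTool`, `ActiveToolCallCount`, `IsProcessing`, and `AbortAsync` come from the PR text; the 15-second bound on the abort comes from review finding #1 below.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the CLI session; the real type differs.
class CliSession
{
    public bool Aborted;
    public Task AbortAsync(CancellationToken ct) { Aborted = true; return Task.CompletedTask; }
}

// Hypothetical stand-in for per-worker tracking state.
class Worker
{
    public int ActiveToolCallCount;
    public bool IsProcessing;
    public CliSession Session = new();
}

static class ForceComplete
{
    public static async Task RunAsync(Worker w)
    {
        // Capture BEFORE clearing tracking state, so we still know whether
        // the CLI session is blocked on an unfinished tool call.
        bool hadActiveTool = w.ActiveToolCallCount > 0;

        w.ActiveToolCallCount = 0;
        w.IsProcessing = false;
        // ... UI-thread cleanup would happen here ...

        if (hadActiveTool)
        {
            try
            {
                // Bounded abort: don't let AbortAsync itself hang on the
                // same stuck session (15s CTS, per review finding #1).
                using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(15));
                await w.Session.AbortAsync(cts.Token);
            }
            catch (Exception ex)
            {
                // Non-fatal: log and continue; tracking state is already reset.
                Console.Error.WriteLine($"AbortAsync failed (non-fatal): {ex.Message}");
            }
        }
    }
}

class Program
{
    static async Task Main()
    {
        var w = new Worker { ActiveToolCallCount = 1, IsProcessing = true };
        await ForceComplete.RunAsync(w);
        Console.WriteLine(w.Session.Aborted); // True: stuck tool state was aborted
    }
}
```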

Bug 2 — Reflect loop has no collection timeout on worker wait

SendViaOrchestratorAsync (non-reflect path) wraps Task.WhenAll(workerTasks) with OrchestratorCollectionTimeout (15 min) and force-completes stuck workers. SendViaOrchestratorReflectAsync had no equivalent — a single stuck worker blocked the entire reflect loop indefinitely. This starved the orchestrator's CLI session of activity until the server killed it (~35-min idle timeout), and left the loop hanging with no recovery path for hours.

Fix: apply the same bounded-wait + force-complete pattern to the reflect loop's worker dispatch. Exact mirror of the non-reflect path.
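The bounded-wait shape can be sketched as follows. This is an illustration, not the real `SendViaOrchestratorReflectAsync`: `Task.WhenAll` and the collection-timeout concept come from the PR, while `WaitBoundedAsync` and its return convention are invented here for clarity.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class BoundedWait
{
    // Wait for all worker tasks, but never longer than the collection
    // timeout; return the tasks that are still stuck so the caller can
    // force-complete those workers and proceed with partial results.
    public static async Task<Task[]> WaitBoundedAsync(Task[] workerTasks, TimeSpan timeout)
    {
        var all = Task.WhenAll(workerTasks);
        var winner = await Task.WhenAny(all, Task.Delay(timeout));
        if (winner != all)
        {
            // Timed out: some workers are still running.
            return workerTasks.Where(t => !t.IsCompleted).ToArray();
        }
        return Array.Empty<Task>();
    }
}

class Program
{
    static async Task Main()
    {
        var fast = Task.Delay(10);
        var stuck = new TaskCompletionSource().Task; // never completes
        var stillStuck = await BoundedWait.WaitBoundedAsync(
            new[] { fast, stuck }, TimeSpan.FromMilliseconds(200));
        Console.WriteLine(stillStuck.Length); // 1 — the stuck worker, to be force-completed
    }
}
```

The key point is that the timeout bounds the wait without cancelling healthy workers: finished tasks are collected normally, and only the stragglers are handed to the force-complete path.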

Test plan

  • Worker gets stuck mid-tool during a Reflect loop → loop force-completes after OrchestratorCollectionTimeout (15 min) and continues to the synthesis phase with partial results
  • Pre-dispatch 150s timeout fires on a worker with an active tool call → AbortAsync is called; next dispatch to that worker succeeds (not silently dropped)
  • AbortAsync failure is logged and non-fatal; dispatch continues normally

🤖 Generated with Claude Code

fix: abort stuck SDK tool state on force-complete; add reflect loop worker timeout

Two bugs observed when worker-2 got stuck on a bash tool call (wc -l that
hung) during a Reflect loop run, causing the orchestrator to be idle for
33+ minutes until the server's session idle-timeout killed it:

**Bug 1 — ForceCompleteProcessingAsync doesn't clear SDK pending tool state**
When the pre-dispatch 150s timeout fires on a worker with an active tool
call, ForceCompleteProcessingAsync resets PolyPilot's tracking state but
never calls AbortAsync on the CLI session. The CLI is still blocked waiting
for tool results that will never arrive. The next SendAsync succeeds at the
transport level but the CLI silently drops it — the message is queued behind
the stuck tool and never processed.

Fix: capture hadActiveTool before clearing ActiveToolCallCount, then call
session.AbortAsync() after the UI-thread cleanup when true. This is the same
approach the RESUME-QUIESCE path in CopilotService.Persistence.cs uses.
AbortAsync failure is non-fatal and logged.

**Bug 2 — Reflect loop has no collection timeout on worker wait**
SendViaOrchestratorAsync (non-reflect) wraps Task.WhenAll(workerTasks) with
OrchestratorCollectionTimeout (15 min) and force-completes stuck workers.
SendViaOrchestratorReflectAsync had no equivalent — a single stuck worker
blocked the entire loop indefinitely, starving the orchestrator's CLI session
of activity until the server killed it (~35 min idle timeout), leaving the
reflect loop hanging with no recovery for hours.

Fix: apply the same bounded-wait + force-complete pattern to the reflect
loop's worker dispatch. Mirrors the non-reflect path exactly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PureWeen commented Apr 6, 2026

🔍 Multi-Model Code Review — PR #526 (Re-review #5 — Updated)

PR: fix: abort stuck SDK tool state on force-complete; add reflect loop worker timeout
Branch: shneuvil/fix-stuck-worker-dispatch-and-orchestrator-keepalive
Tests: ✅ 3294 passed, 0 failed (local dotnet test)
CI: ⚠️ No checks reported on the branch (no CI configured for this branch)


Status of All Findings

  1. Unbounded AbortAsync in ForceCompleteProcessingAsync: ✅ FIXED — bounded with 15s CTS
  2. MergeSessionEntries dropping cross-group sessions without recovery proof: ✅ FIXED — RecoveredFromSessionId guard
  3. Worker names lost as "unknown" during timeout collection: ✅ FIXED — indexed loop preserves assignments[i].WorkerName
  4. Replay dedup too broad (cross-turn suppression): ✅ FIXED — now same-turn only via LastFlushedResponseSegment
  5. ClearFlushedReplayDedup cross-thread race on ARM64: ✅ FIXED — d662403f6 converts both fields to volatile, orders writes safely

Latest Fix: d662403f6

The sole remaining finding from re-review #5 — unanimously flagged by all 3 reviewers — was that LastFlushedResponseSegment and FlushedReplayDedupArmed were plain auto-properties written from the SDK background thread (ToolExecutionStartEvent, AssistantTurnStartEvent) but read on the UI thread. On ARM64 (Apple Silicon), the weak memory model could cause stale reads leading to silent content loss.

Fix applied:

  • Converted both fields from auto-properties to volatile fields (matching the existing HasUsedToolsThisTurn pattern on the adjacent line)
  • Reordered ClearFlushedReplayDedup to clear armed before segment — concurrent readers check armed first, so seeing false short-circuits safely even if segment has not been cleared yet
  • All 3294 tests pass after the change

Test Coverage

Comprehensive coverage across the PR:

  • Same-turn replay dedup (flush + complete, multi-paragraph)
  • Cross-turn identical replies (must persist)
  • Same-turn after tool boundary (must persist after ClearFlushedReplayDedup)
  • Streaming duplicate suppression, draft preservation (source-level)
  • RecoveredFromSessionId merge logic (4 edge cases)
  • Bounded abort + worker name preservation (structural)
  • Shell zombie timeout (updated + new tests)

Recommendation

Approve — All findings from all review rounds have been addressed. The PR is ready to merge.

PureWeen and others added 11 commits April 6, 2026 09:46
…ious) duplicates

When LoadOrganization's scattered team reconstruction moves sessions to a
new multi-agent group, the sessions get recreated with new SessionIds.
On the next save+restore, MergeSessionEntries saw a name collision between
the old persisted entry (old GroupId) and the new active entry (new GroupId)
and renamed the old one to '(previous)' — creating confusing phantom workers.

Fix: when a name collision occurs and the persisted entry has a DIFFERENT
GroupId than the active entry, silently drop the persisted entry. The session
was moved between groups and its history was already migrated via
CopyEventsToNewSession. The '(previous)' rename is preserved for same-group
collisions (reconnect, lazy-resume) where the old entry may have unique history.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
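The collision rule this commit describes can be sketched with simplified records; `Entry` and `ResolveCollision` are hypothetical names, and only the GroupId comparison and the "(previous)" rename come from the commit message.

```csharp
using System;

// Simplified stand-in for a persisted/active session entry.
record Entry(string Name, string GroupId);

static class Merge
{
    // Returns the persisted entry to keep (possibly renamed), or null to drop it.
    public static Entry? ResolveCollision(Entry persisted, Entry active)
        => persisted.GroupId != active.GroupId
            // Cross-group collision: the session was moved and its history
            // already migrated, so drop the stale persisted entry silently.
            ? null
            // Same-group collision (reconnect, lazy-resume): the old entry
            // may hold unique history, so keep it under a "(previous)" name.
            : persisted with { Name = persisted.Name + " (previous)" };
}

class Program
{
    static void Main()
    {
        var crossGroup = Merge.ResolveCollision(new("worker-2", "g1"), new("worker-2", "g2"));
        Console.WriteLine(crossGroup is null); // True — no phantom "(previous)" worker

        var sameGroup = Merge.ResolveCollision(new("worker-2", "g1"), new("worker-2", "g1"));
        Console.WriteLine(sameGroup!.Name);    // worker-2 (previous)
    }
}
```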
Stale/orphaned shells reported by the CLI in session.idle backgroundTasks
were never expired, causing sessions to stay stuck in IDLE-DEFER until
the user manually hit Stop. This happened when the CLI reported shells
as 'active' that were actually dead or detached.

Apply the same SubagentZombieTimeoutMinutes (20 min) expiry to shells
that already applies to agents. After the threshold, both are treated
as zombies and the session is allowed to complete.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
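The expiry check itself reduces to a simple threshold comparison; this sketch uses a hypothetical `IsZombie` helper, with only the 20-minute `SubagentZombieTimeoutMinutes` default taken from the commit message.

```csharp
using System;

static class ZombieExpiry
{
    // Same expiry rule for shells as for agents: a background task that has
    // been "active" longer than the zombie threshold is treated as dead, so
    // the session can complete instead of staying stuck in IDLE-DEFER.
    public static bool IsZombie(DateTime startedUtc, DateTime nowUtc, int timeoutMinutes = 20)
        => (nowUtc - startedUtc) >= TimeSpan.FromMinutes(timeoutMinutes);
}

class Program
{
    static void Main()
    {
        var now = DateTime.UtcNow;
        Console.WriteLine(ZombieExpiry.IsZombie(now.AddMinutes(-25), now)); // True  — expired shell
        Console.WriteLine(ZombieExpiry.IsZombie(now.AddMinutes(-5), now));  // False — still live
    }
}
```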
FlushCurrentResponse (called on TurnEnd and IDLE-DEFER) already has a dedup
guard, but CompleteResponse did not. When IDLE-DEFER flushes text to History
and then the session receives the same content again (SDK replay after
reconnect, IDLE-DEFER re-arm), CompleteResponse would add an identical
second message. Now it checks against the last assistant message before
adding, matching the same pattern used in FlushCurrentResponse.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
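The guard being mirrored from FlushCurrentResponse amounts to a compare-against-last-assistant-message check before appending. This is a minimal sketch with an invented history representation, not the real CompleteResponse.

```csharp
using System;
using System.Collections.Generic;

static class Dedup
{
    // Only append if the text differs from the most recent assistant
    // message already in History (replayed content is dropped).
    public static void CompleteResponse(List<(string Role, string Text)> history, string text)
    {
        for (int i = history.Count - 1; i >= 0; i--)
        {
            if (history[i].Role == "assistant")
            {
                if (history[i].Text == text) return; // replayed duplicate — drop
                break;
            }
        }
        history.Add(("assistant", text));
    }
}

class Program
{
    static void Main()
    {
        var history = new List<(string Role, string Text)>();
        Dedup.CompleteResponse(history, "done.");
        Dedup.CompleteResponse(history, "done."); // SDK replay after IDLE-DEFER flush
        Console.WriteLine(history.Count); // 1 — no duplicate message
    }
}
```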
FlushCurrentResponse was resetting HasReceivedDeltasThisTurn=false after
each mid-turn flush. This allowed the next AssistantMessageEvent (which
contains the full turn content) to slip through the delta guard and
re-append content already flushed to History — producing visible duplicate
messages in the chat.

The flag should only be reset at turn boundaries (AssistantTurnStartEvent),
not on every flush. Removing the mid-turn reset prevents the full-message
event from re-adding already-flushed content.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
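A toy model of the delta guard shows why the mid-turn reset was the bug; `TurnState` and its method names are invented, while `HasReceivedDeltasThisTurn`, the event names, and the flush semantics come from the commit message.

```csharp
using System;
using System.Collections.Generic;

// Minimal model: HasReceivedDeltasThisTurn must only reset at turn
// boundaries, not on mid-turn flushes, or the full-message
// AssistantMessageEvent re-appends already-flushed text.
class TurnState
{
    public bool HasReceivedDeltasThisTurn;
    public List<string> History = new();

    public void OnTurnStart() => HasReceivedDeltasThisTurn = false; // reset HERE only
    public void OnDelta() => HasReceivedDeltasThisTurn = true;

    public void FlushMidTurn(string text)
    {
        History.Add(text);
        // BUG (removed by the fix): HasReceivedDeltasThisTurn = false;
    }

    public void OnFullMessage(string text)
    {
        if (HasReceivedDeltasThisTurn) return; // deltas already streamed this content
        History.Add(text);
    }
}

class Program
{
    static void Main()
    {
        var s = new TurnState();
        s.OnTurnStart();
        s.OnDelta();
        s.FlushMidTurn("hello");  // IDLE-DEFER flush mid-turn
        s.OnFullMessage("hello"); // full-turn event would duplicate without the fix
        Console.WriteLine(s.History.Count); // 1 — no duplicate
    }
}
```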
…nd runtime evidence

Implementer Step 3: explicitly require writing/updating tests for new
behavior (not just running existing ones), and verify observable runtime
output directly rather than trusting tests alone.

Challenger Step 4: remove "when possible" hedge on runtime validation,
add explicit prohibition on approving UI features based on unit tests
alone, and require workers to state when they cannot verify something
at runtime rather than silently omitting the gap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Scope assistant replay dedup to the current turn so legitimate repeated replies still persist, and give background shells a longer zombie window with regression coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ClearFlushedReplayDedup is called on the SDK background thread
(ToolExecutionStartEvent, AssistantTurnStartEvent) while
WasResponseAlreadyFlushedThisTurn reads the same fields on the UI
thread. On ARM64 (Apple Silicon), the weak memory model can cause
stale reads that silently drop legitimate assistant content.

Convert LastFlushedResponseSegment and FlushedReplayDedupArmed from
plain auto-properties to volatile fields, matching the existing
HasUsedToolsThisTurn pattern. Clear armed before segment in
ClearFlushedReplayDedup so concurrent readers short-circuit safely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
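The pattern from this commit can be sketched as follows; `ReplayDedupState` and `Arm` are illustrative stand-ins, while the field names, the volatile conversion, and the clear-armed-before-segment ordering come from the commit message.

```csharp
using System;

// Volatile fields make SDK-thread writes visible to UI-thread reads on
// weak memory models (ARM64); clearing `armed` before `segment` means a
// reader that sees armed == false short-circuits without ever touching a
// half-cleared segment.
class ReplayDedupState
{
    private volatile string? _lastFlushedResponseSegment;
    private volatile bool _flushedReplayDedupArmed;

    public void Arm(string segment)
    {
        _lastFlushedResponseSegment = segment;
        _flushedReplayDedupArmed = true; // arm last, after the segment is set
    }

    // Called on the SDK background thread (tool start / turn start).
    public void ClearFlushedReplayDedup()
    {
        _flushedReplayDedupArmed = false;   // disarm first
        _lastFlushedResponseSegment = null; // then clear the segment
    }

    // Called on the UI thread.
    public bool WasResponseAlreadyFlushedThisTurn(string text)
        => _flushedReplayDedupArmed && _lastFlushedResponseSegment == text;
}

class Program
{
    static void Main()
    {
        var s = new ReplayDedupState();
        s.Arm("partial reply");
        Console.WriteLine(s.WasResponseAlreadyFlushedThisTurn("partial reply")); // True
        s.ClearFlushedReplayDedup();
        Console.WriteLine(s.WasResponseAlreadyFlushedThisTurn("partial reply")); // False
    }
}
```

Note that `volatile` alone guarantees visibility and ordering of these two individual writes, not atomicity of the pair; the ordering is what makes a torn observation harmless here.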
@PureWeen PureWeen merged commit 4af7581 into main Apr 6, 2026
@PureWeen PureWeen deleted the shneuvil/fix-stuck-worker-dispatch-and-orchestrator-keepalive branch April 6, 2026 20:08
PureWeen added a commit that referenced this pull request Apr 6, 2026
PR #526 extracted SubagentDeferStartedAtTicks stamping into
RefreshDeferredBackgroundTaskTracking() and changed the case pattern
to use a named variable. Update the source-scanning tests:
- Remove trailing colon from case pattern search (handler uses
  pattern variable now)
- Accept RefreshDeferredBackgroundTaskTracking call as valid
  stamping mechanism (uses Interlocked.Exchange internally)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>