Commit 023300c
committed
Fix stdio client shutdown: cancellation deadlock, orphaned children, spurious kills
Four user-facing bugs in the stdio client's shutdown path, all rooted in
the same design problem — the shutdown sequence depended on pipe state
and was not protected from cancellation:
- Cancelling stdio_client (a client timeout, app shutdown) skipped the
entire shutdown sequence: the first await in the finally block
re-raised the pending cancellation, so stdin was never closed
gracefully and the process tree was never terminated. Control then
fell through to anyio's Process.aclose, whose shielded wait only
returns once every pipe has closed - and a pipe inherited by a
grandchild (npx- and uv-style servers always have one) never closes,
so the client deadlocked forever. The shutdown now runs inside a
shielded cancel scope with every wait time-bounded, and never relies
on a pipe-gated wait.
- The grace-period check used process.wait(), which on the asyncio
backend resolves only when the process has exited AND its pipes have
closed. A well-behaved server that exited instantly on stdin closure
but left a background child holding stdout was misclassified as hung,
burned the full grace period, and got its tree terminated with a
spurious warning. The wait now polls returncode, which reflects
process death alone.
- Process-tree termination derived the group ID with os.getpgid(pid),
which fails once the leader has been reaped even while its
descendants are alive — and the fallback then "terminated" the dead
leader, leaking every descendant. Since the process is spawned with
start_new_session=True, the pgid is by definition the leader's pid;
use it directly, and treat ProcessLookupError from killpg as "group
already gone" rather than a reason to fall back.
- Writing to a dead server's stdin surfaced a raw backend exception
(ConnectionResetError inside an exception group) to the caller
instead of a clean closed-stream signal. stdin_writer now treats
BrokenResourceError and ConnectionError like ClosedResourceError.
Windows fixes, by analysis (CI must validate): FallbackProcess.wait()
ran Popen.wait() in a non-cancellable worker thread, so the timeout
around the grace period could never fire and shutdown could hang
indefinitely — it now polls cancellably, and returncode reflects death
without requiring a wait. terminate_windows_process_tree dropped its
timeout_seconds parameter, which was documented but never used (Job
Object termination is immediate; the docstring now says so).
Cleanup in the same files: the escalation now lives in one function
with two named timeouts instead of three hardcoded 2.0s; the Job Object
is tracked in a WeakKeyDictionary instead of a monkey-patched private
attribute on anyio's Process; the deprecated terminate_windows_process
is removed (migration.md updated); the always-true CREATE_NO_WINDOW
hasattr dance and a retry path that double-spawned on spawn failure are
gone; the client's JSON parse error handler catches ValueError instead
of Exception.1 parent 8a6abc0 commit 023300c
4 files changed
Lines changed: 480 additions & 256 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
105 | 105 | | |
106 | 106 | | |
107 | 107 | | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
108 | 140 | | |
109 | 141 | | |
110 | 142 | | |
| |||
0 commit comments