Commit 023300c

Fix stdio client shutdown: cancellation deadlock, orphaned children, spurious kills
Four user-facing bugs in the stdio client's shutdown path, all rooted in the same design problem (the shutdown sequence depended on pipe state and was not protected from cancellation):

- Cancelling stdio_client (a client timeout, app shutdown) skipped the entire shutdown sequence: the first await in the finally block re-raised the pending cancellation, so stdin was never closed gracefully and the process tree was never terminated. Control then fell through to anyio's Process.aclose, whose shielded wait only returns once every pipe has closed, and a pipe inherited by a grandchild (npx- and uv-style servers always have one) never closes, so the client deadlocked forever. The shutdown now runs inside a shielded cancel scope with every wait time-bounded, and never relies on a pipe-gated wait.
- The grace-period check used process.wait(), which on the asyncio backend resolves only when the process has exited AND its pipes have closed. A well-behaved server that exited instantly on stdin closure but left a background child holding stdout was misclassified as hung, burned the full grace period, and got its tree terminated with a spurious warning. The wait now polls returncode, which reflects process death alone.
- Process-tree termination derived the group ID with os.getpgid(pid), which fails once the leader has been reaped even while its descendants are alive; the fallback then "terminated" the dead leader, leaking every descendant. Since the process is spawned with start_new_session=True, the pgid is by definition the leader's pid; use it directly, and treat ProcessLookupError from killpg as "group already gone" rather than a reason to fall back.
- Writing to a dead server's stdin surfaced a raw backend exception (ConnectionResetError inside an exception group) to the caller instead of a clean closed-stream signal. stdin_writer now treats BrokenResourceError and ConnectionError like ClosedResourceError.
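
The returncode polling and process-group handling described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: it uses the standard library's `asyncio` and `subprocess.Popen` in place of anyio, it omits the cancellation shielding, and the names `exited_within`, `signal_group`, `shut_down`, and both timeout constants (including their values) are assumptions made for the example. It is POSIX-only and expects a process started with `start_new_session=True`.

```python
import asyncio
import os
import signal
import subprocess

# Assumed values for illustration; the commit only says there are two named timeouts.
GRACE_PERIOD = 2.0       # time allowed for a clean exit after stdin closes
TERMINATE_TIMEOUT = 2.0  # time allowed after SIGTERM before escalating to SIGKILL


async def exited_within(proc: subprocess.Popen, timeout: float, interval: float = 0.01) -> bool:
    """Return True if the process dies within `timeout`, ignoring pipe state."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    # poll() reports process death alone; it does not wait for pipes to close.
    while proc.poll() is None:
        if loop.time() >= deadline:
            return False
        await asyncio.sleep(interval)
    return True


def signal_group(leader_pid: int, sig: int) -> bool:
    """Signal the group led by `leader_pid`; False means the group is already gone."""
    try:
        # With start_new_session=True the leader's pid is also the group ID,
        # so no os.getpgid() lookup (which fails for a reaped leader) is needed.
        os.killpg(leader_pid, sig)
    except ProcessLookupError:
        return False
    return True


async def shut_down(
    proc: subprocess.Popen,
    grace: float = GRACE_PERIOD,
    terminate_timeout: float = TERMINATE_TIMEOUT,
) -> str:
    """Close stdin, then escalate to SIGTERM and SIGKILL only if the server stays alive."""
    if proc.stdin is not None:
        try:
            proc.stdin.close()  # EOF on stdin is the shutdown signal
        except OSError:
            pass  # the server may already be gone
    if await exited_within(proc, grace):
        return "exited"  # well-behaved server: leave its children alone
    signal_group(proc.pid, signal.SIGTERM)
    if await exited_within(proc, terminate_timeout):
        return "terminated"
    signal_group(proc.pid, signal.SIGKILL)
    await exited_within(proc, terminate_timeout)
    return "killed"
```

Every wait in this sketch has a deadline, and none of them depends on a pipe closing, which is the property the commit message attributes to the new shutdown path.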
Windows fixes, by analysis (CI must validate): FallbackProcess.wait() ran Popen.wait() in a non-cancellable worker thread, so the timeout around the grace period could never fire and shutdown could hang indefinitely. It now polls cancellably, and returncode reflects death without requiring a wait. terminate_windows_process_tree dropped its timeout_seconds parameter, which was documented but never used (Job Object termination is immediate; the docstring now says so).

Cleanup in the same files: the escalation now lives in one function with two named timeouts instead of three hardcoded 2.0s; the Job Object is tracked in a WeakKeyDictionary instead of a monkey-patched private attribute on anyio's Process; the deprecated terminate_windows_process is removed (migration.md updated); the always-true CREATE_NO_WINDOW hasattr dance and a retry path that double-spawned on spawn failure are gone; the client's JSON parse error handler catches ValueError instead of Exception.
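
To illustrate why a polling wait lets the grace-period timeout fire, here is a small sketch of a `Popen` wrapper whose `wait()` can be cancelled. The class name `PollingProcess` is hypothetical and the code is not the SDK's `FallbackProcess`; it uses plain `asyncio` rather than anyio and only shows the general technique.

```python
import asyncio
import subprocess
from typing import Optional


class PollingProcess:
    """Hypothetical wrapper around subprocess.Popen with a cancellable wait()."""

    def __init__(self, popen: subprocess.Popen) -> None:
        self._popen = popen

    @property
    def returncode(self) -> Optional[int]:
        # poll() checks (and reaps) the child without blocking, so the exit
        # status is visible even if nobody has awaited wait() yet.
        return self._popen.poll()

    async def wait(self, interval: float = 0.01) -> int:
        # Each sleep is a cancellation point. A blocking Popen.wait() running
        # in a worker thread has none, so a surrounding timeout cannot stop it.
        while (code := self._popen.poll()) is None:
            await asyncio.sleep(interval)
        return code
```

Wrapping `wait()` in `asyncio.wait_for(proc.wait(), timeout=...)` then raises a timeout error while the child is still running, instead of blocking until the child exits.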
1 parent 8a6abc0 commit 023300c

4 files changed

Lines changed: 480 additions & 256 deletions

docs/migration.md

Lines changed: 32 additions & 0 deletions
@@ -105,6 +105,38 @@ The `headers`, `timeout`, `sse_read_timeout`, and `auth` parameters have been re
 
 Note: `sse_client` retains its `headers`, `timeout`, `sse_read_timeout`, and `auth` parameters — only the streamable HTTP transport changed.
 
+### `terminate_windows_process` removed
+
+The deprecated `mcp.os.win32.utilities.terminate_windows_process` function has been
+removed. Process termination is handled internally by the `stdio_client` context
+manager; there is no replacement API. The Windows tree-termination helper
+`terminate_windows_process_tree` no longer accepts a `timeout_seconds` argument —
+the value was never used (Job Object termination is immediate).
+
+### `stdio_client` no longer kills children of a gracefully-exited server on POSIX
+
+When a server exits on its own after `stdio_client` closes its stdin, background
+child processes the server leaves behind are no longer killed on POSIX — their
+lifetime is the server's business. The old behavior was a side effect of a shutdown
+wait gated on the stdio pipes closing rather than on process exit: a child holding
+an inherited pipe made a well-behaved server look hung, so its whole process tree
+was killed. A server that does not exit within the grace period is still terminated
+along with its entire process group. On Windows, children stay in the server's Job
+Object and are still killed at shutdown — now deterministically when the job handle
+is closed, rather than whenever the handle happened to be garbage-collected.
+
+If you relied on `stdio_client` killing everything the server spawned, make the
+server terminate its own children on shutdown (its stdin reaching EOF is the
+shutdown signal), or clean up the process tree from the host application after
+`stdio_client` exits.
+
+Two related shutdown refinements: `stdio_client` now closes its end of the pipes
+deterministically at shutdown, so a surviving child that keeps writing to an
+inherited stdout receives `EPIPE`/`SIGPIPE` once the client is gone (previously the
+pipe lingered until garbage collection); and a failed write to a server that is
+still running now surfaces as a closed connection (`CONNECTION_CLOSED`) on the read
+side instead of leaving requests waiting indefinitely.
+
 ### Removed type aliases and classes
 
 The following deprecated type aliases and classes have been removed from `mcp.types`:
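
The migration note in this diff advises servers to terminate their own children once stdin reaches EOF. A minimal sketch of that pattern, assuming a line-oriented server loop, might look like the following. The function name `serve_until_eof`, its parameters, and the 2.0 second default are hypothetical and not part of the SDK.

```python
import subprocess
from typing import Sequence, TextIO


def serve_until_eof(
    stdin: TextIO,
    children: Sequence[subprocess.Popen],
    grace: float = 2.0,
) -> int:
    """Handle messages until stdin hits EOF, then stop every child process."""
    handled = 0
    for _line in stdin:
        handled += 1  # a real server would parse and answer the message here

    # The loop ending means stdin reached EOF: the client is shutting down.
    for child in children:
        if child.poll() is None:
            child.terminate()  # ask politely first
    for child in children:
        try:
            child.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            child.kill()  # escalate if the child ignored the request
            child.wait()
    return handled
```

A server would call this with `sys.stdin` and the list of background processes it started, so that nothing it spawned outlives the session.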
