MCP server process death is never detected: no reconnect, initialized stays True, tool calls raise raw anyio.ClosedResourceError #6291

Description

@Jonnyton

Bug Description

On my own MCP server project I kept hitting the max tool call limit until I simplified the tool list, so I was curious whether there were similar tool list issues here. That's when I noticed that tools can go dead mid-session: if the MCP server process dies, nothing notices, and every tool call after that fails for the rest of the session. On a long voice call that is going to happen eventually, because servers restart.

What actually happens: the connection task never finds out the transport is gone, initialized keeps returning True, and every tool call for the rest of the session raises a raw anyio.ClosedResourceError with no message.

Root cause (traced on main, livekit-agents/livekit/agents/llm/mcp.py):

  1. _run_client (~L115) holds the ClientSession open and parks on await self._closing_ev.wait(). When the server process dies the transport streams close underneath it, but nothing links stream death to that wait: the connection task stays parked forever (step 4 in the repro output, task done=False).
  2. Because the task never unwinds, the finally: cleanup (~L138-141, self._client = None) never runs in this death mode, so the guarded ToolError("Tool invocation failed: internal service is unavailable...") (~L171, gated on _client is None) is never reached either.
  3. Tool calls therefore fall through to await self._client.call_tool(...) (~L175) and raise a raw anyio.ClosedResourceError (empty message) into the agent layer, i.e. into the LLM's tool-call turn.
  4. initialized (~L96) is just self._client is not None, so it keeps reporting True after death.
  5. Nothing observes _client_task except aclose() (~L203-205), and nothing ever re-runs initialize(): no reconnect, no backoff, no health event. Tools registered on an AgentSession at startup remain permanently dead handles for the rest of the session.
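
The lifecycle problem in points 1, 2 and 4 can be modeled without the MCP SDK. The following is a minimal sketch, not the code in mcp.py: FakeConnection, transport_dead and demo_watch are hypothetical names, and the sketch assumes the transport can expose some "streams closed" signal. It shows one way to link stream death to the wait, so that the finally: cleanup runs and initialized flips to False.

```python
import asyncio
from typing import Optional, Tuple


class FakeConnection:
    """Hypothetical stand-in for the MCP client wrapper (not the real class)."""

    def __init__(self) -> None:
        self.client: Optional[object] = None
        self.closing_ev = asyncio.Event()      # set by an explicit close
        self.transport_dead = asyncio.Event()  # assumed signal: streams closed

    @property
    def initialized(self) -> bool:
        return self.client is not None

    async def run_client(self) -> None:
        self.client = object()
        try:
            # Wait for EITHER an explicit close OR transport death,
            # instead of parking on the closing event alone.
            waiters = [
                asyncio.ensure_future(self.closing_ev.wait()),
                asyncio.ensure_future(self.transport_dead.wait()),
            ]
            _, pending = await asyncio.wait(
                waiters, return_when=asyncio.FIRST_COMPLETED
            )
            for task in pending:
                task.cancel()
        finally:
            # This cleanup now also runs when the server dies.
            self.client = None


async def demo_watch() -> Tuple[bool, bool, bool]:
    conn = FakeConnection()
    task = asyncio.create_task(conn.run_client())
    await asyncio.sleep(0)        # let run_client start
    before = conn.initialized
    conn.transport_dead.set()     # simulate the server process dying
    await asyncio.wait_for(task, timeout=1.0)
    return before, conn.initialized, task.done()
```

With this shape, a guard that checks whether the client is None becomes reachable after server death, because the connection task actually unwinds.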

Expected Behavior

I expected a reconnect attempt, or at least a clear error. mcp.py already contains a ToolError with the message "internal service is unavailable", but it never triggers when the server dies. Worse, the failure is silent: initialized keeps returning True, and tool calls raise anyio.ClosedResourceError with no message. At minimum this should fail loudly; better yet, the connection should recover and tool calls should keep working.
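
As an illustration of "fail loudly", here is a small sketch of translating the empty closed-stream error into a descriptive one at the call site. The two exception classes are local stand-ins for anyio.ClosedResourceError and the ToolError used in mcp.py, and call_tool_loudly, dead_ping and demo_loud are hypothetical names; the error message wording is also an assumption.

```python
import asyncio


class ClosedResourceError(Exception):
    """Stand-in for anyio.ClosedResourceError (raised with an empty message)."""


class ToolError(Exception):
    """Stand-in for the ToolError type used in mcp.py."""


async def call_tool_loudly(call, name: str):
    """Await call() and replace a bare closed-stream error with a descriptive one."""
    try:
        return await call()
    except ClosedResourceError as exc:
        raise ToolError(
            f"Tool '{name}' failed: MCP server connection is closed "
            "(the server process may have exited)"
        ) from exc


async def dead_ping():
    # Simulates a tool call made after the server process has died.
    raise ClosedResourceError()


async def demo_loud() -> str:
    try:
        await call_tool_loudly(dead_ping, "ping")
    except ToolError as exc:
        return str(exc)
    return "no error"
```

The original exception stays attached through `from exc`, so the raw cause remains available for debugging while the agent layer gets a readable message.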

Reproduction Steps

  1. Save the two files below in the same directory and run pip install "livekit-agents[mcp]"
  2. Run python repro.py
  3. Watch steps 4-6 of the output: the server process is dead, initialized still says True, and tool calls raise raw anyio errors

dying_server.py (a ping tool and a crash tool that exits the process, simulating any server death: crash, redeploy, OOM kill):

import os
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("dying-server")

@mcp.tool()
def ping() -> str:
    """Returns pong."""
    return "pong"

@mcp.tool()
def crash() -> str:
    """Simulates the server process dying (crash, restart, OOM kill...)."""
    os._exit(1)

if __name__ == "__main__":
    mcp.run(transport="stdio")

repro.py:

import asyncio
import sys
from pathlib import Path

from livekit.agents.llm.mcp import MCPServerStdio

SERVER = str(Path(__file__).parent / "dying_server.py")

async def main() -> None:
    server = MCPServerStdio(command=sys.executable, args=[SERVER])
    await server.initialize()
    tools = await server.list_tools()
    print(f"1) initialized={server.initialized}, tools={[t.info.name for t in tools]}")

    ping = next(t for t in tools if t.info.name == "ping")
    crash = next(t for t in tools if t.info.name == "crash")

    print("2) ping ->", await ping(raw_arguments={}))

    try:
        await crash(raw_arguments={})   # the server process exits here
    except Exception as e:
        print(f"3) crash tool raised {type(e).__name__}: {str(e)[:60]}")

    await asyncio.sleep(1.0)
    print(f"4) one second after server death: initialized={server.initialized} "
          f"(connection task done={server._client_task.done()})")

    for n in (5, 6):
        try:
            await ping(raw_arguments={})
        except Exception as e:
            print(f"{n}) ping -> {type(e).__module__}.{type(e).__name__}: {str(e)[:70]!r}")
        await asyncio.sleep(0.5)

    print("7) no reconnect was attempted; initialized still", server.initialized)

asyncio.run(main())

Output:

1) initialized=True, tools=['ping', 'crash']
2) ping -> {"type":"text","text":"pong","annotations":null,"meta":null}
3) crash tool raised McpError: Connection closed
4) one second after server death: initialized=True (connection task done=False)
5) ping -> anyio.ClosedResourceError: ''
6) ping -> anyio.ClosedResourceError: ''
7) no reconnect was attempted; initialized still True

Step 4 is the heart of it: a full second after the server process is gone, the connection task has not observed the death.

Operating System

Linux (repro is OS-independent)

Models Used

None involved (pure MCP client lifecycle)

Package Versions

livekit-agents 1.6.4 (also reproduced identically against main @ e2b2d09, 2026-07-02)
mcp: stdio transport (streamable-HTTP shares the same lifecycle path)
Python 3.10.12

Session/Room/Call IDs

No response

Proposed Solution

The fix looks simple and is partly written already; it just never runs. Something needs to watch the connection so that, when the transport dies, the existing ToolError path is reached and the failure is loud. Once the death is detected, the obvious next step is to reconnect.
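
For the reconnect step, a rough sketch of retrying with exponential backoff follows. It is not tied to the real class: connect stands in for whatever re-runs initialize(), reconnect_with_backoff and make_flaky_connect are hypothetical names, and the exception types caught and the delay defaults are assumptions.

```python
import asyncio
from typing import Awaitable, Callable


async def reconnect_with_backoff(
    connect: Callable[[], Awaitable[None]],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> int:
    """Retry connect() with exponential backoff; return the attempts used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            await connect()
            return attempt
        except (ConnectionError, OSError):
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, capped


def make_flaky_connect(failures: int) -> Callable[[], Awaitable[None]]:
    """Build a fake connect() that fails a fixed number of times, then succeeds."""
    state = {"left": failures}

    async def connect() -> None:
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("server not up yet")

    return connect
```

For example, `asyncio.run(reconnect_with_backoff(make_flaky_connect(2), base_delay=0.01))` succeeds on the third attempt. Capping the delay keeps a long outage from pushing retries arbitrarily far apart during a live call.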

Additional Context

No response

Screenshots and Recordings

No response


Labels: bug (Something isn't working)
