fix(mcp): reap serve --mcp child when parent is SIGKILL'd (#277) #286

Open
evanclan wants to merge 1 commit into colbymchenry:main from evanclan:fix/mcp-ppid-watchdog-277

@evanclan

Summary

Adds a process.ppid watchdog to MCPServer.start() so a codegraph serve --mcp child terminates when its MCP host is force-killed. Resolves #277.

Problem

The existing shutdown path (src/mcp/index.ts) leans entirely on signal handlers and stdin close events:

```75:79:src/mcp/index.ts
process.on('SIGINT', () => this.stop());
process.on('SIGTERM', () => this.stop());

// When the parent process (Claude Code) exits, stdin closes.
// Detect this and shut down gracefully to prevent orphaned processes.

```

On Linux that's not enough when the host (Claude Code, opencode, …) is SIGKILL'd by the OOM killer / a `kill -9` / a container teardown:

  • The kernel does not propagate parent death to children.
  • The child gets reparented to init/systemd.
  • Whether the half-closed stdio pipe surfaces `end` / `close` to Node depends on whether anyone else still holds the write end. In the reporter's environment it didn't fire — three orphan `codegraph serve --mcp` processes were pinned across sessions, each holding ~440k inotify watches (which then transitively tripped #276's watch-budget exhaustion in Next.js / IDEs).

Solution

Capture `process.ppid` once at construction, then poll it on a `setInterval` (default 5s, `.unref()`'d so it never keeps the event loop alive on its own). As soon as a poll sees the value diverge from the baseline, we know the original parent has died and we tear down cleanly:

```text
[CodeGraph MCP] Parent process exited (ppid 9177 -> 1); shutting down.
```

Cross-platform: reparenting changes `process.ppid` on Linux and macOS; on Windows the value drops to 0 once the parent is gone, which also trips the check.
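
The polling loop described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `startPpidWatchdog`, `PpidWatchdog`, and the injectable `getPpid` parameter are hypothetical names introduced here so the check can be exercised without killing a real parent.

```typescript
// Hypothetical names throughout; the PR wires this logic into MCPServer.start().
interface PpidWatchdog {
  check(): boolean; // runs one poll; true if the parent changed on this call
  stop(): void;     // cancels the interval
}

function startPpidWatchdog(
  pollMs: number,
  onParentExit: (baseline: number, current: number) => void,
  getPpid: () => number = () => process.ppid, // injectable so tests can fake reparenting
): PpidWatchdog {
  const baseline = getPpid(); // captured once, before any polling
  let fired = false;

  const check = (): boolean => {
    if (fired) return false;
    const current = getPpid();
    if (current === baseline) return false;
    fired = true;             // report the divergence at most once
    clearInterval(timer);
    onParentExit(baseline, current);
    return true;
  };

  const timer = setInterval(check, pollMs);
  timer.unref();              // never keep the event loop alive on its own

  return { check, stop: () => clearInterval(timer) };
}
```

In this sketch the callback would be the place to write the log line shown above and call `stop()` on the server. Worst-case detection latency is one poll period, since the divergence is only noticed on the next tick.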

Knobs:

  • `CODEGRAPH_PPID_POLL_MS` env var overrides the poll interval (default 5000ms).
  • `CODEGRAPH_PPID_POLL_MS=0` disables the watchdog entirely — escape hatch for embedded scenarios where the parent deliberately re-parents the server.
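
One way the env-var handling above might look is sketched below. `resolvePpidPollMs` and `DEFAULT_PPID_POLL_MS` are hypothetical names, and the fallback for malformed or negative values is an assumption of this sketch; the PR only specifies the default (5000ms) and that `0` disables the watchdog.

```typescript
const DEFAULT_PPID_POLL_MS = 5000; // default stated in the PR description

// Hypothetical parser: returns the poll period in ms, or 0 when the watchdog is disabled.
function resolvePpidPollMs(
  env: Record<string, string | undefined> = process.env,
): number {
  const raw = env.CODEGRAPH_PPID_POLL_MS;
  if (raw === undefined || raw.trim() === '') return DEFAULT_PPID_POLL_MS;

  const parsed = Number(raw);
  // Assumption: malformed or negative input falls back to the default
  // instead of silently disabling the watchdog.
  if (!Number.isFinite(parsed) || parsed < 0) return DEFAULT_PPID_POLL_MS;

  return Math.floor(parsed); // 0 means "do not start the watchdog"
}
```

A caller would then skip creating the interval whenever the resolved value is `0`.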

`stop()` is now guarded by an idempotency flag so the watchdog can't race the existing stdin-close handlers and double-close the SQLite handle / transport.
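
The idempotency guard can be illustrated with a small stand-in class. `StoppableServer` and `Closable` are hypothetical; they only model the idea that the SQLite handle and transport must each be closed exactly once, whichever shutdown path gets there first.

```typescript
// Hypothetical stand-in for the resources the server owns (SQLite handle, transport).
interface Closable {
  close(): void;
}

class StoppableServer {
  private stopped = false; // idempotency flag
  private readonly resources: Closable[];

  constructor(resources: Closable[]) {
    this.resources = resources;
  }

  // Returns true only for the call that actually performed the shutdown.
  stop(): boolean {
    if (this.stopped) return false; // later callers are a no-op
    this.stopped = true;            // set first so re-entrant calls bail out
    for (const resource of this.resources) resource.close();
    return true;
  }
}
```

Setting the flag before closing anything matters: if a `close()` call synchronously triggers another shutdown handler, that handler sees the flag and returns immediately.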

Out of scope

  • The companion inotify-watch exhaustion (#276) — this PR removes the orphan-server multiplier on the watch budget, but the underlying `fs.watch({ recursive: true })` registering-everything behavior is its own change.
  • `prctl(PR_SET_PDEATHSIG)` (the issue's suggested fix #2): pure-JS polling is enough for a failure mode that's already rare, with no native build dependency needed.

Test plan

  • New `tests/mcp-ppid-watchdog.test.ts` stands up a four-process tree (vitest → wrapper → {stdin-holder, codegraph}) and SIGKILLs the wrapper. The stdin-holder is a long-lived sibling whose `stdout` pipe is dup'd into codegraph's `stdin`, so the wrapper's death does not transitively close codegraph's stdin. That isolates the watchdog from the pre-existing stdin-close path; this was confirmed by temporarily disabling the watchdog (test fails) vs leaving it in (test passes).
  • 5/5 consecutive local runs at ~925ms each, no flakes.
  • `npx vitest run tests/mcp-ppid-watchdog.test.ts tests/mcp-initialize.test.ts tests/mcp-roots.test.ts` — 6/6 passing.
  • Full `npm test`: 718 passing, 5 failing in `git-hooks.test.ts` / `watcher.test.ts` — both pre-existing on `main` (verified by re-running on `origin/main` HEAD) and unrelated to this change.
  • `npx tsc --noEmit` clean.
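
For readers wondering how a test can observe that the orphaned server actually exited, one common approach is a signal-0 liveness probe. The helpers below are a hypothetical sketch and not taken from `tests/mcp-ppid-watchdog.test.ts`; the description does not say how that test detects the exit.

```typescript
// Hypothetical test helpers; names and defaults are illustrative only.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check only, nothing is delivered
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it.
    return (err as NodeJS.ErrnoException).code === 'EPERM';
  }
}

// Resolves true if the process disappears within timeoutMs, false otherwise.
async function waitForExit(pid: number, timeoutMs: number, pollMs = 50): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (!isProcessAlive(pid)) return true;
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
  return !isProcessAlive(pid);
}
```

A zombie that has not yet been reaped still counts as alive under this probe, so the timeout should leave room for the new parent to reap the exited child.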

Related

Made with Cursor

…ry#277)

On Linux the kernel doesn't propagate parent death to children, and the
existing `stdin.on('end' | 'close')` handlers don't always fire when an
MCP host (Claude Code, opencode, …) is force-killed by the OOM killer,
a `kill -9`, or a container teardown. The reporter in colbymchenry#277 ended up
with three orphan `codegraph serve --mcp` processes pinned across
sessions, each holding its own inotify watch set (~440k watches),
which then tripped colbymchenry#276's watch-budget exhaustion in unrelated tools
(Next.js, IDEs).

Capture `process.ppid` once at server construction, poll it on a
`setInterval`, and shut down the moment it diverges from that baseline.
The interval is `.unref()`'d so it never holds the event loop open on
its own; the poll period is `CODEGRAPH_PPID_POLL_MS` (default 5000ms,
`0` disables for embedded hosts that re-parent on purpose).

The regression test stands up a four-process tree
(vitest → wrapper → {stdin-holder, codegraph}) so the wrapper's
SIGKILL doesn't transitively close codegraph's stdin (sibling
stdin-holder keeps the pipe's write-end alive). That isolates the
watchdog from the pre-existing stdin-close path: the test fails
without the watchdog and passes with it.
