fix(core): 16 MiB worker stack so subagent delegation doesn't crash the JSON-RPC server #3155

Closed
sanil-23 wants to merge 2 commits into tinyhumansai:main from sanil-23:fix/subagent-stack-overflow

Conversation

@sanil-23 sanil-23 commented Jun 1, 2026

Summary

  • Fix openhuman-core crashing (SIGABRT / stack overflow) whenever the orchestrator delegates to a sub-agent.
  • This is the root cause of the Playwright web E2E lane 1/4 failing on every PR (and likely real user crashes with large tool surfaces).
  • One-line runtime change: give the JSON-RPC server's tokio workers a 16 MiB stack.

Problem

openhuman-core run builds its tokio runtime with the default 2 MiB per-worker-thread stack (src/core/cli.rs). A single agent turn is already a huge async state machine (17.6 KB system prompt + hundreds of tool specs + the nested provider/tool loop). When the orchestrator delegates to a sub-agent (e.g. a research tool → subagent_runner dispatches researcher at spawn_depth=1), a second full turn runs one level down on the same worker stack, overflowing 2 MiB and aborting the entire process:

[agent] executing tool: research
[subagent_runner] dispatching agent_id=researcher … spawn_depth=1 max_spawn_depth=3
thread 'tokio-rt-worker' (…) has overflowed its stack
fatal runtime error: stack overflow, aborting

The code already tries to mitigate this by Box::pin-ing the inner sub-agent future (subagent_runner/ops.rs, ref #2234), but the parent orchestrator turn is itself on the worker stack, so one level of nesting still overflows.
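To make that limitation concrete, here is a minimal std-only sketch, not the project's code: `inner_turn`, `outer_inline` and `outer_boxed` are hypothetical stand-ins for an agent turn and its caller. It compares the size of a future that awaits a nested turn inline with one that boxes it first.

```rust
use std::mem::size_of_val;

// Hypothetical stand-in for one agent turn: a large local is kept alive
// across an await point, so it becomes part of the future's state machine.
async fn inner_turn() -> usize {
    let scratch = [1u8; 64 * 1024];
    std::future::ready(()).await;
    scratch.iter().map(|&b| b as usize).sum()
}

// Awaiting inline embeds the inner state machine in the outer future.
async fn outer_inline() -> usize {
    inner_turn().await
}

// Boxing moves the inner state machine to the heap; the outer future
// only stores a pointer to it.
async fn outer_boxed() -> usize {
    Box::pin(inner_turn()).await
}

// Returns (size of the inline outer future, size of the boxed outer future).
fn future_sizes() -> (usize, usize) {
    (size_of_val(&outer_inline()), size_of_val(&outer_boxed()))
}

fn main() {
    let (inline, boxed) = future_sizes();
    println!("inline: {inline} bytes, boxed: {boxed} bytes");
}
```

Boxing shrinks only the future that gets boxed. Whatever the parent turn holds and whatever frames are live while it is being polled are unaffected, which is consistent with the overflow still occurring at one level of nesting.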

In CI this shows up as Playwright lane 1/4: chat-harness-subagent times out (the orchestrator's final text never renders because the core died mid-turn), then ~26 subsequent specs cascade-fail with ECONNREFUSED 127.0.0.1:17788 — they share the now-dead core (workers: 1, no auto-restart). So "27 failed" is really 1 crash + 26 collateral.

This is not test-only: any real deployment (desktop-via-CLI, docker, cloud) can hit the same abort on a legitimate sub-agent delegation with a large tool surface.

Solution

Set thread_stack_size(16 * 1024 * 1024) on the serve runtime so a subagent-nested agent turn fits comfortably:

let rt = tokio::runtime::Builder::new_multi_thread()
    .enable_all()
    .thread_stack_size(16 * 1024 * 1024)
    .build()?;
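The effect of a larger per-thread stack can be illustrated without tokio. The following is a sketch using only `std::thread::Builder::stack_size`, which is the same kind of knob that `thread_stack_size` sets for the runtime's worker threads. `burn_stack` and `run_on_big_stack` are hypothetical helpers, and the recursion depth is an arbitrary choice meant to need more than 2 MiB of stack in an unoptimized build.

```rust
use std::thread;

// Same size as the PR's setting: 16 MiB.
const STACK_SIZE: usize = 16 * 1024 * 1024;

// Hypothetical stack consumer: each frame keeps a 1 KiB buffer alive,
// so `depth` frames need roughly `depth` KiB of stack or more.
fn burn_stack(depth: usize) -> usize {
    let frame = std::hint::black_box([1u8; 1024]);
    if depth == 0 {
        return frame[0] as usize;
    }
    burn_stack(depth - 1) + frame[frame.len() - 1] as usize
}

// Runs the recursion on a dedicated thread with an explicit stack size
// and returns the number of frames that were entered.
fn run_on_big_stack(depth: usize) -> usize {
    thread::Builder::new()
        .name("big-stack-worker".into())
        .stack_size(STACK_SIZE)
        .spawn(move || burn_stack(depth))
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}

fn main() {
    let frames = run_on_big_stack(3000);
    println!("completed {frames} nested frames on a 16 MiB stack");
}
```

The stack is reserved address space rather than committed memory, so a thread that never recurses this deeply does not pay for the full 16 MiB.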

Verification (local, real stack, isolated ports)

Built the standalone core + web bundle, ran the real openhuman-core + mock backend + web host on isolated ports (core 27788, mock 28473, web 4273), and drove the chat-harness-subagent spec:

| | Before | After |
|---|---|---|
| Spec | ✘ fails (~52s, canary never renders) | passes (7.4s) |
| Core process | Abort trap: 6, overflowed its stack | stays alive; no overflow in core.log |

With the core no longer dying, the ECONNREFUSED cascade across lane 1/4 disappears.
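When triaging a run like this, it can help to tell "the core is dead" apart from "this spec is broken". A minimal sketch of such a liveness probe follows, using only the standard library. `core_is_reachable` is a hypothetical helper, not part of the repository, and the 500 ms timeout is an arbitrary assumption.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

// Hypothetical helper: true if something accepts TCP connections at `addr`.
// A refused connection (ECONNREFUSED) or a timeout both count as unreachable.
fn core_is_reachable(addr: SocketAddr) -> bool {
    TcpStream::connect_timeout(&addr, Duration::from_millis(500)).is_ok()
}

fn main() {
    // 127.0.0.1:17788 is the address seen in the ECONNREFUSED failures.
    let addr: SocketAddr = "127.0.0.1:17788".parse().expect("valid socket address");
    if core_is_reachable(addr) {
        println!("core is accepting connections on {addr}");
    } else {
        println!("core is not reachable on {addr}; later failures are likely collateral");
    }
}
```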

Submission Checklist

  • Tests added or updated: N/A. Covered by the existing chat-harness-subagent Playwright spec, which now passes; this is a runtime-config fix with no new unit-testable surface.
  • Diff coverage ≥ 80%: N/A. The changed lines are tokio runtime-builder config, exercised by the e2e lane (not unit-coverable).
  • Coverage matrix updated: N/A. No feature row added/removed/renamed.
  • All affected feature IDs listed under ## Related: N/A.
  • No new external network dependencies introduced.
  • Manual smoke checklist updated if this touches release-cut surfaces: N/A.
  • Linked issue closed via Closes #NNN: N/A. No tracking issue.

Impact

  • Platform: the standalone/CLI/JSON-RPC server (openhuman-core run) — used by the Playwright web E2E lane, docker, and cloud deployments. Memory: +14 MiB of reserved (not committed) stack per worker thread.
  • CI: fixes Playwright lane 1/4 (and removes the cascade that masked it).
  • Product: prevents a real core crash on sub-agent delegation under a large tool surface.
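The memory figure above scales with the number of worker threads. As a rough sketch of that arithmetic: tokio's multi-thread runtime defaults to one worker per available core, which is approximated here with `std::thread::available_parallelism`. `extra_reserved_bytes` is a hypothetical helper for illustration only.

```rust
use std::thread::available_parallelism;

const MIB: usize = 1024 * 1024;

// Hypothetical helper: additional reserved (not committed) stack across all
// workers when the per-worker stack grows from `old_stack` to `new_stack`.
fn extra_reserved_bytes(workers: usize, old_stack: usize, new_stack: usize) -> usize {
    workers * new_stack.saturating_sub(old_stack)
}

fn main() {
    // Assumption: the worker count equals the available parallelism.
    let workers = available_parallelism().map(|n| n.get()).unwrap_or(1);
    let extra = extra_reserved_bytes(workers, 2 * MIB, 16 * MIB);
    println!("{workers} workers: +{} MiB reserved stack", extra / MIB);
}
```

On an 8-core host this works out to 8 × 14 MiB = 112 MiB of additional reserved address space.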

Related

Summary by CodeRabbit

  • Bug Fixes
    • Fixed process crashes that could occur during complex nested operations by improving internal stack management for better reliability.

… subagent delegation

The standalone `openhuman-core run` server builds its tokio runtime with the
default 2 MiB per-worker-thread stack. A single agent turn is already a very
large async state machine (system prompt + hundreds of tool specs + the nested
provider/tool loop); delegating to a sub-agent runs another full turn one level
down. Even with the inner sub-agent future boxed (`subagent_runner::ops`, see
tinyhumansai#2234), that nesting overflows the 2 MiB stack and aborts the whole process:

    thread 'tokio-rt-worker' (...) has overflowed its stack
    fatal runtime error: stack overflow, aborting        (SIGABRT)

This takes the JSON-RPC server down mid-request. In the Playwright web E2E lane
it manifests as `chat-harness-subagent` timing out (the orchestrator's final
text never renders) followed by a cascade of `ECONNREFUSED` failures across
every subsequent spec in the worker, because they all share the now-dead core.

Set `thread_stack_size(16 MiB)` on the serve runtime so a subagent-nested agent
turn fits comfortably.

Reproduced and verified locally by driving the real `openhuman-core` + mock +
web stack on isolated ports and running the subagent spec:
- before: core aborts ("overflowed its stack"), spec fails at ~52s
- after:  core stays alive, spec passes in 7.4s, no overflow in core.log

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 requested a review from a team June 1, 2026 14:41
coderabbitai Bot commented Jun 1, 2026

Walkthrough

The PR increases the per-worker thread stack size in the Tokio multi-threaded runtime from the default 2 MiB to 16 MiB. A comment documents that delegating to nested "agent turns" in async contexts can cause stack overflow, necessitating this runtime configuration adjustment in run_server_command.

Changes

Tokio Stack Configuration

| Layer / File(s) | Summary |
|---|---|
| Runtime stack size configuration (src/core/cli.rs) | The Tokio runtime builder configures thread_stack_size(16 * 1024 * 1024) with an explanatory comment describing that nested agent turns can overflow the default 2 MiB worker-thread stack. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • tinyhumansai/openhuman#2069: Both PRs directly address Tokio worker stack overflow by increasing the multi-thread runtime's per-worker stack size (thread_stack_size)—this PR updates run_server_command in src/core/cli.rs, while the related PR sets a custom runtime in app/src-tauri/src/lib.rs::run() with a larger stack.

Suggested labels

working

Poem

🐰 A rabbit built a stack so deep,
With 16 megs to safely keep,
Agent turns would overflow before,
Now the threads can ask for more! 🏗️

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: increasing worker stack size to 16 MiB to prevent crashes from subagent delegation in the JSON-RPC server. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai Bot added the "working" label (A PR that is being worked on by the team.) Jun 1, 2026
@graycyrus graycyrus left a comment

@sanil-23 hey — code looks correct and the diagnosis is solid. The thread_stack_size(16 MiB) fix is the right call here; async state machines that deep will overflow 2 MiB once you add a second nesting level regardless of the inner Box::pin.

One thing worth following up on before this goes in: the PR notes the in-process desktop (Tauri) runtime doesn't get this same fix and is still exposed. That should be a filed issue before merge so it doesn't get lost — it's the same crash risk on a different codepath. If you can drop an issue number in the PR description or "Related" section that'd be ideal.

CI is failing (Rust Core Coverage) and Frontend Coverage is still pending. Once those are green I'll come back and approve. Let me know if you need anything.

sanil-23 commented Jun 1, 2026

Thanks @graycyrus! Filed #3159 for the in-process desktop (Tauri) runtime exposure and linked it in the description's Related section. The failing Rust Core Coverage is a pre-existing chain unrelated to this diff — being addressed in #3156. Frontend Coverage and lane 1/4 are green here.

Labels

working A PR that is being worked on by the team.

3 participants