fix(core): 16 MiB worker stack so subagent delegation doesn't crash the JSON-RPC server#3155
sanil-23 wants to merge 2 commits
… subagent delegation

The standalone `openhuman-core run` server builds its tokio runtime with the default 2 MiB per-worker-thread stack. A single agent turn is already a very large async state machine (system prompt + hundreds of tool specs + the nested provider/tool loop); delegating to a sub-agent runs another full turn one level down. Even with the inner sub-agent future boxed (`subagent_runner::ops`, see tinyhumansai#2234), that nesting overflows the 2 MiB stack and aborts the whole process:

    thread 'tokio-rt-worker' (...) has overflowed its stack
    fatal runtime error: stack overflow, aborting (SIGABRT)

This takes the JSON-RPC server down mid-request. In the Playwright web E2E lane it manifests as `chat-harness-subagent` timing out (the orchestrator's final text never renders) followed by a cascade of `ECONNREFUSED` failures across every subsequent spec in the worker, because they all share the now-dead core.

Set `thread_stack_size(16 MiB)` on the serve runtime so a subagent-nested agent turn fits comfortably.

Reproduced and verified locally by driving the real `openhuman-core` + mock + web stack on isolated ports and running the subagent spec:

- before: core aborts ("overflowed its stack"), spec fails at ~52s
- after: core stays alive, spec passes in 7.4s, no overflow in core.log

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Walkthrough

The PR increases the per-worker thread stack size in the Tokio multi-threaded runtime from the default 2 MiB to 16 MiB. A comment documents that delegating to nested "agent turns" in async contexts can cause a stack overflow, which is why this runtime configuration adjustment is needed.
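The effect of a larger thread stack can be sketched without tokio. The example below is an analogy under stated assumptions, not the PR's diff: it uses only the standard library's `std::thread::Builder::stack_size`, whereas the PR sets the equivalent knob on the runtime through tokio's `Builder::thread_stack_size`. The names `descend` and `run_on_big_stack`, the 1 KiB frame size, and the recursion depth are all hypothetical and chosen only so that the recursion needs roughly 3 MiB of stack, more than a 2 MiB stack would offer.

```rust
use std::hint::black_box;
use std::thread;

/// Stack size requested for the worker thread (16 MiB, the value used in the PR).
const STACK_BYTES: usize = 16 * 1024 * 1024;
/// Hypothetical per-call frame payload; a stand-in for a large nested poll frame.
const FRAME_BYTES: usize = 1024;

/// Recurses `depth` levels, keeping a buffer alive in every frame.
/// Returns the number of non-base levels visited.
fn descend(depth: usize) -> usize {
    let frame = [1u8; FRAME_BYTES];
    if depth == 0 {
        return 0;
    }
    let below = descend(depth - 1);
    // Reading the buffer after the recursive call keeps it live across the call.
    below + black_box(&frame)[0] as usize
}

/// Runs the recursion on a dedicated thread with an enlarged stack.
fn run_on_big_stack(depth: usize) -> usize {
    thread::Builder::new()
        .name("big-stack-worker".to_string())
        .stack_size(STACK_BYTES)
        .spawn(move || descend(depth))
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}

fn main() {
    // 3000 frames of at least 1 KiB each need about 3 MiB of stack.
    let depth = 3000;
    println!("reached depth {}", run_on_big_stack(depth));
}
```

The stack size is a reservation made when the thread is created, so it has to be chosen up front; that is presumably why the fix lives in the runtime builder rather than at the delegation site.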
graycyrus left a comment
@sanil-23 hey — code looks correct and the diagnosis is solid. The thread_stack_size(16 MiB) fix is the right call here; async state machines that deep will overflow 2 MiB once you add a second nesting level regardless of the inner Box::pin.
One thing worth following up on before this goes in: the PR notes the in-process desktop (Tauri) runtime doesn't get this same fix and is still exposed. That should be a filed issue before merge so it doesn't get lost — it's the same crash risk on a different codepath. If you can drop an issue number in the PR description or "Related" section that'd be ideal.
CI is failing (Rust Core Coverage) and Frontend Coverage is still pending. Once those are green I'll come back and approve. Let me know if you need anything.
Thanks @graycyrus! Filed #3159 for the in-process desktop (Tauri) runtime exposure and linked it in the description's Related section. The failing Rust Core Coverage is a pre-existing chain unrelated to this diff and is being addressed in #3156. Frontend Coverage and lane 1/4 are green here.
Summary

- … `openhuman-core` crashing (SIGABRT / stack overflow) whenever the orchestrator delegates to a sub-agent.

Problem
`openhuman-core run` builds its tokio runtime with the default 2 MiB per-worker-thread stack (`src/core/cli.rs`). A single agent turn is already a huge async state machine (17.6 KB system prompt + hundreds of tool specs + the nested provider/tool loop). When the orchestrator delegates to a sub-agent (e.g. a `research` tool → `subagent_runner` dispatches `researcher` at `spawn_depth=1`), a second full turn runs one level down on the same worker stack, overflowing 2 MiB and aborting the entire process.

The code already tries to mitigate this by `Box::pin`-ing the inner sub-agent future (`subagent_runner/ops.rs`, ref #2234), but the parent orchestrator turn is itself on the worker stack, so one level of nesting still overflows.

In CI this shows up as Playwright lane 1/4: `chat-harness-subagent` times out (the orchestrator's final text never renders because the core died mid-turn), then ~26 subsequent specs cascade-fail with `ECONNREFUSED 127.0.0.1:17788` because they share the now-dead core (`workers: 1`, no auto-restart). So "27 failed" is really 1 crash + 26 collateral.

This is not test-only: any real deployment (desktop-via-CLI, docker, cloud) can hit the same abort on a legitimate sub-agent delegation with a large tool surface.
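To make the `Box::pin` mitigation mentioned above concrete, here is a minimal sketch of what boxing a nested future changes. It is illustrative only and assumes nothing about the real `subagent_runner` code: `agent_turn`, `delegate_inline`, `delegate_boxed`, and the 64 KiB state size are hypothetical stand-ins. It measures the futures without running them, so it shows only the size of the async state machine, not actual stack usage at poll time.

```rust
use std::future::{ready, Future};
use std::mem::size_of_val;
use std::pin::Pin;

/// Made-up amount of state that one turn keeps alive across an await point.
const TURN_STATE_BYTES: usize = 64 * 1024;

/// Hypothetical "agent turn": the buffer is read after an await,
/// so the compiler has to store it inside the future itself.
async fn agent_turn() -> u8 {
    let state = [7u8; TURN_STATE_BYTES];
    ready(()).await;
    state[0]
}

/// Delegation with the child future embedded inline in the parent.
async fn delegate_inline() -> u8 {
    agent_turn().await
}

/// Delegation with the child future moved to the heap first.
async fn delegate_boxed() -> u8 {
    let child: Pin<Box<dyn Future<Output = u8>>> = Box::pin(agent_turn());
    child.await
}

/// Returns (inline parent size, boxed parent size) in bytes.
/// The futures are only constructed and measured, never polled.
fn parent_future_sizes() -> (usize, usize) {
    (size_of_val(&delegate_inline()), size_of_val(&delegate_boxed()))
}

fn main() {
    let (inline, boxed) = parent_future_sizes();
    println!("inline parent future: {inline} bytes");
    println!("boxed parent future:  {boxed} bytes");
}
```

The inline parent carries the child's whole state, while the boxed parent holds little more than a pointer. Boxing therefore shrinks what the parent drags around, but as the description notes, the parent turn's own frames still sit on the worker stack, which is why this PR raises the stack size in addition.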
Solution
Set `thread_stack_size(16 * 1024 * 1024)` on the serve runtime so a subagent-nested agent turn fits comfortably.

Verification (local, real stack, isolated ports)
Built the standalone core + web bundle, ran the real `openhuman-core` + mock backend + web host on isolated ports (core `27788`, mock `28473`, web `4273`), and drove the `chat-harness-subagent` spec. Before the change, the core aborted (`Abort trap: 6`, `overflowed its stack`).

With the core no longer dying, the `ECONNREFUSED` cascade across lane 1/4 disappears.

Submission Checklist
- N/A: covered by the existing `chat-harness-subagent` Playwright spec, which now passes; this is a runtime-config fix with no new unit-testable surface.
- N/A: the changed lines are tokio runtime-builder config, exercised by the e2e lane (not unit-coverable).
- N/A: no feature row added/removed/renamed.
- `## Related`: N/A.
- N/A.
- `Closes #NNN`: N/A, no tracking issue.

Impact
- … `openhuman-core run`): used by the Playwright web E2E lane, docker, and cloud deployments.
- Memory: +14 MiB of reserved (not committed) stack per worker thread.

Related
- … `Box::pin` mitigation, insufficient on its own).