fix(core): 16 MiB worker stack so subagent delegation doesn't crash the JSON-RPC server by sanil-23 · Pull Request #3155 · tinyhumansai/openhuman

sanil-23 · 2026-06-01T14:41:00Z

Summary

Fix openhuman-core crashing (SIGABRT / stack overflow) whenever the orchestrator delegates to a sub-agent.
This is the root cause of the Playwright web E2E lane 1/4 failing on every PR (and likely real user crashes with large tool surfaces).
One-line runtime change: give the JSON-RPC server's tokio workers a 16 MiB stack.

Problem

openhuman-core run builds its tokio runtime with the default 2 MiB per-worker-thread stack (src/core/cli.rs). A single agent turn is already a huge async state machine (17.6 KB system prompt + hundreds of tool specs + the nested provider/tool loop). When the orchestrator delegates to a sub-agent (e.g. a research tool → subagent_runner dispatches researcher at spawn_depth=1), a second full turn runs one level down on the same worker stack, overflowing 2 MiB and aborting the entire process:

[agent] executing tool: research
[subagent_runner] dispatching agent_id=researcher … spawn_depth=1 max_spawn_depth=3
thread 'tokio-rt-worker' (…) has overflowed its stack
fatal runtime error: stack overflow, aborting

The code already tries to mitigate this by Box::pin-ing the inner sub-agent future (subagent_runner/ops.rs, ref #2234), but the parent orchestrator turn is itself on the worker stack, so one level of nesting still overflows.
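To illustrate why boxing the inner future helps but does not shrink the parent turn, here is a minimal std-only sketch. The functions `leaf`, `sub_turn`, `parent_inline`, `parent_boxed`, and `future_sizes` are hypothetical stand-ins, not code from `subagent_runner/ops.rs`. It assumes a sub-turn that holds 4 KiB of state across an await point and compares the size of the parent future with and without `Box::pin`.

```rust
use std::future::Future;
use std::mem::size_of_val;
use std::pin::Pin;

/// Trivial await point, so the buffer below must be kept across an `.await`.
async fn leaf() {}

/// Hypothetical stand-in for a sub-agent turn carrying 4 KiB of state.
async fn sub_turn() -> u8 {
    let state = [7u8; 4096];
    leaf().await;
    state[0]
}

/// Parent awaits the sub-turn inline: the sub-turn's state machine is
/// embedded directly in the parent's state machine.
async fn parent_inline() -> u8 {
    sub_turn().await
}

/// Parent boxes the sub-turn first: only a pointer-sized handle is stored
/// in the parent, and the 4 KiB of state lives on the heap.
async fn parent_boxed() -> u8 {
    let inner: Pin<Box<dyn Future<Output = u8>>> = Box::pin(sub_turn());
    inner.await
}

/// Returns (size of inline parent future, size of boxed parent future) in bytes.
/// The futures are only measured here, never polled.
fn future_sizes() -> (usize, usize) {
    let inline = parent_inline();
    let boxed = parent_boxed();
    (size_of_val(&inline), size_of_val(&boxed))
}
```

In this sketch the boxed parent is far smaller than the inline one, but whatever state the parent itself holds still sits wherever the parent future is polled. That matches the description above: boxing the inner future removes only the inner future's share.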

In CI this shows up as Playwright lane 1/4: chat-harness-subagent times out (the orchestrator's final text never renders because the core died mid-turn), then ~26 subsequent specs cascade-fail with ECONNREFUSED 127.0.0.1:17788 — they share the now-dead core (workers: 1, no auto-restart). So "27 failed" is really 1 crash + 26 collateral.

This is not test-only: any real deployment (desktop-via-CLI, docker, cloud) can hit the same abort on a legitimate sub-agent delegation with a large tool surface.

Solution

Set thread_stack_size(16 * 1024 * 1024) on the serve runtime so a subagent-nested agent turn fits comfortably:

// Nested agent turns overflow the default 2 MiB worker stack; reserve 16 MiB per worker.
let rt = tokio::runtime::Builder::new_multi_thread()
    .enable_all()
    .thread_stack_size(16 * 1024 * 1024)
    .build()?;
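The same idea can be tried without tokio. The sketch below is a std-only illustration, not the PR's code: `nested_turn` and `run_with_stack` are hypothetical names, and the 1 KiB frame is an arbitrary stand-in for the state a real agent turn keeps on the stack. It runs a deep call chain on a thread whose stack size is set explicitly through `std::thread::Builder::stack_size`.

```rust
use std::hint::black_box;
use std::thread;

/// Stack size mirroring the value the PR sets on the runtime builder.
const WORKER_STACK_BYTES: usize = 16 * 1024 * 1024;

/// Hypothetical stand-in for one nested "turn": each call keeps a 1 KiB
/// buffer in its frame while the next level runs.
fn nested_turn(depth: usize) -> usize {
    // black_box keeps the compiler from discarding the buffer entirely.
    let frame = black_box([1u8; 1024]);
    if depth == 0 {
        return frame[0] as usize;
    }
    nested_turn(depth - 1) + frame[0] as usize
}

/// Runs `nested_turn` on a dedicated thread with an explicit stack size
/// and returns its result.
fn run_with_stack(stack_bytes: usize, depth: usize) -> usize {
    thread::Builder::new()
        .name("big-stack-worker".into())
        .stack_size(stack_bytes)
        .spawn(move || nested_turn(depth))
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}
```

Calling `run_with_stack(WORKER_STACK_BYTES, 3000)` nests 3001 frames. In an unoptimized build that is roughly 3 MiB of buffers alone, which would not fit a 2 MiB stack. Actual frame sizes depend on the compiler and optimization level, so treat the numbers as illustrative.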

Verification (local, real stack, isolated ports)

Built the standalone core + web bundle, ran the real openhuman-core + mock backend + web host on isolated ports (core 27788, mock 28473, web 4273), and drove the chat-harness-subagent spec:

	Before	After
Spec	✘ fails (~52s, canary never renders)	✓ passes (7.4s)
Core process	`Abort trap: 6` — `overflowed its stack`	stays alive; no overflow in core.log

With the core no longer dying, the ECONNREFUSED cascade across lane 1/4 disappears.

Submission Checklist

Tests added or updated — N/A: covered by the existing chat-harness-subagent Playwright spec, which now passes; this is a runtime-config fix with no new unit-testable surface.
Diff coverage ≥ 80% — N/A: the changed lines are tokio runtime-builder config, exercised by the e2e lane (not unit-coverable).
Coverage matrix updated — N/A: no feature row added/removed/renamed.
All affected feature IDs listed under ## Related — N/A.
No new external network dependencies introduced.
Manual smoke checklist updated if this touches release-cut surfaces — N/A.
Linked issue closed via Closes #NNN — N/A: no tracking issue.

Impact

Platform: the standalone/CLI/JSON-RPC server (openhuman-core run), used by the Playwright web E2E lane, docker, and cloud deployments.
Memory: +14 MiB of reserved (not committed) stack per worker thread.
CI: fixes Playwright lane 1/4 (and removes the cascade that masked it).
Product: prevents a real core crash on sub-agent delegation under a large tool surface.
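
The memory figure above is 16 MiB minus the 2 MiB default, per worker. As a rough worked example, the sketch below scales that by the worker count. It assumes tokio's documented default of one worker per available core; the helper names are hypothetical and the result is reserved address space, not committed memory.

```rust
const MIB: usize = 1024 * 1024;
/// Default per-worker stack, as stated in the PR description.
const DEFAULT_STACK_BYTES: usize = 2 * MIB;
/// Per-worker stack after this change.
const NEW_STACK_BYTES: usize = 16 * MIB;

/// Additional reserved stack across `workers` worker threads, in bytes.
fn extra_reserved_bytes(workers: usize) -> usize {
    workers * (NEW_STACK_BYTES - DEFAULT_STACK_BYTES)
}

/// Assumed worker count: one per available core, falling back to 1
/// if the platform cannot report parallelism.
fn default_worker_count() -> usize {
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}
```

On an 8-core machine this works out to 8 × 14 MiB = 112 MiB of additional reserved stack.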

Related

Follow-up to feat(agent): cap runtime subagent spawn depth at MAX_SPAWN_DEPTH=3 #2234 (the inner-future Box::pin mitigation, insufficient on its own).
Follow-up issue In-process desktop (Tauri) core runtime still uses the default 2 MiB worker stack — subagent delegation can crash it #3159 — the in-process desktop (Tauri) core runs on a separate tokio runtime that still uses the default 2 MiB worker stack and is exposed to the same crash. Tracked there.
Closes:
Related: In-process desktop (Tauri) core runtime still uses the default 2 MiB worker stack — subagent delegation can crash it #3159 (desktop Tauri runtime, same crash on a different codepath)

Summary by CodeRabbit

Bug Fixes
- Fixed process crashes that could occur during complex nested operations by improving internal stack management for better reliability.

… subagent delegation

The standalone `openhuman-core run` server builds its tokio runtime with the default 2 MiB per-worker-thread stack. A single agent turn is already a very large async state machine (system prompt + hundreds of tool specs + the nested provider/tool loop); delegating to a sub-agent runs another full turn one level down. Even with the inner sub-agent future boxed (`subagent_runner::ops`, see tinyhumansai#2234), that nesting overflows the 2 MiB stack and aborts the whole process:

thread 'tokio-rt-worker' (...) has overflowed its stack
fatal runtime error: stack overflow, aborting (SIGABRT)

This takes the JSON-RPC server down mid-request. In the Playwright web E2E lane it manifests as `chat-harness-subagent` timing out (the orchestrator's final text never renders) followed by a cascade of `ECONNREFUSED` failures across every subsequent spec in the worker, because they all share the now-dead core.

Set `thread_stack_size(16 MiB)` on the serve runtime so a subagent-nested agent turn fits comfortably.

Reproduced and verified locally by driving the real `openhuman-core` + mock + web stack on isolated ports and running the subagent spec:

- before: core aborts ("overflowed its stack"), spec fails at ~52s
- after: core stays alive, spec passes in 7.4s, no overflow in core.log

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-01T14:41:20Z

Walkthrough

The PR increases the per-worker thread stack size in the Tokio multi-threaded runtime from the default 2 MiB to 16 MiB. A comment documents that delegating to nested "agent turns" in async contexts can cause stack overflow, necessitating this runtime configuration adjustment in run_server_command.

Changes

Tokio Stack Configuration

Layer / File(s)	Summary
Runtime stack size configuration `src/core/cli.rs`	The Tokio runtime builder configures `thread_stack_size(16 * 1024 * 1024)` with an explanatory comment describing that nested agent turns can overflow the default 2 MiB worker-thread stack.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

tinyhumansai/openhuman#2069: Both PRs directly address Tokio worker stack overflow by increasing the multi-thread runtime's per-worker stack size (thread_stack_size)—this PR updates run_server_command in src/core/cli.rs, while the related PR sets a custom runtime in app/src-tauri/src/lib.rs::run() with a larger stack.

Suggested labels

working

Poem

🐰 A rabbit built a stack so deep,
With 16 megs to safely keep,
Agent turns would overflow before,
Now the threads can ask for more! 🏗️

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: increasing worker stack size to 16 MiB to prevent crashes from subagent delegation in the JSON-RPC server.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

graycyrus

@sanil-23 hey — code looks correct and the diagnosis is solid. The thread_stack_size(16 MiB) fix is the right call here; async state machines that deep will overflow 2 MiB once you add a second nesting level regardless of the inner Box::pin.

One thing worth following up on before this goes in: the PR notes the in-process desktop (Tauri) runtime doesn't get this same fix and is still exposed. That should be a filed issue before merge so it doesn't get lost — it's the same crash risk on a different codepath. If you can drop an issue number in the PR description or "Related" section that'd be ideal.

CI is failing (Rust Core Coverage) and Frontend Coverage is still pending. Once those are green I'll come back and approve. Let me know if you need anything.

sanil-23 · 2026-06-01T15:49:38Z

Thanks @graycyrus! Filed #3159 for the in-process desktop (Tauri) runtime exposure and linked it in the description's Related section. The failing Rust Core Coverage is a pre-existing chain unrelated to this diff — being addressed in #3156. Frontend Coverage and lane 1/4 are green here.

sanil-23 requested a review from a team June 1, 2026 14:41

coderabbitai Bot added the working label (A PR that is being worked on by the team.) Jun 1, 2026

coderabbitai Bot approved these changes Jun 1, 2026

graycyrus reviewed Jun 1, 2026

sanil-23 mentioned this pull request Jun 1, 2026

In-process desktop (Tauri) core runtime still uses the default 2 MiB worker stack — subagent delegation can crash it #3159

Closed

Merge remote-tracking branch 'upstream/main' into pr/3155

63bf0f4

sanil-23 closed this Jun 1, 2026

YOMXXX mentioned this pull request Jun 2, 2026

fix(runtime): apply 16 MiB worker stack to desktop core + agent CLI runtimes (#3159) #3175

Merged

4 tasks

sanil-23 wants to merge 2 commits into tinyhumansai:main from sanil-23:fix/subagent-stack-overflow

3 participants