
fix(tools): Add browser crash detection and automatic session recovery #2738

Open
VascoSch92 wants to merge 6 commits into main from fix/browser-crash-recovery


Conversation

VascoSch92 (Contributor) commented Apr 7, 2026

Summary

  • Adds crash detection to BrowserToolExecutor: after 3 consecutive action failures, the browser session is automatically reset (_initialized = False) so the next action re-creates it
  • Uses a shorter timeout (30s instead of 300s) after a failure to avoid long cascading waits against a dead browser
  • Resets the failure counter on success
  • Includes recovery context in error messages so the agent knows the browser was restarted

Previously, when the browser crashed, every subsequent action would block for 300 seconds before timing out, effectively making the agent "stuck". Now the executor detects the pattern and recovers automatically.
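
The behavior described above can be sketched as a small state machine. This is a minimal, self-contained illustration and not the code in this PR: `RecoveringExecutor`, `run_action`, `current_timeout`, and the constant names are hypothetical. Only the thresholds (3 failures, 30s instead of 300s) and the `_initialized` flag come from the description. Resetting the counter when the session is reset is an assumption.

```python
from typing import Callable

DEFAULT_TIMEOUT_S = 300.0      # normal per-action timeout (from the PR description)
DEGRADED_TIMEOUT_S = 30.0      # shorter timeout used after a failure
MAX_CONSECUTIVE_FAILURES = 3   # failures in a row before the session is reset


class RecoveringExecutor:
    """Hypothetical stand-in for the recovery logic in BrowserToolExecutor."""

    def __init__(self) -> None:
        self._initialized = False
        self._consecutive_failures = 0
        self.sessions_created = 0

    def _ensure_session(self) -> None:
        # Lazily (re-)create the session whenever it is marked uninitialized.
        if not self._initialized:
            self.sessions_created += 1
            self._initialized = True

    def current_timeout(self) -> float:
        # After a failure, avoid waiting the full timeout against a dead browser.
        if self._consecutive_failures > 0:
            return DEGRADED_TIMEOUT_S
        return DEFAULT_TIMEOUT_S

    def run_action(self, action: Callable[[float], str]) -> str:
        self._ensure_session()
        try:
            result = action(self.current_timeout())
        except Exception as exc:
            self._consecutive_failures += 1
            if self._consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                # Mark the session dead; the next action re-creates it.
                self._initialized = False
                # Assumption: the counter starts over with the new session.
                self._consecutive_failures = 0
                return f"Error: {exc} (browser session was reset and will be restarted)"
            return f"Error: {exc}"
        # Any success clears the failure counter.
        self._consecutive_failures = 0
        return result
```

In this sketch the recovery note is appended to the error text on the third failure, so the caller can tell a plain failure from one that triggered a restart.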

Fixes #2412

Test plan

  • 3 new tests: test_issue_2412_consecutive_failures_reset_session, test_issue_2412_success_resets_failure_counter, test_issue_2412_degraded_timeout_after_failure
  • All 59 existing browser unit tests pass
  • Manual test: start a browser task, kill chrome, verify agent recovers

🤖 Generated with Claude Code


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:4bcd5e7-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-4bcd5e7-python \
  ghcr.io/openhands/agent-server:4bcd5e7-python

All tags pushed for this build

ghcr.io/openhands/agent-server:4bcd5e7-golang-amd64
ghcr.io/openhands/agent-server:4bcd5e7-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:4bcd5e7-golang-arm64
ghcr.io/openhands/agent-server:4bcd5e7-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:4bcd5e7-java-amd64
ghcr.io/openhands/agent-server:4bcd5e7-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:4bcd5e7-java-arm64
ghcr.io/openhands/agent-server:4bcd5e7-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:4bcd5e7-python-amd64
ghcr.io/openhands/agent-server:4bcd5e7-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:4bcd5e7-python-arm64
ghcr.io/openhands/agent-server:4bcd5e7-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:4bcd5e7-golang
ghcr.io/openhands/agent-server:4bcd5e7-java
ghcr.io/openhands/agent-server:4bcd5e7-python
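
The tag list above appears to follow a regular pattern: a short commit SHA, then either the variant name or a sanitized base image name, then an optional architecture suffix. The sketch below reproduces that pattern. It is inferred from the listing only, not taken from the build scripts, and `sanitize_base_image` and `build_tags` are hypothetical helper names.

```python
REPO = "ghcr.io/openhands/agent-server"


def sanitize_base_image(base_image: str) -> str:
    """Turn 'repo/name:tag' into a string usable inside a Docker tag.

    Inferred from the listing: '/' becomes '_s_' and ':' becomes '_tag_'.
    """
    return base_image.replace("/", "_s_").replace(":", "_tag_")


def build_tags(sha: str, variant: str, base_image: str, arches: list[str]) -> list[str]:
    """List the tags one variant would get under the inferred naming scheme."""
    base = sanitize_base_image(base_image)
    tags: list[str] = []
    for arch in arches:
        tags.append(f"{REPO}:{sha}-{variant}-{arch}")  # per-arch tag, variant name
        tags.append(f"{REPO}:{sha}-{base}-{arch}")     # per-arch tag, base image name
    tags.append(f"{REPO}:{sha}-{variant}")             # multi-arch manifest tag
    return tags


if __name__ == "__main__":
    for tag in build_tags(
        "4bcd5e7",
        "python",
        "nikolaik/python-nodejs:python3.13-nodejs22-slim",
        ["amd64", "arm64"],
    ):
        print(tag)
```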

About Multi-Architecture Support

  • Each variant tag (e.g., 4bcd5e7-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 4bcd5e7-python-amd64) are also available if needed

When the browser crashes, BrowserToolExecutor now detects consecutive
failures and automatically resets the session instead of looping
on the dead connection with 300-second timeouts.

Changes:
- Track consecutive action failures in BrowserToolExecutor
- After 3 consecutive failures, set _initialized=False to trigger
  session re-creation on the next action
- Use a shorter timeout (30s) after a failure to avoid long waits
  against a potentially dead browser
- Reset failure counter on success
- Include recovery context in error messages so the agent knows
  the browser was restarted

Fixes #2412

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot (Contributor) commented Apr 7, 2026

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

github-actions bot (Contributor) commented Apr 7, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

@VascoSch92 VascoSch92 requested a review from all-hands-bot April 7, 2026 10:18
all-hands-bot (Collaborator) left a comment

Pragmatic solution to a real problem. The recovery logic is straightforward and well-tested. However, this changes tool execution behavior (timeouts and automatic recovery), which per repo guidelines requires eval verification before approval. Please run lightweight evals to ensure no unexpected benchmark impact.

VascoSch92 changed the title from "Add browser crash detection and automatic session recovery" to "fix(tools): Add browser crash detection and automatic session recovery" Apr 7, 2026
github-actions bot (Contributor) commented Apr 7, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-tools/openhands/tools/browser_use/impl.py | 274 | 172 | 37% | 57–61, 63–64, 66, 68, 70–73, 75–76, 78, 80, 87, 105–108, 114–115, 120, 122–124, 126–127, 135–137, 139–143, 148, 199, 204–207, 209, 231–233, 236–238, 240, 253, 290–291, 295, 305, 320–321, 326, 343, 349–350, 355, 360, 366–367, 371–372, 378–379, 387–390, 395–397, 404, 412, 430–431, 433–446, 449–462, 464–465, 471, 476–479, 487, 489, 492–493, 499–500, 505–506, 512–513, 517–518, 522–523, 527, 529–530, 532–535, 538–539, 545, 547, 549, 557–558, 562–563, 568–569, 573–574, 578–579, 584–585, 597–598, 609–610, 614–618, 632–634, 639, 644–645, 654–655 |
| TOTAL | 22059 | 11102 | 49% | |

VascoSch92 and others added 3 commits April 7, 2026 12:24
…mport

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Attempt to clean up the old browser process/session before setting
_initialized=False. Uses a short 5s timeout since the browser may
be crashed or wedged. Failures are caught and logged at debug level
since they're expected when the browser is truly dead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
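
The best-effort cleanup described in the commit above (short timeout, failures swallowed and logged at debug level) could look roughly like this. It is a sketch under assumptions, not the PR's code: `best_effort_cleanup` and `close_session` are hypothetical names, and only the 5s timeout and the debug-level logging come from the commit message.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

CLEANUP_TIMEOUT_S = 5.0  # short, since the browser may be crashed or wedged


async def best_effort_cleanup(
    close_session: Callable[[], Awaitable[None]],
    timeout: float = CLEANUP_TIMEOUT_S,
) -> bool:
    """Try to close the old session; never raise.

    Returns True if the session closed cleanly within the timeout.
    """
    try:
        await asyncio.wait_for(close_session(), timeout=timeout)
        return True
    except Exception as exc:  # also covers the timeout raised by wait_for
        # Expected when the browser is truly dead, so keep it at debug level.
        logger.debug("Browser cleanup failed: %r", exc)
        return False
```

Catching `Exception` rather than `BaseException` leaves task cancellation untouched, so a caller that cancels the surrounding task is not blocked by the cleanup.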
@VascoSch92 VascoSch92 requested a review from all-hands-bot April 7, 2026 10:36
all-hands-bot (Collaborator) left a comment

Taste Rating: 🟡 Acceptable - Pragmatic solution to a real problem, but the design conflates browser crashes with legitimate action errors.

Verdict: ❌ Needs eval verification - Per the previous review and repo guidelines, changes to tool execution behavior (timeouts, automatic recovery) require lightweight eval runs before approval. Please run evals to verify no unexpected benchmark impact.

Regular action errors (invalid selector, missing element) are normal
agent mistakes, not browser crashes. Only timeouts indicate a
potentially dead/hung browser and should count toward session reset.

Also makes degraded timeout less aggressive: kicks in after 2+
consecutive timeouts instead of after any single failure.

Adds test for action errors not triggering reset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
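
The refinement in the commit above separates timeouts from ordinary action errors. A minimal sketch of that bookkeeping follows. `Outcome` and `TimeoutTracker` are hypothetical names, the reset threshold of 3 is carried over from the PR summary on the assumption that it did not change, and leaving the counter untouched on an action error is also an assumption.

```python
from enum import Enum, auto

DEFAULT_TIMEOUT_S = 300.0
DEGRADED_TIMEOUT_S = 30.0
DEGRADED_AFTER = 2  # consecutive timeouts before the shorter timeout applies
RESET_AFTER = 3     # consecutive timeouts before the session is reset


class Outcome(Enum):
    SUCCESS = auto()
    ACTION_ERROR = auto()  # e.g. invalid selector, missing element
    TIMEOUT = auto()       # the browser may be dead or hung


class TimeoutTracker:
    """Counts consecutive timeouts only; action errors are agent mistakes."""

    def __init__(self) -> None:
        self.consecutive_timeouts = 0

    def next_timeout(self) -> float:
        if self.consecutive_timeouts >= DEGRADED_AFTER:
            return DEGRADED_TIMEOUT_S
        return DEFAULT_TIMEOUT_S

    def record(self, outcome: Outcome) -> bool:
        """Record an outcome; return True if the session should be reset."""
        if outcome is Outcome.SUCCESS:
            self.consecutive_timeouts = 0
        elif outcome is Outcome.TIMEOUT:
            self.consecutive_timeouts += 1
            if self.consecutive_timeouts >= RESET_AFTER:
                self.consecutive_timeouts = 0
                return True
        # Assumption: ACTION_ERROR neither increments nor clears the counter.
        return False
```

With these thresholds, a single slow page load keeps the full 300s timeout, and only a second consecutive timeout switches the next action to 30s.
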
@VascoSch92 VascoSch92 requested a review from all-hands-bot April 7, 2026 10:49
all-hands-bot (Collaborator) left a comment

Pragmatic solution to a real problem. The recovery logic is straightforward and well-tested. However, this changes tool execution behavior (timeouts and automatic recovery), which per repo guidelines requires eval verification before approval. Please run lightweight evals to ensure no unexpected benchmark impact.

VascoSch92 (Contributor, Author) commented

@OpenHands fix the pre-commit check

openhands-ai bot commented Apr 7, 2026

I'm on it! VascoSch92 can track my progress at all-hands.dev

Co-authored-by: openhands <openhands@all-hands.dev>
openhands-ai bot commented Apr 7, 2026

Summary

I fixed the pre-commit check failure on PR #2738.

What was done:

  • Root cause: The ruff-format pre-commit hook was failing because of a formatting issue in tests/tools/browser_use/test_browser_executor.py
  • Fix applied: Changed a 3-line function call to a single line (line 333) as required by ruff-format:
    # Before
    error_result = BrowserObservation.from_text(
        text="Element not found", is_error=True
    )
    
    # After
    error_result = BrowserObservation.from_text(text="Element not found", is_error=True)
  • Committed and pushed: Commit bf0ed5dd with the fix

Checklist:

all-hands-bot (Collaborator) commented

[Automatic Post]: It has been a while since there was any activity on this PR. @VascoSch92, are you still working on it? If so, please go ahead. If not, please request review, close it, or ask someone else to follow up.

Development

Successfully merging this pull request may close these issues.

[Bug]: Agent fails to recover from browser crash during long-running tasks

3 participants