fix(tools): Add browser crash detection and automatic session recovery #2738
VascoSch92 wants to merge 6 commits into main
When the browser crashes, BrowserToolExecutor now detects consecutive failures and automatically resets the session instead of looping on the dead connection with 300-second timeouts.

Changes:
- Track consecutive action failures in BrowserToolExecutor
- After 3 consecutive failures, set _initialized=False to trigger session re-creation on the next action
- Use a shorter timeout (30s) after a failure to avoid long waits against a potentially dead browser
- Reset failure counter on success
- Include recovery context in error messages so the agent knows the browser was restarted

Fixes #2412

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
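The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `RecoveringExecutor`, its `run` method, the constant names, and the wording of the error strings are all hypothetical, and clearing the counter at the moment of reset is an assumption. Only the thresholds (3 failures, 30s versus 300s) and the `_initialized` flag come from the description.

```python
from typing import Any, Callable

DEFAULT_TIMEOUT = 300.0   # seconds, normal per-action timeout
DEGRADED_TIMEOUT = 30.0   # seconds, used after a failure
MAX_FAILURES = 3          # consecutive failures before a reset


class RecoveringExecutor:
    """Hypothetical stand-in for BrowserToolExecutor; not the SDK class."""

    def __init__(self, action: Callable[[float], Any]) -> None:
        self._action = action            # callable that receives the timeout
        self._initialized = True
        self._consecutive_failures = 0

    def _timeout(self) -> float:
        # Shorter timeout while the browser is suspected to be dead.
        if self._consecutive_failures:
            return DEGRADED_TIMEOUT
        return DEFAULT_TIMEOUT

    def run(self) -> str:
        if not self._initialized:
            # A real executor would re-create the browser session here.
            self._initialized = True
        try:
            result = self._action(self._timeout())
        except Exception as exc:
            self._consecutive_failures += 1
            if self._consecutive_failures >= MAX_FAILURES:
                # Mark the session for re-creation on the next action.
                self._initialized = False
                self._consecutive_failures = 0  # assumption: start fresh
                return f"error: {exc} (browser session was reset; retry the action)"
            return f"error: {exc}"
        self._consecutive_failures = 0  # success clears the counter
        return f"ok: {result}"
```

In this sketch the third failing call returns an error that mentions the reset, and the next call re-initializes before running the action with the normal timeout again.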
Python API breakage checks: ✅ PASSED
REST API breakage checks (OpenAPI): ✅ PASSED
all-hands-bot left a comment:
Pragmatic solution to a real problem. The recovery logic is straightforward and well-tested. However, this changes tool execution behavior (timeouts and automatic recovery), which per repo guidelines requires eval verification before approval. Please run lightweight evals to ensure no unexpected benchmark impact.
…mport

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Attempt to clean up the old browser process/session before setting _initialized=False. The cleanup uses a short 5s timeout since the browser may be crashed or wedged. Cleanup failures are caught and logged at debug level, since they are expected when the browser is truly dead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
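A best-effort cleanup of this kind might look like the sketch below. It assumes an async cleanup coroutine and uses `asyncio.wait_for` for the 5s bound; `reset_session`, the `state` dict, and the log message are hypothetical names chosen for illustration and are not taken from the PR.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)
CLEANUP_TIMEOUT = 5.0  # seconds; short because the browser may be wedged


async def reset_session(state: dict, cleanup, timeout: float = CLEANUP_TIMEOUT) -> None:
    """Try to close the old session, then mark it for re-creation."""
    try:
        await asyncio.wait_for(cleanup(), timeout=timeout)
    except Exception as exc:  # includes the timeout raised by wait_for
        # Expected when the browser is truly dead, so keep it quiet.
        logger.debug("Browser cleanup failed: %s", exc)
    finally:
        # The reset happens whether or not cleanup succeeded.
        state["initialized"] = False
```

The `finally` clause is the design point here: a hung or failing cleanup never blocks the reset itself.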
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
all-hands-bot left a comment:
Taste Rating: 🟡 Acceptable - Pragmatic solution to a real problem, but the design conflates browser crashes with legitimate action errors.
Verdict: ❌ Needs eval verification - Per the previous review and repo guidelines, changes to tool execution behavior (timeouts, automatic recovery) require lightweight eval runs before approval. Please run evals to verify no unexpected benchmark impact.
Regular action errors (invalid selector, missing element) are normal agent mistakes, not browser crashes. Only timeouts indicate a potentially dead or hung browser, so only timeouts count toward a session reset. This commit also makes the degraded timeout less aggressive: it kicks in after 2+ consecutive timeouts instead of after any single failure. Adds a test verifying that action errors do not trigger a reset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
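The revised policy can be illustrated with a small tracker that counts only timeouts. This is a sketch under assumptions: `TimeoutTracker` and its methods are hypothetical, the reset threshold of 3 is carried over from the original description, and leaving the counter unchanged on an ordinary action error is a guess, since the commit message does not say how such errors affect it.

```python
from typing import Optional

DEFAULT_TIMEOUT = 300.0  # seconds
DEGRADED_TIMEOUT = 30.0  # seconds
DEGRADED_AFTER = 2       # consecutive timeouts before shortening the timeout
RESET_AFTER = 3          # assumption: same threshold as the original change


class TimeoutTracker:
    """Hypothetical helper; only timeouts count toward a session reset."""

    def __init__(self) -> None:
        self.consecutive_timeouts = 0

    def next_timeout(self) -> float:
        if self.consecutive_timeouts >= DEGRADED_AFTER:
            return DEGRADED_TIMEOUT
        return DEFAULT_TIMEOUT

    def record(self, error: Optional[Exception]) -> bool:
        """Record an action outcome; return True if the session should reset."""
        if isinstance(error, TimeoutError):
            self.consecutive_timeouts += 1
            if self.consecutive_timeouts >= RESET_AFTER:
                self.consecutive_timeouts = 0
                return True
            return False
        if error is None:
            # Success clears the counter.
            self.consecutive_timeouts = 0
        # Other errors (bad selector, missing element) are agent mistakes
        # and leave the counter untouched in this sketch.
        return False
```

With this policy a single slow action still gets the full 300s on the next attempt, and an invalid selector never moves the executor toward a reset.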
all-hands-bot left a comment:
Pragmatic solution to a real problem. The recovery logic is straightforward and well-tested. However, this changes tool execution behavior (timeouts and automatic recovery), which per repo guidelines requires eval verification before approval. Please run lightweight evals to ensure no unexpected benchmark impact.
@OpenHands fix the pre-commit check
I'm on it! VascoSch92 can track my progress at all-hands.dev
Co-authored-by: openhands <openhands@all-hands.dev>
Summary
I fixed the pre-commit check failure on PR #2738.
What was done:
Checklist:
[Automatic Post]: It has been a while since there was any activity on this PR. @VascoSch92, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.
Summary
- BrowserToolExecutor: after 3 consecutive action failures, the browser session is automatically reset (_initialized = False) so the next action re-creates it

Previously, when the browser crashed, every subsequent action would block for 300 seconds before timing out, effectively making the agent "stuck". Now the executor detects the pattern and recovers automatically.
Fixes #2412
Test plan
- test_issue_2412_consecutive_failures_reset_session
- test_issue_2412_success_resets_failure_counter
- test_issue_2412_degraded_timeout_after_failure

🤖 Generated with Claude Code
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22-slim
- golang:1.21-bookworm

Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:4bcd5e7-python

Run
All tags pushed for this build
About Multi-Architecture Support
- Each variant tag (e.g., 4bcd5e7-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g., 4bcd5e7-python-amd64) are also available if needed