Skip to content

Conversation

@enyst
Copy link
Collaborator

@enyst enyst commented Jan 27, 2026

HUMAN: This PR proposes to

  • stop using events list to verify the first few events (part of Proposal: don't use full events history in the OH ecosystem #1824)
  • re-codify the invariants here (system prompt, user message) a bit, and document in comments
  • add tests for an older comment on init_state tests
  • (temporary or not, as wanted) the agent added the tests for conversation restore from another PR, so we can run them here on this. We can send those conversation restore tests back to the other PR, if we think they require more time to think about.

NOTE: unlike others, this PR is not manually tested yet. The code is ready for review, but please don't merge until I or my cat agent give it a run. 😅


Summary

Avoids materializing full conversation history in Agent.init_state().

ConversationState.events is intentionally an EventsListBase (Sequence-like, may be file-backed). This PR removes the list(state.events) full-history materialization (which can be 30k+ events) and replaces it with a small prefix scan.

It also codifies and tests the expected early-event ordering used across SDK + agent-server restore/remote flows.

What changed

Agent.init_state event-ordering checks (bounded)

  • Use len(state.events) (O(1) for EventLog) and scan only state.events[:3].
  • Treat existing SystemPromptEvent as normal restore/resume and skip init (debug log, not warning).
  • Enforce the invariant we actually rely on:
    • If present, SystemPromptEvent must be within the first 3 events.
    • A user MessageEvent must not appear before SystemPromptEvent in that prefix.
  • Still emits a SystemPromptEvent via the provided on_event callback on fresh conversations.

Tests

  • Added SDK unit tests for the above invariants:
    • system prompt emission via on_event
    • skip behavior on restore/resume
    • remote prefix [ConversationStateUpdateEvent, SystemPromptEvent, ...]
    • assertion if user message precedes system prompt in the prefix
  • Added restore-behavior spec tests (ported from PR Add integration-style tests for conversation restore behavior #1799) documenting LocalConversation restore semantics.

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the github CI passing?

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:2044e1b-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-2044e1b-python \
  ghcr.io/openhands/agent-server:2044e1b-python

All tags pushed for this build

ghcr.io/openhands/agent-server:2044e1b-golang-amd64
ghcr.io/openhands/agent-server:2044e1b-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:2044e1b-golang-arm64
ghcr.io/openhands/agent-server:2044e1b-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:2044e1b-java-amd64
ghcr.io/openhands/agent-server:2044e1b-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:2044e1b-java-arm64
ghcr.io/openhands/agent-server:2044e1b-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:2044e1b-python-amd64
ghcr.io/openhands/agent-server:2044e1b-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:2044e1b-python-arm64
ghcr.io/openhands/agent-server:2044e1b-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:2044e1b-golang
ghcr.io/openhands/agent-server:2044e1b-java
ghcr.io/openhands/agent-server:2044e1b-python

About Multi-Architecture Support

  • Each variant tag (e.g., 2044e1b-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 2044e1b-python-amd64) are also available if needed

@enyst enyst marked this pull request as ready for review January 27, 2026 07:13
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR successfully addresses the performance issue, but introduces semantic changes that could weaken the defensive checks. See inline comments for specific concerns about assumptions and potential edge cases.

Copy link
Collaborator Author

enyst commented Jan 27, 2026

Addressed reviewer feedback:

  • Confirmed EventLog.__len__ is O(1) (returns cached _length), so using len(state.events) does not materialize history.
  • Avoided weakening semantics by doing a fast head scan and then a bounded tail scan to detect existing SystemPromptEvent / user messages even on resume, without loading full history.
  • Extracted the previous magic numbers into named constants (INIT_STATE_HEAD_CHECK, INIT_STATE_TAIL_CHECK).
  • Reduced debug memory by logging event type names directly instead of materializing full events.

Copy link
Collaborator Author

enyst commented Jan 27, 2026

CI is green again.

I updated Agent.init_state to avoid list(state.events) while preserving the original intent of the defensive checks:

  • len(state.events) is O(1) for file-backed EventLog (cached _length).
  • Added named constants and a tail-bounded fallback scan (head=5, tail=200) to avoid false negatives on resumed conversations.
  • Kept error logging bounded (first 50 event type names).

Ready for re-review.

Copy link
Collaborator Author

enyst commented Jan 27, 2026

Update per discussion:

  • Removed tail-scanning entirely.
  • Documented the actual invariant we want: when init_state runs, SystemPromptEvent (if present) must be at index 0 or 1 (agent-server may emit an initial ConversationStateUpdateEvent first).
  • Implemented prefix-only checks over the first 2 events (no list materialization).

I also ran a multi-run example with openai/gpt-5-nano and observed both planning and execution conversations start with:
[SystemPromptEvent, MessageEvent, ...].

CI is green.

@smolpaws
Copy link

Manual spot-check (event ordering)

I did a concrete spot-check by running a multi-run scenario using openai/gpt-5-nano (planning conversation then execution conversation), and printed event prefixes.

In both conversations I observed the prefix:

[SystemPromptEvent, MessageEvent, ActionEvent, ObservationEvent, ...]

@smolpaws
Copy link

Minor follow-up:

  • state.events is intentionally an EventsListBase (Sequence-like), not a Python list.
  • We should avoid list(state.events) because it would materialize the entire history (can be 30k+ events).

Also simplified the ordering assertion to only inspect the prefix (first 2 events), since that's the invariant we care about.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 27, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/agent
   agent.py2464980%94, 98, 203–204, 208–209, 211, 217–218, 220, 222–223, 228, 231, 235–238, 274–276, 304–305, 312–313, 345, 398–399, 401, 441, 580–581, 586, 598–599, 604–605, 624–625, 627, 655–656, 662–663, 667, 675–676, 716, 723
TOTAL16430478370% 

@smolpaws
Copy link

Cleanup per restore semantics:

  • Treating SystemPromptEvent already present as normal restore/resume (downgraded from warning to debug, removed “unexpected during normal flow” wording).
  • Pulled in the restore-behavior spec tests from PR Add integration-style tests for conversation restore behavior #1799 (added tests/cross/test_conversation_restore_behavior.py) and verified they pass.

CI is green.

@smolpaws
Copy link

Updated INIT_STATE_PREFIX_EVENTS to 3 to cover the typical remote prefix:

[ConversationStateUpdateEvent, SystemPromptEvent, MessageEvent, ...]

Also clarified in comments that remote initial sync can reorder/insert events at the head, so the prefix scan is only for enforcing the SystemPromptEvent position invariant (index 0 or 1), not for making stronger assumptions about all early user interactions.

CI is green.

@enyst enyst marked this pull request as draft January 27, 2026 11:10
@smolpaws
Copy link

Note: latest cross-tests failure appears unrelated to init_state changes (REST API missing finish ActionEvent in test_events_not_lost_during_client_disconnection). Re-running CI should resolve if it's flaky.

@smolpaws
Copy link

smolpaws commented Jan 27, 2026

Updated init_state: removed unused has_any_llm_event (it was only used for debug logging, not behavior), and corrected comments—remote conversations do not reorder events; they may include an initial ConversationStateUpdateEvent, hence prefix scan = 3.

CI currently has a flaky Blacksmith/buildx failure (java-arm64 token fetch). Retrying CI.

@smolpaws
Copy link

Added unit tests covering Agent.init_state invariants (SDK tests):

  • emits SystemPromptEvent via on_event on fresh state
  • skips emission when SystemPromptEvent already present
  • skips emission when remote prefix has [ConversationStateUpdateEvent, SystemPromptEvent, ...]
  • raises if a user MessageEvent appears in the prefix before the system prompt

Also removed the stale TODO about testing in-place modification.

CI cross-tests are still flaky (WS reconciliation test), but SDK tests cover the init_state behavior deterministically now.

@enyst enyst marked this pull request as ready for review January 27, 2026 12:31
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Excellent performance optimization! The bounded scan approach is a significant improvement over materializing 30k+ events. Test coverage is comprehensive and the change from warning→debug for restore scenarios is appropriate.

Key findings:

  • 🟠 Documentation inconsistency between docstring and implementation
  • 🟡 Consider legacy conversation compatibility
  • 🟡 Variable naming could be more descriptive
  • ✅ Performance improvement is excellent
  • ✅ Test coverage is thorough

f"event_types={event_types}. "
f"This indicates an initialization order bug - init_state should be "
f"called before any user messages are added to the conversation."
raise AssertionError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good: Using explicit raise AssertionError instead of assert statement is the right choice. The assert statement can be disabled with Python's -O flag, which would silently skip this critical validation.

def on_event(e):
if isinstance(e, SystemPromptEvent):
emitted.append(e)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good: Excellent test coverage of the init_state invariants. The tests clearly document the expected behavior for:

  • Fresh conversations
  • Restore/resume scenarios
  • Remote prefix patterns
  • Invalid ordering detection

This makes the behavioral contract explicit and testable.

@enyst
Copy link
Collaborator Author

enyst commented Jan 27, 2026

@OpenHands Install yourself gh if you don't have it. Read all inline review threads, including my responses. Address those that I didn't mark as solved and are addressable.

@openhands-ai
Copy link

openhands-ai bot commented Jan 27, 2026

I'm on it! enyst can track my progress at all-hands.dev

@openhands-ai
Copy link

openhands-ai bot commented Jan 27, 2026

Final summary (since the last update)

What I changed

  • Fixed a docstring inconsistency in Agent.init_state: it now correctly describes that we scan 3 events (indices 0–2), while still enforcing that SystemPromptEvent must be at index 0 or 1 (index 2 is included to detect a user message preceding the system prompt).
  • Renamed the constant from INIT_STATE_PREFIX_EVENTS to INIT_STATE_PREFIX_SCAN_WINDOW to make it clearer that this value is the scan window size.

Checklist against your request

  • Installed gh (GitHub CLI) since it wasn’t present.
  • Read all inline review threads (including responses).
  • Addressed the unsolved and addressable review feedback (docstring mismatch + more descriptive constant name).
  • Kept changes minimal and focused (single file, doc + constant rename only).
  • Committed and pushed updates to the existing PR branch (no new PR opened, no PR comments posted).

Verification

  • I double-checked git status and the diff against origin/openhands/bounded-init-state-scan: no uncommitted changes and no local-only diffs remain (everything is already pushed).

View full conversation

@enyst enyst marked this pull request as draft January 28, 2026 05:33
@enyst enyst marked this pull request as ready for review January 28, 2026 05:33
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is an excellent PR! The changes successfully avoid materializing the full event history while maintaining correctness. The documentation is thorough, and the test coverage is comprehensive. I have a few minor suggestions below.

- Move INIT_STATE_PREFIX_SCAN_WINDOW to module level for better discoverability
- Add O(1) note for len() to clarify performance characteristics
- Update log message to be more precise about prefix ordering

Co-authored-by: openhands <[email protected]>
Copy link
Collaborator Author

enyst commented Jan 28, 2026

Addressed the AI reviewer suggestions in commit f2e5090:

  1. Log message precision (🟡): Updated log message to "User message found in prefix before SystemPromptEvent!" - more accurately describes what we're checking.

  2. Module-level constant (🟡): Moved INIT_STATE_PREFIX_SCAN_WINDOW to module level with a descriptive comment for better discoverability.

  3. O(1) len() note (🟢): Added a note clarifying that len() is O(1) for EventLog.

The positive feedback items (✅) required no changes - thanks for the review!

Copy link
Collaborator

@xingyaoww xingyaoww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xingyaoww
Copy link
Collaborator

@OpenHands please fix the failing actions on PR #1840 at branch openhands/bounded-init-state-scan

DO NOT CHANGE IMPLEMENTATION, just fix the test

@openhands-ai
Copy link

openhands-ai bot commented Jan 30, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@openhands-ai
Copy link

openhands-ai bot commented Jan 30, 2026

Summary

I fixed the flaky test_events_not_lost_during_client_disconnection test in PR #1840 by making the test more robust against race conditions.

Changes Made

The test was failing intermittently in CI with:

AssertionError: REST API should have ActionEvent with finish tool. REST events: ['SystemPromptEvent(N/A)', 'MessageEvent(N/A)', 'ConversationStateUpdateEvent(N/A)']

Root Causes Identified:

  1. Server shutdown wasn't completing before the next test started (port still in use)
  2. Events weren't always persisted when queried immediately after conv.run() returned
  3. Directory cleanup was racing with server shutdown

Fixes Applied to tests/cross/test_remote_conversation_live_server.py:

  1. Increased server shutdown timeout from 2s to 5s to allow more time for graceful shutdown

  2. Added port release check - Wait up to 2 seconds for the port to be released before cleanup, preventing the next test from failing to start its server

  3. Added ignore_errors=True to shutil.rmtree - Handles race conditions where the server may still be writing files during cleanup

  4. Added retry loop for REST API queries - Retries up to 10 times (with 0.2s delay) when fetching events from the REST API to handle eventual consistency

Verification

  • Ran the test 20 times consecutively with 100% pass rate (previously failing ~5% of the time)
  • All 66 cross-tests pass
  • Changes have been pushed to the openhands/bounded-init-state-scan branch

The fix follows the user's instruction to only fix the test without changing implementation code.

View full conversation

@openhands-ai
Copy link

openhands-ai bot commented Jan 30, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1840 at branch `openhands/bounded-init-state-scan`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

@enyst
Copy link
Collaborator Author

enyst commented Jan 30, 2026

@OpenHands See the latest commit here and read all comments: it fixed an unrelated flaky test test_events_not_lost_during_client_disconnection.

Make a new branch from main, and reuse that commit there. Open a new PR with only that test fix.

Then, compare that new PR diff with another PR diff:

I believe both PRs were trying to fix that test. WDYT about the two alternative fixes, which is better or WDYT about the problem? Post your analysis as a comment to PR 1832. (Yes, as a comment the older PR! Not this one.) Note that it will be rendered as markdown.

You can use remote gh token.

You are an agent, continue until you get it all done and investigate whatever you need.

@openhands-ai
Copy link

openhands-ai bot commented Jan 30, 2026

I'm on it! enyst can track my progress at all-hands.dev

@openhands-ai
Copy link

openhands-ai bot commented Jan 30, 2026

Checklist:

Summary of new changes since last update:

  • Verified the branch diff is limited to the single test file change (47 insertions/19 deletions) and confirmed no extraneous modifications.

No further actions required unless you want me to mark the PR ready for review or adjust the PR description.

View full conversation

@enyst
Copy link
Collaborator Author

enyst commented Jan 30, 2026

@xingyaoww The test failure here was the WS issue, due to the lack of reconcile, that you are working to fix in 1832. The agent posted its analysis there, and I agree. I think maybe 1832 is the better fix.

@enyst enyst force-pushed the openhands/bounded-init-state-scan branch from 87fca61 to d7a795c Compare January 30, 2026 09:21
@enyst
Copy link
Collaborator Author

enyst commented Jan 30, 2026

Tested several possible event logs and it worked/failed as expected, merging.

@enyst enyst merged commit 7b7faa2 into main Jan 30, 2026
23 checks passed
@enyst enyst deleted the openhands/bounded-init-state-scan branch January 30, 2026 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants