-
Notifications
You must be signed in to change notification settings - Fork 140
fix: prevent duplicate events in bash polling via order__gt filtering #1816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This adds tests that demonstrate a bug in RemoteWorkspaceMixin where the polling loop duplicates stdout/stderr output across multiple poll iterations. The bug occurs because: 1. The polling loop fetches ALL events from the beginning on each iteration 2. Events are appended to stdout_parts without deduplication 3. This causes output like: A + B + A + B + C + A + B + C + D This bug causes base64 decoding failures when capturing conversation trajectories, as the duplicated base64 output becomes invalid: - 'Incorrect padding' - 'Invalid base64-encoded string: number of data characters cannot be 1 more than a multiple of 4' The tests document the bug and will help verify the fix. Co-authored-by: openhands <openhands@all-hands.dev>
Coverage Report •
|
|||||||||||||||||||||||||||||||||||
Remove unused variables flagged by ruff. Co-authored-by: openhands <openhands@all-hands.dev>
Tests now assert the CORRECT expected behavior: - Output should be deduplicated across poll iterations - Base64 output should decode correctly These tests FAIL because the bug exists, demonstrating: - Expected: 'CHUNK1CHUNK2CHUNK3' - Actual: 'CHUNK1CHUNK1CHUNK2CHUNK1CHUNK2CHUNK3' When the fix is implemented, these tests will pass. Co-authored-by: openhands <openhands@all-hands.dev>
Added test_base64_decode_produces_incorrect_padding_error which: 1. Simulates the polling loop with duplicated events 2. Calls base64.b64decode() on the corrupted output 3. Fails with: binascii.Error: Incorrect padding This reproduces the exact error seen in production logs during trajectory capture. Co-authored-by: openhands <openhands@all-hands.dev>
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
The bash events search API returns ALL events from the beginning on each poll iteration. Without deduplication, output gets duplicated across polls: - Poll 1: [A] → append A - Poll 2: [A, B] → append A, B (A duplicated!) - Poll 3: [A, B, C] → append A, B, C (A, B duplicated again!) This caused base64 decoding failures in trajectory capture because the duplicated output length was no longer a multiple of 4: - Original: 68 chars (valid) - Duplicated: 119 chars (119 % 4 = 3 → 'Incorrect padding' error) Fix: Track seen event IDs and skip duplicates. Events without an ID are processed without deduplication (backwards compatibility). Co-authored-by: openhands <openhands@all-hands.dev>
Instead of client-side deduplication, add server-side filtering: API changes (bash_router.py, bash_service.py): - Add order__gt query parameter to /api/bash/bash_events/search - Filter BashOutput events where order > order__gt - More efficient: reduces data transfer on each poll Client changes (remote_workspace_mixin.py): - Track last_order seen (starts at -1) - Pass order__gt parameter on subsequent polls - First poll gets all events, subsequent polls get only new ones This is more efficient than client-side deduplication because: - Less data transferred over the network - Server does the filtering instead of client - No need to track seen event IDs Co-authored-by: openhands <openhands@all-hands.dev>
The API should prevent duplicates via order__gt filtering, but add a client-side assertion as a safety check. If the API ever returns a duplicate event, we fail fast with a clear error message rather than silently corrupting output. Also adds test_assertion_fires_on_duplicate_events to verify this behavior. Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found a critical bug with duplicate detection and several other issues that need attention. The main issue is that the duplicate check applies to all event types, but server-side filtering only applies to BashOutput events. See inline comments for details.
openhands-sdk/openhands/sdk/workspace/remote/remote_workspace_mixin.py
Outdated
Show resolved
Hide resolved
openhands-sdk/openhands/sdk/workspace/remote/remote_workspace_mixin.py
Outdated
Show resolved
Hide resolved
openhands-sdk/openhands/sdk/workspace/remote/remote_workspace_mixin.py
Outdated
Show resolved
Hide resolved
|
@OpenHands check the review comments from all-hands-bot and my responses, and fix accordingly. Make sure that CI passes afterwards |
|
I'm on it! neubig can track my progress at all-hands.dev |
- Add kind__eq=BashOutput to params for server-side filtering - Change assert to explicit if/raise RuntimeError for duplicate detection - Move duplicate check inside the BashOutput kind check block - Update error message to mention both order__gt and kind filtering - Add tests for mixed event types (BashCommand + BashOutput) Co-authored-by: openhands <openhands@all-hands.dev>
|
I've addressed all the review comments from all-hands-bot and neubig's responses. The changes have been pushed to the Summary of ChangesCode Changes (
|
| Review Comment | Status |
|---|---|
✅ Add "kind__eq": "BashOutput" to params for server-side filtering |
Done |
✅ Change assert to explicit if/raise RuntimeError for duplicate detection |
Done |
✅ Move duplicate check inside the if event.get("kind") == "BashOutput" block |
Done |
| ✅ Update error message to mention both order__gt and kind filtering | Done |
Test Changes (test_polling_duplicates_output.py)
| Review Comment | Status |
|---|---|
| ✅ Add test for mixed event types (BashCommand + BashOutput) | Done - Added 2 new tests |
New Tests Added
test_mixed_event_types_with_kind_filtering- Verifies that BashCommand events are ignored when mixed with BashOutput events across multiple pollstest_bash_command_events_are_ignored- Verifies that BashCommand events are properly skipped during processing
Test Results
All 84 tests in tests/sdk/workspace/remote/ pass, including the 7 tests in test_polling_duplicates_output.py.
The changes have been pushed to the PR branch and CI should now run the tests.
|
@neubig Having a test reproducing the error and having it fixed makes it highly probable that the PR fixes the issue. We could also trigger a workflow run with that feature branch and make sure the bug doesn't appear again to validate it definitively? |
PR approved but I am curious to see it validated on a real run! |
|
@simonrosenberg , yes I ran it on swe-bench-multimodal and it fixed the issue I was observing there! |
Summary
Fixes output corruption in
RemoteWorkspaceMixin._execute_command_generator()by adding server-side filtering to prevent duplicate events, plus a client-side assertion as a safety check.Problem
How the polling loop works
When executing a remote bash command, the SDK:
POST /api/bash/start_bash_commandto start the commandGET /api/bash/bash_events/search?command_id__eq=...repeatedly until the command completesBashOutputevent and concatenates themThe bug: API returned all events on every poll
The bash events search API returned all events for the command from the beginning on each poll. The polling loop blindly appended every event:
Expected:
ABC, Actual:AABABCWhy this broke trajectory capture
The evaluation pipeline captures trajectories by running:
tar -czf - workspace/conversations | base64Then decodes with
base64.b64decode(stdout). Base64 requires length to be a multiple of 4. Duplicated chunks broke this:Solution
1. Server-side filtering via
order__gtAdded
order__gtparameter to the bash events search API:API changes (
bash_router.py,bash_service.py):Client changes (
remote_workspace_mixin.py):2. Client-side assertion as safety check
If the API ever returns a duplicate event (bug), fail fast with a clear error:
Tests
test_polling_should_not_duplicate_events_across_iterationstest_base64_output_should_decode_correctlytest_base64_decode_succeeds_with_order_filteringtest_assertion_fires_on_duplicate_eventstest_single_poll_works_correctlyTest Results
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:9693006-pythonRun
All tags pushed for this build
About Multi-Architecture Support
9693006-python) is a multi-arch manifest supporting both amd64 and arm649693006-python-amd64) are also available if needed