Skip to content

Conversation

@neubig
Copy link
Contributor

@neubig neubig commented Jan 24, 2026

Summary

Fixes output corruption in RemoteWorkspaceMixin._execute_command_generator() by adding server-side filtering to prevent duplicate events, plus a client-side assertion as a safety check.

Problem

How the polling loop works

When executing a remote bash command, the SDK:

  1. Calls POST /api/bash/start_bash_command to start the command
  2. Polls GET /api/bash/bash_events/search?command_id__eq=... repeatedly until the command completes
  3. Collects stdout/stderr from each BashOutput event and concatenates them

The bug: API returned all events on every poll

The bash events search API returned all events for the command from the beginning on each poll. The polling loop blindly appended every event:

Poll 1: API returns [A]       → appends A         → stdout = "A"
Poll 2: API returns [A, B]    → appends A, B      → stdout = "AAB"
Poll 3: API returns [A, B, C] → appends A, B, C   → stdout = "AABABC"

Expected: ABC, Actual: AABABC

Why this broke trajectory capture

The evaluation pipeline captures trajectories by running:

tar -czf - workspace/conversations | base64

Then decodes with base64.b64decode(stdout). Base64 requires length to be a multiple of 4. Duplicated chunks broke this:

Original: 68 chars (68 % 4 = 0) ✓
Corrupted: 119 chars (119 % 4 = 3) ✗ → "Incorrect padding"

Solution

1. Server-side filtering via order__gt

Added order__gt parameter to the bash events search API:

API changes (bash_router.py, bash_service.py):

@bash_router.get("/bash_events/search")
async def search_bash_events(
    ...
    order__gt: int | None = None,  # NEW: filter events with order > this value
    ...
)

Client changes (remote_workspace_mixin.py):

last_order = -1
while ...:
    params = {"command_id__eq": command_id}
    if last_order >= 0:
        params["order__gt"] = last_order  # Only fetch new events
    
    for event in response["items"]:
        last_order = max(last_order, event["order"])

2. Client-side assertion as safety check

If the API ever returns a duplicate event (bug), fail fast with a clear error:

seen_event_ids: set[str] = set()
for event in items:
    event_id = event.get("id")
    if event_id is not None:
        assert event_id not in seen_event_ids, (
            f"Duplicate event received: {event_id}. "
            f"The API should have filtered this via order__gt"
        )
        seen_event_ids.add(event_id)

Tests

Test Description
test_polling_should_not_duplicate_events_across_iterations Verifies output is not duplicated
test_base64_output_should_decode_correctly Verifies base64 output decodes
test_base64_decode_succeeds_with_order_filtering Verifies fix prevents padding errors
test_assertion_fires_on_duplicate_events Verifies assertion catches API bugs
test_single_poll_works_correctly Baseline test for single-poll case

Test Results

tests/sdk/workspace/remote/test_remote_workspace_mixin.py: 22 passed
tests/sdk/workspace/remote/test_polling_duplicates_output.py: 5 passed
Total: 27 passed

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.12-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:9693006-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-9693006-python \
  ghcr.io/openhands/agent-server:9693006-python

All tags pushed for this build

ghcr.io/openhands/agent-server:9693006-golang-amd64
ghcr.io/openhands/agent-server:9693006-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:9693006-golang-arm64
ghcr.io/openhands/agent-server:9693006-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:9693006-java-amd64
ghcr.io/openhands/agent-server:9693006-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:9693006-java-arm64
ghcr.io/openhands/agent-server:9693006-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:9693006-python-amd64
ghcr.io/openhands/agent-server:9693006-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:9693006-python-arm64
ghcr.io/openhands/agent-server:9693006-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:9693006-golang
ghcr.io/openhands/agent-server:9693006-java
ghcr.io/openhands/agent-server:9693006-python

About Multi-Architecture Support

  • Each variant tag (e.g., 9693006-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 9693006-python-amd64) are also available if needed

This adds tests that demonstrate a bug in RemoteWorkspaceMixin where
the polling loop duplicates stdout/stderr output across multiple poll
iterations.

The bug occurs because:
1. The polling loop fetches ALL events from the beginning on each iteration
2. Events are appended to stdout_parts without deduplication
3. This causes output like: A + B + A + B + C + A + B + C + D

This bug causes base64 decoding failures when capturing conversation
trajectories, as the duplicated base64 output becomes invalid:
- 'Incorrect padding'
- 'Invalid base64-encoded string: number of data characters cannot be
   1 more than a multiple of 4'

The tests document the bug and will help verify the fix.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Copy link
Contributor

github-actions bot commented Jan 24, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-agent-server/openhands/agent_server
   bash_router.py431272%78–81, 90–91, 105–110
   bash_service.py1882288%69–71, 140–141, 143–144, 174–176, 253, 258–259, 285–286, 313–314, 316, 323–324, 354–355
openhands-sdk/openhands/sdk/workspace/remote
   remote_workspace_mixin.py119992%312, 318–321, 339, 345–347
TOTAL16818415875% 

Remove unused variables flagged by ruff.

Co-authored-by: openhands <openhands@all-hands.dev>
Tests now assert the CORRECT expected behavior:
- Output should be deduplicated across poll iterations
- Base64 output should decode correctly

These tests FAIL because the bug exists, demonstrating:
- Expected: 'CHUNK1CHUNK2CHUNK3'
- Actual: 'CHUNK1CHUNK1CHUNK2CHUNK1CHUNK2CHUNK3'

When the fix is implemented, these tests will pass.

Co-authored-by: openhands <openhands@all-hands.dev>
Added test_base64_decode_produces_incorrect_padding_error which:
1. Simulates the polling loop with duplicated events
2. Calls base64.b64decode() on the corrupted output
3. Fails with: binascii.Error: Incorrect padding

This reproduces the exact error seen in production logs during
trajectory capture.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Jan 24, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1816 at branch `fix/polling-output-duplication-bug`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

The bash events search API returns ALL events from the beginning on each
poll iteration. Without deduplication, output gets duplicated across polls:
- Poll 1: [A] → append A
- Poll 2: [A, B] → append A, B (A duplicated!)
- Poll 3: [A, B, C] → append A, B, C (A, B duplicated again!)

This caused base64 decoding failures in trajectory capture because the
duplicated output length was no longer a multiple of 4:
- Original: 68 chars (valid)
- Duplicated: 119 chars (119 % 4 = 3 → 'Incorrect padding' error)

Fix: Track seen event IDs and skip duplicates. Events without an ID
are processed without deduplication (backwards compatibility).

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig changed the title DRAFT: Add tests for polling output duplication bug fix: deduplicate events in polling loop to prevent output corruption Jan 24, 2026
@neubig neubig changed the title fix: deduplicate events in polling loop to prevent output corruption fix: deduplicate polling events to prevent base64 trajectory capture failures Jan 24, 2026
Instead of client-side deduplication, add server-side filtering:

API changes (bash_router.py, bash_service.py):
- Add order__gt query parameter to /api/bash/bash_events/search
- Filter BashOutput events where order > order__gt
- More efficient: reduces data transfer on each poll

Client changes (remote_workspace_mixin.py):
- Track last_order seen (starts at -1)
- Pass order__gt parameter on subsequent polls
- First poll gets all events, subsequent polls get only new ones

This is more efficient than client-side deduplication because:
- Less data transferred over the network
- Server does the filtering instead of client
- No need to track seen event IDs

Co-authored-by: openhands <openhands@all-hands.dev>
The API should prevent duplicates via order__gt filtering, but add a
client-side assertion as a safety check. If the API ever returns a
duplicate event, we fail fast with a clear error message rather than
silently corrupting output.

Also adds test_assertion_fires_on_duplicate_events to verify this
behavior.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig changed the title fix: deduplicate polling events to prevent base64 trajectory capture failures fix: prevent duplicate events in bash polling via order__gt filtering Jan 24, 2026
@neubig neubig marked this pull request as ready for review January 24, 2026 15:20
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a critical bug with duplicate detection and several other issues that need attention. The main issue is that the duplicate check applies to all event types, but server-side filtering only applies to BashOutput events. See inline comments for details.

@neubig
Copy link
Contributor Author

neubig commented Jan 26, 2026

@OpenHands check the review comments from all-hands-bot and my responses, and fix accordingly. Make sure that CI passes afterwards

@openhands-ai
Copy link

openhands-ai bot commented Jan 26, 2026

I'm on it! neubig can track my progress at all-hands.dev

- Add kind__eq=BashOutput to params for server-side filtering
- Change assert to explicit if/raise RuntimeError for duplicate detection
- Move duplicate check inside the BashOutput kind check block
- Update error message to mention both order__gt and kind filtering
- Add tests for mixed event types (BashCommand + BashOutput)

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Jan 26, 2026

I've addressed all the review comments from all-hands-bot and neubig's responses. The changes have been pushed to the fix/polling-output-duplication-bug branch.

Summary of Changes

Code Changes (remote_workspace_mixin.py)

Review Comment Status
✅ Add "kind__eq": "BashOutput" to params for server-side filtering Done
✅ Change assert to explicit if/raise RuntimeError for duplicate detection Done
✅ Move duplicate check inside the if event.get("kind") == "BashOutput" block Done
✅ Update error message to mention both order__gt and kind filtering Done

Test Changes (test_polling_duplicates_output.py)

Review Comment Status
✅ Add test for mixed event types (BashCommand + BashOutput) Done - Added 2 new tests

New Tests Added

  1. test_mixed_event_types_with_kind_filtering - Verifies that BashCommand events are ignored when mixed with BashOutput events across multiple polls
  2. test_bash_command_events_are_ignored - Verifies that BashCommand events are properly skipped during processing

Test Results

All 84 tests in tests/sdk/workspace/remote/ pass, including the 7 tests in test_polling_duplicates_output.py.

The changes have been pushed to the PR branch and CI should now run the tests.

View full conversation

@neubig neubig requested a review from simonrosenberg January 28, 2026 15:48
@simonrosenberg
Copy link
Collaborator

simonrosenberg commented Jan 28, 2026

@neubig Having a test reproducing the error and having it fixed makes it highly probable that the PR fixes the issue. We could also trigger a workflow run with that feature branch and make sure the bug doesn't appear again to validate it definitively?

@simonrosenberg
Copy link
Collaborator

@neubig Having a test reproducing the error and having it fixed makes it highly probable that the PR fixes the issue. We could also trigger a workflow run with that feature branch and make sure the bug doesn't appear again to validate it definitively?

PR approved but I am curious to see it validated on a real run!

@neubig
Copy link
Contributor Author

neubig commented Jan 30, 2026

@simonrosenberg , yes I ran it on swe-bench-multimodal and it fixed the issue I was observing there!

@neubig neubig merged commit 32ca3d8 into main Jan 30, 2026
26 checks passed
@neubig neubig deleted the fix/polling-output-duplication-bug branch January 30, 2026 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants