fix: prevent duplicate events in bash polling via order__gt filtering #1816

neubig · 2026-01-24T13:56:44Z

Summary

Fixes output corruption in RemoteWorkspaceMixin._execute_command_generator() by adding server-side filtering to prevent duplicate events, plus a client-side assertion as a safety check.

Problem

How the polling loop works

When executing a remote bash command, the SDK:

Calls POST /api/bash/start_bash_command to start the command
Polls GET /api/bash/bash_events/search?command_id__eq=... repeatedly until the command completes
Collects stdout/stderr from each BashOutput event and concatenates them

The bug: API returned all events on every poll

The bash events search API returned all events for the command from the beginning on each poll. The polling loop blindly appended every event:

Poll 1: API returns [A]       → appends A         → stdout = "A"
Poll 2: API returns [A, B]    → appends A, B      → stdout = "AAB"
Poll 3: API returns [A, B, C] → appends A, B, C   → stdout = "AABABC"

Expected: ABC, Actual: AABABC

Why this broke trajectory capture

The evaluation pipeline captures trajectories by running:

tar -czf - workspace/conversations | base64

Then decodes with base64.b64decode(stdout). Base64 requires length to be a multiple of 4. Duplicated chunks broke this:

Original: 68 chars (68 % 4 = 0) ✓
Corrupted: 119 chars (119 % 4 = 3) ✗ → "Incorrect padding"

Solution

1. Server-side filtering via `order__gt`

Added order__gt parameter to the bash events search API:

API changes (bash_router.py, bash_service.py):

@bash_router.get("/bash_events/search")
async def search_bash_events(
    ...
    order__gt: int | None = None,  # NEW: filter events with order > this value
    ...
)

Client changes (remote_workspace_mixin.py):

last_order = -1
while ...:
    params = {"command_id__eq": command_id}
    if last_order >= 0:
        params["order__gt"] = last_order  # Only fetch new events
    
    for event in response["items"]:
        last_order = max(last_order, event["order"])

2. Client-side assertion as safety check

If the API ever returns a duplicate event (bug), fail fast with a clear error:

seen_event_ids: set[str] = set()
for event in items:
    event_id = event.get("id")
    if event_id is not None:
        assert event_id not in seen_event_ids, (
            f"Duplicate event received: {event_id}. "
            f"The API should have filtered this via order__gt"
        )
        seen_event_ids.add(event_id)

Tests

Test	Description
`test_polling_should_not_duplicate_events_across_iterations`	Verifies output is not duplicated
`test_base64_output_should_decode_correctly`	Verifies base64 output decodes
`test_base64_decode_succeeds_with_order_filtering`	Verifies fix prevents padding errors
`test_assertion_fires_on_duplicate_events`	Verifies assertion catches API bugs
`test_single_poll_works_correctly`	Baseline test for single-poll case

Test Results

tests/sdk/workspace/remote/test_remote_workspace_mixin.py: 22 passed
tests/sdk/workspace/remote/test_polling_duplicates_output.py: 5 passed
Total: 27 passed

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.12-nodejs22`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:9693006-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-9693006-python \
  ghcr.io/openhands/agent-server:9693006-python

All tags pushed for this build

ghcr.io/openhands/agent-server:9693006-golang-amd64
ghcr.io/openhands/agent-server:9693006-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:9693006-golang-arm64
ghcr.io/openhands/agent-server:9693006-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:9693006-java-amd64
ghcr.io/openhands/agent-server:9693006-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:9693006-java-arm64
ghcr.io/openhands/agent-server:9693006-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:9693006-python-amd64
ghcr.io/openhands/agent-server:9693006-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:9693006-python-arm64
ghcr.io/openhands/agent-server:9693006-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:9693006-golang
ghcr.io/openhands/agent-server:9693006-java
ghcr.io/openhands/agent-server:9693006-python

About Multi-Architecture Support

Each variant tag (e.g., 9693006-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 9693006-python-amd64) are also available if needed

This adds tests that demonstrate a bug in RemoteWorkspaceMixin where the polling loop duplicates stdout/stderr output across multiple poll iterations. The bug occurs because: 1. The polling loop fetches ALL events from the beginning on each iteration 2. Events are appended to stdout_parts without deduplication 3. This causes output like: A + B + A + B + C + A + B + C + D This bug causes base64 decoding failures when capturing conversation trajectories, as the duplicated base64 output becomes invalid: - 'Incorrect padding' - 'Invalid base64-encoded string: number of data characters cannot be 1 more than a multiple of 4' The tests document the bug and will help verify the fix. Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-01-24T13:59:44Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-agent-server/openhands/agent_server
bash_router.py	43	12	72%	78–81, 90–91, 105–110
bash_service.py	188	22	88%	69–71, 140–141, 143–144, 174–176, 253, 258–259, 285–286, 313–314, 316, 323–324, 354–355
openhands-sdk/openhands/sdk/workspace/remote
remote_workspace_mixin.py	119	9	92%	312, 318–321, 339, 345–347
TOTAL	16818	4158	75%

Remove unused variables flagged by ruff. Co-authored-by: openhands <openhands@all-hands.dev>

Tests now assert the CORRECT expected behavior: - Output should be deduplicated across poll iterations - Base64 output should decode correctly These tests FAIL because the bug exists, demonstrating: - Expected: 'CHUNK1CHUNK2CHUNK3' - Actual: 'CHUNK1CHUNK1CHUNK2CHUNK1CHUNK2CHUNK3' When the fix is implemented, these tests will pass. Co-authored-by: openhands <openhands@all-hands.dev>

Added test_base64_decode_produces_incorrect_padding_error which: 1. Simulates the polling loop with duplicated events 2. Calls base64.b64decode() on the corrupted output 3. Fails with: binascii.Error: Incorrect padding This reproduces the exact error seen in production logs during trajectory capture. Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-01-24T14:52:29Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1816 at branch `fix/polling-output-duplication-bug`

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

The bash events search API returns ALL events from the beginning on each poll iteration. Without deduplication, output gets duplicated across polls: - Poll 1: [A] → append A - Poll 2: [A, B] → append A, B (A duplicated!) - Poll 3: [A, B, C] → append A, B, C (A, B duplicated again!) This caused base64 decoding failures in trajectory capture because the duplicated output length was no longer a multiple of 4: - Original: 68 chars (valid) - Duplicated: 119 chars (119 % 4 = 3 → 'Incorrect padding' error) Fix: Track seen event IDs and skip duplicates. Events without an ID are processed without deduplication (backwards compatibility). Co-authored-by: openhands <openhands@all-hands.dev>

Instead of client-side deduplication, add server-side filtering: API changes (bash_router.py, bash_service.py): - Add order__gt query parameter to /api/bash/bash_events/search - Filter BashOutput events where order > order__gt - More efficient: reduces data transfer on each poll Client changes (remote_workspace_mixin.py): - Track last_order seen (starts at -1) - Pass order__gt parameter on subsequent polls - First poll gets all events, subsequent polls get only new ones This is more efficient than client-side deduplication because: - Less data transferred over the network - Server does the filtering instead of client - No need to track seen event IDs Co-authored-by: openhands <openhands@all-hands.dev>

The API should prevent duplicates via order__gt filtering, but add a client-side assertion as a safety check. If the API ever returns a duplicate event, we fail fast with a clear error message rather than silently corrupting output. Also adds test_assertion_fires_on_duplicate_events to verify this behavior. Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

Found a critical bug with duplicate detection and several other issues that need attention. The main issue is that the duplicate check applies to all event types, but server-side filtering only applies to BashOutput events. See inline comments for details.

openhands-sdk/openhands/sdk/workspace/remote/remote_workspace_mixin.py

tests/sdk/workspace/remote/test_polling_duplicates_output.py

neubig · 2026-01-26T12:58:29Z

@OpenHands check the review comments from all-hands-bot and my responses, and fix accordingly. Make sure that CI passes afterwards

openhands-ai · 2026-01-26T12:58:39Z

I'm on it! neubig can track my progress at all-hands.dev

- Add kind__eq=BashOutput to params for server-side filtering - Change assert to explicit if/raise RuntimeError for duplicate detection - Move duplicate check inside the BashOutput kind check block - Update error message to mention both order__gt and kind filtering - Add tests for mixed event types (BashCommand + BashOutput) Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-01-26T13:10:15Z

I've addressed all the review comments from all-hands-bot and neubig's responses. The changes have been pushed to the fix/polling-output-duplication-bug branch.

Summary of Changes

Code Changes (`remote_workspace_mixin.py`)

Review Comment	Status
✅ Add `"kind__eq": "BashOutput"` to params for server-side filtering	Done
✅ Change `assert` to explicit `if/raise RuntimeError` for duplicate detection	Done
✅ Move duplicate check inside the `if event.get("kind") == "BashOutput"` block	Done
✅ Update error message to mention both order__gt and kind filtering	Done

Test Changes (`test_polling_duplicates_output.py`)

Review Comment	Status
✅ Add test for mixed event types (BashCommand + BashOutput)	Done - Added 2 new tests

New Tests Added

test_mixed_event_types_with_kind_filtering - Verifies that BashCommand events are ignored when mixed with BashOutput events across multiple polls
test_bash_command_events_are_ignored - Verifies that BashCommand events are properly skipped during processing

Test Results

All 84 tests in tests/sdk/workspace/remote/ pass, including the 7 tests in test_polling_duplicates_output.py.

The changes have been pushed to the PR branch and CI should now run the tests.

View full conversation

simonrosenberg · 2026-01-28T17:31:49Z

@neubig Having a test reproducing the error and having it fixed makes it highly probable that the PR fixes the issue. We could also trigger a workflow run with that feature branch and make sure the bug doesn't appear again to validate it definitively?

simonrosenberg · 2026-01-28T17:39:42Z

@neubig Having a test reproducing the error and having it fixed makes it highly probable that the PR fixes the issue. We could also trigger a workflow run with that feature branch and make sure the bug doesn't appear again to validate it definitively?

PR approved but I am curious to see it validated on a real run!

neubig · 2026-01-30T17:19:39Z

@simonrosenberg , yes I ran it on swe-bench-multimodal and it fixed the issue I was observing there!

openhands-agent added 3 commits January 24, 2026 14:00

fix: resolve pre-commit lint errors

410eaf0

Remove unused variables flagged by ruff. Co-authored-by: openhands <openhands@all-hands.dev>

neubig changed the title ~~DRAFT: Add tests for polling output duplication bug~~ fix: deduplicate events in polling loop to prevent output corruption Jan 24, 2026

neubig changed the title ~~fix: deduplicate events in polling loop to prevent output corruption~~ fix: deduplicate polling events to prevent base64 trajectory capture failures Jan 24, 2026

openhands-agent added 2 commits January 24, 2026 15:16

neubig changed the title ~~fix: deduplicate polling events to prevent base64 trajectory capture failures~~ fix: prevent duplicate events in bash polling via order__gt filtering Jan 24, 2026

neubig marked this pull request as ready for review January 24, 2026 15:20

all-hands-bot reviewed Jan 24, 2026

View reviewed changes

neubig mentioned this pull request Jan 24, 2026

DRAFT: Combined branch for swe-bench-multimodal evaluation #1818

Closed

neubig requested a review from simonrosenberg January 28, 2026 15:48

simonrosenberg approved these changes Jan 28, 2026

View reviewed changes

neubig merged commit 32ca3d8 into main Jan 30, 2026
26 checks passed

neubig deleted the fix/polling-output-duplication-bug branch January 30, 2026 17:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent duplicate events in bash polling via order__gt filtering #1816

fix: prevent duplicate events in bash polling via order__gt filtering #1816

Uh oh!

neubig commented Jan 24, 2026 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Jan 24, 2026 •

edited

Loading

Uh oh!

openhands-ai bot commented Jan 24, 2026

Uh oh!

all-hands-bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

neubig commented Jan 26, 2026

Uh oh!

openhands-ai bot commented Jan 26, 2026

Uh oh!

openhands-ai bot commented Jan 26, 2026

Uh oh!

simonrosenberg commented Jan 28, 2026 •

edited

Loading

Uh oh!

simonrosenberg commented Jan 28, 2026

Uh oh!

neubig commented Jan 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fix: prevent duplicate events in bash polling via order__gt filtering #1816

fix: prevent duplicate events in bash polling via order__gt filtering #1816

Uh oh!

Conversation

neubig commented Jan 24, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

How the polling loop works

The bug: API returned all events on every poll

Why this broke trajectory capture

Solution

1. Server-side filtering via order__gt

2. Client-side assertion as safety check

Tests

Test Results

Uh oh!

github-actions bot commented Jan 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openhands-ai bot commented Jan 24, 2026

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

neubig commented Jan 26, 2026

Uh oh!

openhands-ai bot commented Jan 26, 2026

Uh oh!

openhands-ai bot commented Jan 26, 2026

Summary of Changes

Code Changes (remote_workspace_mixin.py)

Test Changes (test_polling_duplicates_output.py)

New Tests Added

Test Results

Uh oh!

simonrosenberg commented Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simonrosenberg commented Jan 28, 2026

Uh oh!

neubig commented Jan 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

neubig commented Jan 24, 2026 •

edited by github-actions bot

Loading

1. Server-side filtering via `order__gt`

github-actions bot commented Jan 24, 2026 •

edited

Loading

Code Changes (`remote_workspace_mixin.py`)

Test Changes (`test_polling_duplicates_output.py`)

simonrosenberg commented Jan 28, 2026 •

edited

Loading