
feat(benchmarks): add benchmark definitions and runner script #54

Open
NeuralEmpowerment wants to merge 5 commits into main from feat/benchmark-infrastructure

Conversation

@NeuralEmpowerment
Contributor

Summary

Adds durable benchmark infrastructure for recording agent sessions consistently.

Changes

  • benchmarks.yaml: Defines reusable test scenarios with:

    • Prompt templates
    • Expected event counts and costs
    • Trigger behaviors to verify
  • run_benchmark.sh: Script to run benchmarks via docker-compose

Available Benchmarks

| Name | Description | Expected Cost |
| --- | --- | --- |
| simple-math | Baseline without tools | $0.001 |
| context-window-growth | Multi-language implementation | $0.15 |
| context-compaction | Explicitly triggers /compact command | $0.20 |
| multi-tool | Tool call sequences | $0.02 |
| subagent-demo | Subagent spawning | $0.10 |

Usage

```bash
./scripts/run_benchmark.sh context-compaction
```
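
For a fuller sketch, assuming the file layout in this PR (the listing query below is the same one run_benchmark.sh uses for its error output; the ANTHROPIC_API_KEY name is an assumption, since the PR only says the script checks for an API key):

```bash
# List the benchmark names defined in benchmarks.yaml (run from the repo root):
yq '.benchmarks | keys | .[]' providers/workspaces/claude-cli/fixtures/benchmarks.yaml

# Assumed env var name for the Claude API key; the script's actual check may differ.
export ANTHROPIC_API_KEY="sk-..."
./scripts/run_benchmark.sh context-compaction
```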

Related

  • AEF Issue #67 (Capture compaction recording)

Copilot AI review requested due to automatic review settings February 3, 2026 20:16

Copilot AI left a comment


Pull request overview

This PR introduces a small benchmarking infrastructure to consistently record Claude CLI agent sessions via Docker Compose.

Changes:

  • Adds scripts/run_benchmark.sh to select a benchmark scenario from YAML, ensure prerequisites (yq, API key), and run the recording via docker-compose.record.yaml.
  • Adds providers/workspaces/claude-cli/fixtures/benchmarks.yaml defining several reusable benchmark scenarios (simple math, context window growth, compaction, multi-tool, and subagent behavior) with metadata like expected events/cost and trigger tags.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/run_benchmark.sh | New helper script that loads a benchmark definition, prints summary info, wires TASK/PROMPT/API key, and runs the recording Docker Compose stack. |
| providers/workspaces/claude-cli/fixtures/benchmarks.yaml | Defines named benchmark scenarios and usage notes for recording and validating agent sessions. |


Comment on lines 37 to 42
```bash
if ! yq -e ".benchmarks.$BENCHMARK" "$BENCHMARKS_FILE" > /dev/null 2>&1; then
    echo "Error: Benchmark '$BENCHMARK' not found"
    echo ""
    echo "Available benchmarks:"
    yq '.benchmarks | keys | .[]' "$BENCHMARKS_FILE"
    exit 1
```

Copilot AI Feb 3, 2026


The yq queries that reference .benchmarks.$BENCHMARK will fail for all of the defined benchmarks because their keys contain hyphens (e.g., simple-math, context-window-growth), which yq/jq interpret as subtraction rather than part of the key. To reliably handle these benchmark names, the queries here (and in the extraction block below) should index the map using bracket notation with a quoted key, e.g. .benchmarks["$BENCHMARK"]...., instead of dot notation with an unquoted, hyphenated identifier.

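To illustrate the parsing issue the review describes, a minimal sketch (hypothetical file path; the bracket-notation form matches what the script adopts later in this PR):

```bash
# Dot notation: as the review notes, the hyphen can be parsed as
# subtraction (`.simple - math`), so the lookup fails or returns null:
yq '.benchmarks.simple-math.prompt' benchmarks.yaml

# Bracket notation with a quoted key resolves the hyphenated name correctly:
yq '.benchmarks["simple-math"].prompt' benchmarks.yaml

# With the name in a shell variable, double-quote the whole expression so the
# shell expands $BENCHMARK before yq parses it:
BENCHMARK="simple-math"
yq -r ".benchmarks[\"$BENCHMARK\"].prompt" benchmarks.yaml
```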
```bash
# # Or manually:
# cd providers/workspaces/claude-cli
# TASK="context-window-growth" \
# PROMPT="$(yq '.benchmarks.context-window-growth.prompt' ../fixtures/benchmarks.yaml)" \
```

Copilot AI Feb 3, 2026


In this usage example, the path ../fixtures/benchmarks.yaml is incorrect relative to providers/workspaces/claude-cli—the actual file lives at fixtures/benchmarks.yaml under that directory. Also, the yq selector .benchmarks.context-window-growth.prompt uses an unquoted, hyphenated key, which will be parsed as subtraction; this should be changed to bracket notation with a quoted key (e.g., .benchmarks["context-window-growth"].prompt) to match the actual benchmark name.

Suggested change
```diff
- # PROMPT="$(yq '.benchmarks.context-window-growth.prompt' ../fixtures/benchmarks.yaml)" \
+ # PROMPT="$(yq '.benchmarks["context-window-growth"].prompt' fixtures/benchmarks.yaml)" \
```

```bash
#
# Requires: yq (brew install yq)
#
# See fixtures/benchmarks.yaml for available benchmarks
```

Copilot AI Feb 3, 2026


This comment points to fixtures/benchmarks.yaml without a full path, but the actual file is under providers/workspaces/claude-cli/fixtures/benchmarks.yaml, which can be confusing when running the script from the repo root. To make the documentation accurate and easier to follow, update the reference here to include the full relative path that the script uses for BENCHMARKS_FILE.

Suggested change
```diff
- # See fixtures/benchmarks.yaml for available benchmarks
+ # See providers/workspaces/claude-cli/fixtures/benchmarks.yaml for available benchmarks
```

- Add v2.1.29_claude-sonnet-4-5_multi-model-usage/ recording
  - Shows Sonnet + Haiku in modelUsage breakdown
  - 16 events, $0.095 cost
  - Includes workspace files (h2o_lightyear.py)

- Fix docker-compose.record.yaml YAML syntax
  - Change array+block-scalar to list format for command

- Update benchmarks.yaml
  - Rename context-compaction to multi-model-usage
  - Add note that /compact is interactive (can't use in -p mode)

- Update README with new recording
- Use bracket notation for hyphenated yq keys (fixes subtraction parsing)
- Fix path reference in benchmarks.yaml usage example
- Use full path in run_benchmark.sh comment
Copilot AI review requested due to automatic review settings February 3, 2026 22:30

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.



```bash
# Usage:
# ./scripts/run_benchmark.sh <benchmark-name>
# ./scripts/run_benchmark.sh context-window-growth
# ./scripts/run_benchmark.sh context-compaction
```

Copilot AI Feb 3, 2026


The usage examples here reference a context-compaction benchmark, but fixtures/benchmarks.yaml does not define a context-compaction entry, so running ./scripts/run_benchmark.sh context-compaction will currently fail with "Benchmark 'context-compaction' not found". Either add a context-compaction benchmark definition or update the examples (and PR description) to reference an existing benchmark such as multi-model-usage.

Suggested change
```diff
- # ./scripts/run_benchmark.sh context-compaction
+ # ./scripts/run_benchmark.sh multi-model-usage
```

Comment on lines +116 to +123
```yaml
# Metadata
version: 1
last_updated: "2026-02-03"
notes: |
  Context compaction requires ~190K+ tokens to trigger.
  This typically means a very long session or analyzing a large codebase.
  The "context-compaction" benchmark may not always trigger compaction
  depending on Claude's context window size at the time.
```

Copilot AI Feb 3, 2026


These metadata notes describe a "context-compaction" benchmark, and the PR description also lists context-compaction as an available benchmark, but the benchmarks: map above only defines simple-math, context-window-growth, multi-model-usage, multi-tool, and subagent-demo—there is no context-compaction entry. Please either add a concrete context-compaction benchmark or update these notes and the PR description to match the actual set of defined benchmarks.

@@ -0,0 +1,17 @@
{"_recording": {"version": 1, "cli_version": "2.0.74", "model": "claude-sonnet-4-5", "provider": "claude", "task": "context-compaction", "recorded_at": "2026-02-03T20:17:51.472804+00:00", "duration_ms": 32861, "event_count": 16, "session_id": "d5365e9f-c555-4f7e-9077-74dc11455719", "capture_method": "container_logs"}}

Copilot AI Feb 3, 2026


The recording metadata header reports "cli_version": "2.0.74" and "task": "context-compaction", but the directory and README entry name this fixture v2.1.29_claude-sonnet-4-5_multi-model-usage, with CLI version shown as 2.1.29 and the description focused on multi-model usage. To avoid confusing consumers of this fixture, please align the metadata (at least cli_version, and ideally task) with the directory name and README row—for example by updating the header to use cli_version: "2.1.29" and a task label that matches multi-model-usage.

Suggested change
```diff
- {"_recording": {"version": 1, "cli_version": "2.0.74", "model": "claude-sonnet-4-5", "provider": "claude", "task": "context-compaction", "recorded_at": "2026-02-03T20:17:51.472804+00:00", "duration_ms": 32861, "event_count": 16, "session_id": "d5365e9f-c555-4f7e-9077-74dc11455719", "capture_method": "container_logs"}}
+ {"_recording": {"version": 1, "cli_version": "2.1.29", "model": "claude-sonnet-4-5", "provider": "claude", "task": "multi-model-usage", "recorded_at": "2026-02-03T20:17:51.472804+00:00", "duration_ms": 32861, "event_count": 16, "session_id": "d5365e9f-c555-4f7e-9077-74dc11455719", "capture_method": "container_logs"}}
```

@@ -0,0 +1,89 @@
```bash
#!/bin/bash
```

Copilot AI Feb 3, 2026


Other bash scripts in this repo (for example scripts/validate-stacks.sh:1) use #!/usr/bin/env bash for the shebang, whereas this script hard-codes #!/bin/bash. For portability and consistency with existing scripts, consider switching to #!/usr/bin/env bash here as well.

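For reference, a minimal sketch of the suggested shebang (the strict-mode line is my addition and is not confirmed to be in run_benchmark.sh):

```bash
#!/usr/bin/env bash
# Resolves bash via PATH instead of hard-coding /bin/bash, which also picks up
# a newer Homebrew bash on macOS; matches scripts/validate-stacks.sh.
set -euo pipefail  # assumption: the PR's script may not use strict mode
```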
BREAKING CHANGE: parse_line() now returns list[ObservabilityEvent] instead of ObservabilityEvent | None

Previously, assistant messages with tool_use would only emit TOOL_EXECUTION_STARTED,
losing the token usage data from message.usage. This caused token counts to show 0
when tools were being used.

Changes:
- parse_line() returns list (may be empty, or have multiple events)
- _handle_assistant() now emits TOKEN_USAGE first, then tool events
- _handle_user() also returns list for consistency
- Updated all tests to handle list return type
- Added test for dual-event emission

This fixes token tracking for sessions with tool calls.
Added tools_used field to ObservabilityEvent and populate it when
creating SUBAGENT_STOPPED events. This tracks which tools each
subagent used during execution.

The SubagentState was already tracking this data; it just wasn't being
passed to the event.
Copilot AI review requested due to automatic review settings February 3, 2026 22:52

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.



```bash
fi

# Extract benchmark details (use bracket notation for hyphenated keys)
PROMPT=$(yq -r ".benchmarks[\"$BENCHMARK\"].prompt" "$BENCHMARKS_FILE")
```

Copilot AI Feb 3, 2026


PROMPT is assigned from an unquoted command substitution. Bash does not word-split the right-hand side of a plain assignment, so the multi-line text from benchmarks.yaml (e.g., the block scalar prompts) survives this line as written; newlines are lost only if $PROMPT is later expanded unquoted. Quoting the substitution is still a reasonable consistency fix, but the important check is that every downstream use of $PROMPT is double-quoted so the multi-line prompt reaches the container intact.

Suggested change
```diff
- PROMPT=$(yq -r ".benchmarks[\"$BENCHMARK\"].prompt" "$BENCHMARKS_FILE")
+ PROMPT="$(yq -r ".benchmarks[\"$BENCHMARK\"].prompt" "$BENCHMARKS_FILE")"
```

Comment on lines +5 to +8
```bash
# ./scripts/run_benchmark.sh <benchmark-name>
# ./scripts/run_benchmark.sh context-window-growth
# ./scripts/run_benchmark.sh context-compaction
#
```

Copilot AI Feb 3, 2026


The usage examples reference a context-compaction benchmark, but there is no corresponding context-compaction key in benchmarks.yaml, so running this example will always fail the benchmark existence check. Either add a context-compaction entry to benchmarks.yaml or update the usage examples to point at an existing benchmark name.

Comment on lines +112 to +122
```python
# Parse to events (may return multiple for assistant messages with tool_use)
events = self._parser.parse_line(line)
self._events.extend(events)

# Yield each event with the line (first event gets the line, rest get None indicator)
if events:
    yield line, events[0]
    for event in events[1:]:
        yield "", event  # Empty string indicates continuation event
else:
    yield line, None
```

Copilot AI Feb 3, 2026


Now that parse_line() can return multiple events per line and tee() flattens them into self._events, the replay path above (if self._consumed) still assumes a 1:1 mapping between self._raw_lines and self._events (event = self._events[i]). On subsequent calls to tee() (and to events()/raw_lines(), which use it), extra events from lines that produced more than one event are therefore dropped or misaligned with their source line. The buffering structure needs to keep all events per line so replay preserves them correctly.
