feat(benchmarks): add benchmark definitions and runner script #54
NeuralEmpowerment wants to merge 5 commits into main
Conversation
Adds durable benchmark infrastructure for recording agent sessions:

- benchmarks.yaml: Defines reusable test scenarios with expected costs
- run_benchmark.sh: Script to run benchmarks via docker-compose

Includes benchmarks for:

- simple-math: Baseline without tools
- context-window-growth: Multi-language implementation
- context-compaction: Explicitly triggers the /compact command
- multi-tool: Tool call sequences
- subagent-demo: Subagent spawning
Pull request overview
This PR introduces a small benchmarking infrastructure to consistently record Claude CLI agent sessions via Docker Compose.
Changes:
- Adds scripts/run_benchmark.sh to select a benchmark scenario from YAML, ensure prerequisites (yq, API key), and run the recording via docker-compose.record.yaml.
- Adds providers/workspaces/claude-cli/fixtures/benchmarks.yaml, defining several reusable benchmark scenarios (simple math, context window growth, compaction, multi-tool, and subagent behavior) with metadata like expected events/cost and trigger tags.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/run_benchmark.sh | New helper script that loads a benchmark definition, prints summary info, wires TASK/PROMPT/API key, and runs the recording Docker Compose stack. |
| providers/workspaces/claude-cli/fixtures/benchmarks.yaml | Defines named benchmark scenarios and usage notes for recording and validating agent sessions. |
scripts/run_benchmark.sh
Outdated
if ! yq -e ".benchmarks.$BENCHMARK" "$BENCHMARKS_FILE" > /dev/null 2>&1; then
    echo "Error: Benchmark '$BENCHMARK' not found"
    echo ""
    echo "Available benchmarks:"
    yq '.benchmarks | keys | .[]' "$BENCHMARKS_FILE"
    exit 1
The yq queries that reference .benchmarks.$BENCHMARK will fail for all of the defined benchmarks because their keys contain hyphens (e.g., simple-math, context-window-growth), which yq/jq interpret as subtraction rather than part of the key. To reliably handle these benchmark names, the queries here (and in the extraction block below) should index the map using bracket notation with a quoted key, e.g. .benchmarks["$BENCHMARK"]...., instead of dot notation with an unquoted, hyphenated identifier.
# # Or manually:
# cd providers/workspaces/claude-cli
# TASK="context-window-growth" \
# PROMPT="$(yq '.benchmarks.context-window-growth.prompt' ../fixtures/benchmarks.yaml)" \
In this usage example, the path ../fixtures/benchmarks.yaml is incorrect relative to providers/workspaces/claude-cli—the actual file lives at fixtures/benchmarks.yaml under that directory. Also, the yq selector .benchmarks.context-window-growth.prompt uses an unquoted, hyphenated key, which will be parsed as subtraction; this should be changed to bracket notation with a quoted key (e.g., .benchmarks["context-window-growth"].prompt) to match the actual benchmark name.
Suggested change:
- # PROMPT="$(yq '.benchmarks.context-window-growth.prompt' ../fixtures/benchmarks.yaml)" \
+ # PROMPT="$(yq '.benchmarks["context-window-growth"].prompt' fixtures/benchmarks.yaml)" \
scripts/run_benchmark.sh
Outdated
#
# Requires: yq (brew install yq)
#
# See fixtures/benchmarks.yaml for available benchmarks
This comment points to fixtures/benchmarks.yaml without a full path, but the actual file is under providers/workspaces/claude-cli/fixtures/benchmarks.yaml, which can be confusing when running the script from the repo root. To make the documentation accurate and easier to follow, update the reference here to include the full relative path that the script uses for BENCHMARKS_FILE.
Suggested change:
- # See fixtures/benchmarks.yaml for available benchmarks
+ # See providers/workspaces/claude-cli/fixtures/benchmarks.yaml for available benchmarks
- Add v2.1.29_claude-sonnet-4-5_multi-model-usage/ recording
  - Shows Sonnet + Haiku in modelUsage breakdown
  - 16 events, $0.095 cost
  - Includes workspace files (h2o_lightyear.py)
- Fix docker-compose.record.yaml YAML syntax
  - Change array+block-scalar to list format for command
- Update benchmarks.yaml
  - Rename context-compaction to multi-model-usage
  - Add note that /compact is interactive (can't be used in -p mode)
- Update README with new recording
- Use bracket notation for hyphenated yq keys (fixes subtraction parsing)
- Fix path reference in benchmarks.yaml usage example
- Use full path in run_benchmark.sh comment
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
# Usage:
#   ./scripts/run_benchmark.sh <benchmark-name>
#   ./scripts/run_benchmark.sh context-window-growth
#   ./scripts/run_benchmark.sh context-compaction
The usage examples here reference a context-compaction benchmark, but fixtures/benchmarks.yaml does not define a context-compaction entry, so running ./scripts/run_benchmark.sh context-compaction will currently fail with "Benchmark 'context-compaction' not found". Either add a context-compaction benchmark definition or update the examples (and PR description) to reference an existing benchmark such as multi-model-usage.
Suggested change:
- #   ./scripts/run_benchmark.sh context-compaction
+ #   ./scripts/run_benchmark.sh multi-model-usage
# Metadata
version: 1
last_updated: "2026-02-03"
notes: |
  Context compaction requires ~190K+ tokens to trigger.
  This typically means a very long session or analyzing a large codebase.
  The "context-compaction" benchmark may not always trigger compaction
  depending on Claude's context window size at the time.
These metadata notes describe a "context-compaction" benchmark, and the PR description also lists context-compaction as an available benchmark, but the benchmarks: map above only defines simple-math, context-window-growth, multi-model-usage, multi-tool, and subagent-demo—there is no context-compaction entry. Please either add a concrete context-compaction benchmark or update these notes and the PR description to match the actual set of defined benchmarks.
@@ -0,0 +1,17 @@
{"_recording": {"version": 1, "cli_version": "2.0.74", "model": "claude-sonnet-4-5", "provider": "claude", "task": "context-compaction", "recorded_at": "2026-02-03T20:17:51.472804+00:00", "duration_ms": 32861, "event_count": 16, "session_id": "d5365e9f-c555-4f7e-9077-74dc11455719", "capture_method": "container_logs"}}
The recording metadata header reports "cli_version": "2.0.74" and "task": "context-compaction", but the directory and README entry name this fixture v2.1.29_claude-sonnet-4-5_multi-model-usage, with CLI version shown as 2.1.29 and the description focused on multi-model usage. To avoid confusing consumers of this fixture, please align the metadata (at least cli_version, and ideally task) with the directory name and README row—for example by updating the header to use cli_version: "2.1.29" and a task label that matches multi-model-usage.
| {"_recording": {"version": 1, "cli_version": "2.0.74", "model": "claude-sonnet-4-5", "provider": "claude", "task": "context-compaction", "recorded_at": "2026-02-03T20:17:51.472804+00:00", "duration_ms": 32861, "event_count": 16, "session_id": "d5365e9f-c555-4f7e-9077-74dc11455719", "capture_method": "container_logs"}} | |
| {"_recording": {"version": 1, "cli_version": "2.1.29", "model": "claude-sonnet-4-5", "provider": "claude", "task": "multi-model-usage", "recorded_at": "2026-02-03T20:17:51.472804+00:00", "duration_ms": 32861, "event_count": 16, "session_id": "d5365e9f-c555-4f7e-9077-74dc11455719", "capture_method": "container_logs"}} |
@@ -0,0 +1,89 @@
#!/bin/bash
Other bash scripts in this repo (for example scripts/validate-stacks.sh:1) use #!/usr/bin/env bash for the shebang, whereas this script hard-codes #!/bin/bash. For portability and consistency with existing scripts, consider switching to #!/usr/bin/env bash here as well.
BREAKING CHANGE: parse_line() now returns list[ObservabilityEvent] instead of ObservabilityEvent | None

Previously, assistant messages with tool_use would only emit TOOL_EXECUTION_STARTED, losing the token usage data from message.usage. This caused token counts to show 0 when tools were being used.

Changes:
- parse_line() returns a list (may be empty, or contain multiple events)
- _handle_assistant() now emits TOKEN_USAGE first, then tool events
- _handle_user() also returns a list for consistency
- Updated all tests to handle the list return type
- Added a test for dual-event emission

This fixes token tracking for sessions with tool calls.
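A minimal sketch of the new return shape, based only on what the commit message describes; the EventType members, field names, and JSON structure below are assumptions for illustration, not the project's actual definitions:

```python
import json
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any


class EventType(Enum):
    TOKEN_USAGE = auto()
    TOOL_EXECUTION_STARTED = auto()


@dataclass
class ObservabilityEvent:
    type: EventType
    data: dict[str, Any] = field(default_factory=dict)


def parse_line(line: str) -> list[ObservabilityEvent]:
    """Parse one stream-json line; may emit zero, one, or several events."""
    msg = json.loads(line)
    if msg.get("type") != "assistant":
        return []  # other message types are handled elsewhere in the real parser
    events: list[ObservabilityEvent] = []
    usage = msg.get("message", {}).get("usage")
    if usage:
        # Emit token usage first so it is not lost when tool_use blocks follow.
        events.append(ObservabilityEvent(EventType.TOKEN_USAGE, usage))
    for block in msg.get("message", {}).get("content", []):
        if block.get("type") == "tool_use":
            events.append(ObservabilityEvent(EventType.TOOL_EXECUTION_STARTED, block))
    return events
```

Returning a (possibly empty) list lets callers simply extend their buffers instead of branching on None versus a single event.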
Added a tools_used field to ObservabilityEvent and populated it when creating SUBAGENT_STOPPED events. This tracks which tools each subagent used during execution. SubagentState was already tracking this data; it just wasn't being passed to the event.
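A rough illustration of the shape of that change; SubagentStoppedEvent and on_subagent_stopped are hypothetical names, since the actual event type and construction site are not shown here:

```python
from dataclasses import dataclass, field


@dataclass
class SubagentState:
    agent_id: str
    tools_used: list[str] = field(default_factory=list)  # already accumulated before this change


@dataclass
class SubagentStoppedEvent:
    agent_id: str
    tools_used: list[str] = field(default_factory=list)  # new field on the event


def on_subagent_stopped(state: SubagentState) -> SubagentStoppedEvent:
    # Carry the accumulated tool names onto the emitted event instead of dropping them.
    return SubagentStoppedEvent(agent_id=state.agent_id, tools_used=list(state.tools_used))
```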
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
fi

# Extract benchmark details (use bracket notation for hyphenated keys)
PROMPT=$(yq -r ".benchmarks[\"$BENCHMARK\"].prompt" "$BENCHMARKS_FILE")
PROMPT is assigned using unquoted command substitution, which will collapse newlines in multi-line prompts into spaces and lose the formatting from benchmarks.yaml (e.g., for the block scalar prompts). To preserve line breaks when reading prompts with yq, wrap the command substitution in double quotes so the environment variable retains the original multi-line text.
Suggested change:
- PROMPT=$(yq -r ".benchmarks[\"$BENCHMARK\"].prompt" "$BENCHMARKS_FILE")
+ PROMPT="$(yq -r ".benchmarks[\"$BENCHMARK\"].prompt" "$BENCHMARKS_FILE")"
#   ./scripts/run_benchmark.sh <benchmark-name>
#   ./scripts/run_benchmark.sh context-window-growth
#   ./scripts/run_benchmark.sh context-compaction
#
The usage examples reference a context-compaction benchmark, but there is no corresponding context-compaction key in benchmarks.yaml, so running this example will always fail the benchmark existence check. Either add a context-compaction entry to benchmarks.yaml or update the usage examples to point at an existing benchmark name.
# Parse to events (may return multiple for assistant messages with tool_use)
events = self._parser.parse_line(line)
self._events.extend(events)

# Yield each event with the line (first event gets the line, rest get None indicator)
if events:
    yield line, events[0]
    for event in events[1:]:
        yield "", event  # Empty string indicates continuation event
else:
    yield line, None
Now that parse_line() can return multiple events per line and tee() flattens them into self._events, the replay path above (if self._consumed) still assumes a 1:1 mapping between self._raw_lines and self._events (event = self._events[i]). This means on subsequent calls to tee() (and to events()/raw_lines() which use it), extra events from lines that produced more than one event will be dropped or misaligned with their source line; the buffering structure needs to be adjusted so all events per line are preserved correctly on replay.
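One way to keep replay aligned is to buffer the full event list per raw line and reuse the same emission logic on both passes. The sketch below is illustrative only; SessionTee, _events_per_line, and _emit are invented names and not the project's actual implementation:

```python
from typing import Iterable, Iterator, Optional, Tuple


class SessionTee:
    """Illustrative buffering: keep all events produced by each line, in parallel lists."""

    def __init__(self, parser) -> None:
        self._parser = parser
        self._raw_lines: list[str] = []
        self._events_per_line: list[list] = []  # index i holds every event from _raw_lines[i]
        self._consumed = False

    def tee(self, lines: Iterable[str]) -> Iterator[Tuple[str, Optional[object]]]:
        if self._consumed:
            # Replay: line i yields exactly the events it produced on the first pass.
            for line, events in zip(self._raw_lines, self._events_per_line):
                yield from self._emit(line, events)
            return
        for line in lines:
            events = self._parser.parse_line(line)
            self._raw_lines.append(line)
            self._events_per_line.append(events)
            yield from self._emit(line, events)
        self._consumed = True

    @staticmethod
    def _emit(line, events):
        if events:
            yield line, events[0]
            for event in events[1:]:
                yield "", event  # empty string marks a continuation event
        else:
            yield line, None
```

Keeping the two lists parallel (one entry per raw line) avoids the index drift that occurs when multi-event lines are flattened into a single flat event list.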
Summary
Adds durable benchmark infrastructure for recording agent sessions consistently.
Changes
benchmarks.yaml: Defines reusable test scenarios with:
run_benchmark.sh: Script to run benchmarks via docker-compose
Available Benchmarks
Usage
Related