
feat(benchmarks): add benchmark definitions and runner script #54

Open
NeuralEmpowerment wants to merge 5 commits into main from feat/benchmark-infrastructure

Conversation

@NeuralEmpowerment
Contributor

Summary

Adds durable benchmark infrastructure for recording agent sessions consistently.

Changes

  • benchmarks.yaml: Defines reusable test scenarios with:

    • Prompt templates
    • Expected event counts and costs
    • Trigger behaviors to verify
  • run_benchmark.sh: Script to run benchmarks via docker-compose

Available Benchmarks

| Name | Description | Expected Cost |
| --- | --- | --- |
| simple-math | Baseline without tools | $0.001 |
| context-window-growth | Multi-language implementation | $0.15 |
| context-compaction | Explicitly triggers /compact command | $0.20 |
| multi-tool | Tool call sequences | $0.02 |
| subagent-demo | Subagent spawning | $0.10 |

Usage

```bash
./scripts/run_benchmark.sh context-compaction
```
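
For a fuller sketch, assuming the file layout in this PR (the listing query below is the same one run_benchmark.sh uses for its error output; the ANTHROPIC_API_KEY name is an assumption, since the PR only says the script checks for an API key):

```bash
# List the benchmark names defined in benchmarks.yaml (run from the repo root):
yq '.benchmarks | keys | .[]' providers/workspaces/claude-cli/fixtures/benchmarks.yaml

# Assumed env var name for the Claude API key; the script's actual check may differ.
export ANTHROPIC_API_KEY="sk-..."
./scripts/run_benchmark.sh context-compaction
```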

Related

  • AEF Issue #67 (Capture compaction recording)

Copilot AI review requested due to automatic review settings February 3, 2026 20:16

Copilot AI left a comment


Pull request overview

This PR introduces a small benchmarking infrastructure to consistently record Claude CLI agent sessions via Docker Compose.

Changes:

  • Adds scripts/run_benchmark.sh to select a benchmark scenario from YAML, ensure prerequisites (yq, API key), and run the recording via docker-compose.record.yaml.
  • Adds providers/workspaces/claude-cli/fixtures/benchmarks.yaml defining several reusable benchmark scenarios (simple math, context window growth, compaction, multi-tool, and subagent behavior) with metadata like expected events/cost and trigger tags.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/run_benchmark.sh | New helper script that loads a benchmark definition, prints summary info, wires TASK/PROMPT/API key, and runs the recording Docker Compose stack. |
| providers/workspaces/claude-cli/fixtures/benchmarks.yaml | Defines named benchmark scenarios and usage notes for recording and validating agent sessions. |


Comment on lines 37 to 42
```bash
if ! yq -e ".benchmarks.$BENCHMARK" "$BENCHMARKS_FILE" > /dev/null 2>&1; then
    echo "Error: Benchmark '$BENCHMARK' not found"
    echo ""
    echo "Available benchmarks:"
    yq '.benchmarks | keys | .[]' "$BENCHMARKS_FILE"
    exit 1
```

Copilot AI Feb 3, 2026


The yq queries that reference .benchmarks.$BENCHMARK will fail for all of the defined benchmarks because their keys contain hyphens (e.g., simple-math, context-window-growth), which yq/jq interpret as subtraction rather than part of the key. To reliably handle these benchmark names, the queries here (and in the extraction block below) should index the map using bracket notation with a quoted key, e.g. .benchmarks["$BENCHMARK"]...., instead of dot notation with an unquoted, hyphenated identifier.

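To illustrate the parsing issue the review describes, a minimal sketch (hypothetical file path; the bracket-notation form matches what the script adopts later in this PR):

```bash
# Dot notation: as the review notes, the hyphen can be parsed as
# subtraction (`.simple - math`), so the lookup fails or returns null:
yq '.benchmarks.simple-math.prompt' benchmarks.yaml

# Bracket notation with a quoted key resolves the hyphenated name correctly:
yq '.benchmarks["simple-math"].prompt' benchmarks.yaml

# With the name in a shell variable, double-quote the whole expression so the
# shell expands $BENCHMARK before yq parses it:
BENCHMARK="simple-math"
yq -r ".benchmarks[\"$BENCHMARK\"].prompt" benchmarks.yaml
```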
```bash
# # Or manually:
# cd providers/workspaces/claude-cli
# TASK="context-window-growth" \
# PROMPT="$(yq '.benchmarks.context-window-growth.prompt' ../fixtures/benchmarks.yaml)" \
```

Copilot AI Feb 3, 2026


In this usage example, the path ../fixtures/benchmarks.yaml is incorrect relative to providers/workspaces/claude-cli—the actual file lives at fixtures/benchmarks.yaml under that directory. Also, the yq selector .benchmarks.context-window-growth.prompt uses an unquoted, hyphenated key, which will be parsed as subtraction; this should be changed to bracket notation with a quoted key (e.g., .benchmarks["context-window-growth"].prompt) to match the actual benchmark name.

Suggested change
```diff
- # PROMPT="$(yq '.benchmarks.context-window-growth.prompt' ../fixtures/benchmarks.yaml)" \
+ # PROMPT="$(yq '.benchmarks["context-window-growth"].prompt' fixtures/benchmarks.yaml)" \
```

```bash
#
# Requires: yq (brew install yq)
#
# See fixtures/benchmarks.yaml for available benchmarks
```

Copilot AI Feb 3, 2026


This comment points to fixtures/benchmarks.yaml without a full path, but the actual file is under providers/workspaces/claude-cli/fixtures/benchmarks.yaml, which can be confusing when running the script from the repo root. To make the documentation accurate and easier to follow, update the reference here to include the full relative path that the script uses for BENCHMARKS_FILE.

Suggested change
```diff
- # See fixtures/benchmarks.yaml for available benchmarks
+ # See providers/workspaces/claude-cli/fixtures/benchmarks.yaml for available benchmarks
```

- Add v2.1.29_claude-sonnet-4-5_multi-model-usage/ recording
  - Shows Sonnet + Haiku in modelUsage breakdown
  - 16 events, $0.095 cost
  - Includes workspace files (h2o_lightyear.py)

- Fix docker-compose.record.yaml YAML syntax
  - Change array+block-scalar to list format for command

- Update benchmarks.yaml
  - Rename context-compaction to multi-model-usage
  - Add note that /compact is interactive (can't use in -p mode)

- Update README with new recording
- Use bracket notation for hyphenated yq keys (fixes subtraction parsing)
- Fix path reference in benchmarks.yaml usage example
- Use full path in run_benchmark.sh comment
Copilot AI review requested due to automatic review settings February 3, 2026 22:30

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.



```bash
# Usage:
# ./scripts/run_benchmark.sh <benchmark-name>
# ./scripts/run_benchmark.sh context-window-growth
# ./scripts/run_benchmark.sh context-compaction
```

Copilot AI Feb 3, 2026


The usage examples here reference a context-compaction benchmark, but fixtures/benchmarks.yaml does not define a context-compaction entry, so running ./scripts/run_benchmark.sh context-compaction will currently fail with "Benchmark 'context-compaction' not found". Either add a context-compaction benchmark definition or update the examples (and PR description) to reference an existing benchmark such as multi-model-usage.

Suggested change
```diff
- # ./scripts/run_benchmark.sh context-compaction
+ # ./scripts/run_benchmark.sh multi-model-usage
```

Comment on lines +116 to +123
```yaml
# Metadata
version: 1
last_updated: "2026-02-03"
notes: |
  Context compaction requires ~190K+ tokens to trigger.
  This typically means a very long session or analyzing a large codebase.
  The "context-compaction" benchmark may not always trigger compaction
  depending on Claude's context window size at the time.
```

Copilot AI Feb 3, 2026


These metadata notes describe a "context-compaction" benchmark, and the PR description also lists context-compaction as an available benchmark, but the benchmarks: map above only defines simple-math, context-window-growth, multi-model-usage, multi-tool, and subagent-demo—there is no context-compaction entry. Please either add a concrete context-compaction benchmark or update these notes and the PR description to match the actual set of defined benchmarks.

@@ -0,0 +1,17 @@
{"_recording": {"version": 1, "cli_version": "2.0.74", "model": "claude-sonnet-4-5", "provider": "claude", "task": "context-compaction", "recorded_at": "2026-02-03T20:17:51.472804+00:00", "duration_ms": 32861, "event_count": 16, "session_id": "d5365e9f-c555-4f7e-9077-74dc11455719", "capture_method": "container_logs"}}

Copilot AI Feb 3, 2026


The recording metadata header reports "cli_version": "2.0.74" and "task": "context-compaction", but the directory and README entry name this fixture v2.1.29_claude-sonnet-4-5_multi-model-usage, with CLI version shown as 2.1.29 and the description focused on multi-model usage. To avoid confusing consumers of this fixture, please align the metadata (at least cli_version, and ideally task) with the directory name and README row—for example by updating the header to use cli_version: "2.1.29" and a task label that matches multi-model-usage.

Suggested change
```diff
- {"_recording": {"version": 1, "cli_version": "2.0.74", "model": "claude-sonnet-4-5", "provider": "claude", "task": "context-compaction", "recorded_at": "2026-02-03T20:17:51.472804+00:00", "duration_ms": 32861, "event_count": 16, "session_id": "d5365e9f-c555-4f7e-9077-74dc11455719", "capture_method": "container_logs"}}
+ {"_recording": {"version": 1, "cli_version": "2.1.29", "model": "claude-sonnet-4-5", "provider": "claude", "task": "multi-model-usage", "recorded_at": "2026-02-03T20:17:51.472804+00:00", "duration_ms": 32861, "event_count": 16, "session_id": "d5365e9f-c555-4f7e-9077-74dc11455719", "capture_method": "container_logs"}}
```

@@ -0,0 +1,89 @@
```bash
#!/bin/bash
```

Copilot AI Feb 3, 2026


Other bash scripts in this repo (for example scripts/validate-stacks.sh:1) use #!/usr/bin/env bash for the shebang, whereas this script hard-codes #!/bin/bash. For portability and consistency with existing scripts, consider switching to #!/usr/bin/env bash here as well.

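For reference, a minimal sketch of the suggested shebang (the strict-mode line is my addition and is not confirmed to be in run_benchmark.sh):

```bash
#!/usr/bin/env bash
# Resolves bash via PATH instead of hard-coding /bin/bash, which also picks up
# a newer Homebrew bash on macOS; matches scripts/validate-stacks.sh.
set -euo pipefail  # assumption: the PR's script may not use strict mode
```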
BREAKING CHANGE: parse_line() now returns list[ObservabilityEvent] instead of ObservabilityEvent | None

Previously, assistant messages with tool_use would only emit TOOL_EXECUTION_STARTED,
losing the token usage data from message.usage. This caused token counts to show 0
when tools were being used.

Changes:
- parse_line() returns list (may be empty, or have multiple events)
- _handle_assistant() now emits TOKEN_USAGE first, then tool events
- _handle_user() also returns list for consistency
- Updated all tests to handle list return type
- Added test for dual-event emission

This fixes token tracking for sessions with tool calls.
Added tools_used field to ObservabilityEvent and populate it when
creating SUBAGENT_STOPPED events. This tracks which tools each
subagent used during execution.

The SubagentState was already tracking this data; it just wasn't being
passed to the event.
Copilot AI review requested due to automatic review settings February 3, 2026 22:52

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.



```bash
fi

# Extract benchmark details (use bracket notation for hyphenated keys)
PROMPT=$(yq -r ".benchmarks[\"$BENCHMARK\"].prompt" "$BENCHMARKS_FILE")
```

Copilot AI Feb 3, 2026


PROMPT is assigned from an unquoted command substitution. Bash does not word-split the right-hand side of a plain assignment, so the multi-line text from benchmarks.yaml (e.g., the block scalar prompts) survives this line as written; newlines are lost only if $PROMPT is later expanded unquoted. Quoting the substitution is still a reasonable consistency fix, but the important check is that every downstream use of $PROMPT is double-quoted so the multi-line prompt reaches the container intact.

Suggested change
```diff
- PROMPT=$(yq -r ".benchmarks[\"$BENCHMARK\"].prompt" "$BENCHMARKS_FILE")
+ PROMPT="$(yq -r ".benchmarks[\"$BENCHMARK\"].prompt" "$BENCHMARKS_FILE")"
```

Comment on lines +5 to +8
```bash
# ./scripts/run_benchmark.sh <benchmark-name>
# ./scripts/run_benchmark.sh context-window-growth
# ./scripts/run_benchmark.sh context-compaction
#
```

Copilot AI Feb 3, 2026


The usage examples reference a context-compaction benchmark, but there is no corresponding context-compaction key in benchmarks.yaml, so running this example will always fail the benchmark existence check. Either add a context-compaction entry to benchmarks.yaml or update the usage examples to point at an existing benchmark name.

Comment on lines +112 to +122
```python
# Parse to events (may return multiple for assistant messages with tool_use)
events = self._parser.parse_line(line)
self._events.extend(events)

# Yield each event with the line (first event gets the line, rest get None indicator)
if events:
    yield line, events[0]
    for event in events[1:]:
        yield "", event  # Empty string indicates continuation event
else:
    yield line, None
```

Copilot AI Feb 3, 2026


Now that parse_line() can return multiple events per line and tee() flattens them into self._events, the replay path above (if self._consumed) still assumes a 1:1 mapping between self._raw_lines and self._events (event = self._events[i]). On subsequent calls to tee() (and to events()/raw_lines(), which use it), extra events from lines that produced more than one event are therefore dropped or misaligned with their source line. The buffering structure needs to keep all events per line so replay preserves them correctly.
