Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions web/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*

node_modules
dist
dist-ssr
*.local

# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?
331 changes: 331 additions & 0 deletions web/README.md

Large diffs are not rendered by default.

683 changes: 683 additions & 0 deletions web/bun.lock

Large diffs are not rendered by default.

22 changes: 22 additions & 0 deletions web/components.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
{
"$schema": "https://ui.shadcn.com/schema.json",
"style": "new-york",
"rsc": false,
"tsx": true,
"tailwind": {
"config": "tailwind.config.js",
"css": "src/index.css",
"baseColor": "neutral",
"cssVariables": true,
"prefix": ""
},
"iconLibrary": "lucide",
"aliases": {
"components": "@/components",
"utils": "@/lib/utils",
"ui": "@/components/ui",
"lib": "@/lib",
"hooks": "@/hooks"
},
"registries": {}
}
50 changes: 50 additions & 0 deletions web/data/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
# Benchmark Data

This directory contains artifacts fetched from GitHub Actions workflows.

## Current Data

**Commit**: `ea446df3c3284cf6be379486a9807d0c48ef7d78`
**Workflow Run**: `19057352801` - "Publish and Benchmark Preview Packages"
**Fetched**: See `metadata.json` for details

## Artifacts Found

### Benchmark Artifacts (12)
- `benchmark-opencode-opencode-claude-sonnet-4-5-prismicio-community-course-fizzi-next`
- `benchmark-opencode-opencode-big-pickle-prismicio-community-course-fizzi-next`
- `benchmark-claude-code-claude-sonnet-4-5-prismicio-community-course-fizzi-next`
- `benchmark-opencode-opencode-claude-sonnet-4-5-AlaminPu1007-algorithm-visualizer`
- `benchmark-opencode-opencode-claude-sonnet-4-5-DataDog-datadog-lambda-python`
- `benchmark-claude-code-claude-sonnet-4-5-DataDog-datadog-lambda-python`
- `benchmark-claude-code-claude-sonnet-4-5-AlaminPu1007-algorithm-visualizer`
- `benchmark-codex-gpt-5-codex-prismicio-community-course-fizzi-next`
- `benchmark-opencode-opencode-big-pickle-DataDog-datadog-lambda-python`
- `benchmark-codex-gpt-5-codex-AlaminPu1007-algorithm-visualizer`
- `benchmark-codex-gpt-5-codex-DataDog-datadog-lambda-python`
- `benchmark-opencode-opencode-big-pickle-AlaminPu1007-algorithm-visualizer`

### Analysis Artifacts (3)
- `analysis-AlaminPu1007-algorithm-visualizer`
- `analysis-prismicio-community-course-fizzi-next`
- `analysis-DataDog-datadog-lambda-python`

## Downloading Artifacts

GitHub Actions artifacts require authentication. To download the artifacts, run:

```bash
GITHUB_TOKEN=your_token_here bun scripts/fetch-artifacts.ts
```

You can create a GitHub Personal Access Token with `actions:read` permission at:
https://github.com/settings/tokens

## Data Structure

Each benchmark artifact contains:
- `benchmark.json` - Full evaluation run export with scores, episodes, and usage data

Each analysis artifact contains:
- `analysis.txt` - Judge analysis text
- `analysis-info.json` - Metadata with eval info and job URL
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
{
"eval": "AlaminPu1007/algorithm-visualizer",
"safe": "AlaminPu1007-algorithm-visualizer",
"url": "https://github.com/sst/opencode-bench/actions/runs/19057352801/job/54432051186#step:7:0"
}
119 changes: 119 additions & 0 deletions web/data/analysis-AlaminPu1007-algorithm-visualizer/analysis.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
# Cross-Agent Benchmark Analysis

## Overall Performance Pattern

The benchmark reveals remarkably **consistent performance** across the top three agents, with scores clustering tightly between 0.804-0.827 (final) and 0.813-0.870 (base). This narrow range suggests the task has well-defined success criteria that multiple approaches can satisfy effectively.

### Performance Ranking
1. **claude-code (claude-sonnet-4-5)**: 0.827 final — Highest base score (0.870), lowest penalty
2. **opencode (big-pickle)**: 0.827 final — Tied for first, slightly higher penalty (0.043)
3. **opencode (claude-sonnet-4-5)**: 0.804 final — Higher penalty (0.054) dragged down final score
4. **codex (gpt-5-codex)**: 0.770 final — Technical failure prevented summary generation

## Key Insights

### 1. **Model Consistency vs. Scope Management**

**Claude-code's advantage**: Achieved the highest base score (0.870) with minimal penalty (0.043) by maintaining **strict consistency** across episodes. The agent followed an identical approach in all three runs, with only Episode 1 showing minor exploratory variation (schema file search).

**OpenCode's trade-off**: Both OpenCode variants achieved identical base scores to claude-code (0.870 for big-pickle, 0.858 for claude-sonnet-4-5) but incurred higher penalties. The big-pickle variant matched claude-code's penalty, while the claude-sonnet-4-5 variant showed **25% higher penalty** (0.054 vs 0.043), suggesting:
- Episode 2 was "more focused" (documentation only)
- Episodes 1 & 3 included "comprehensive UI refinements"
- **Inconsistent scope across episodes** likely triggered the penalty increase

### 2. **Systematic Approach Patterns**

All successful agents followed a similar workflow:
```
Explore → Identify files → Update version → Document changes → Verify
```

**Differentiation emerged in execution details:**

- **Claude-code**: Used `TodoWrite` for explicit task tracking, demonstrating structured project management
- **OpenCode (big-pickle)**: Emphasized verification with "linting and build verification commands" after changes
- **OpenCode (claude-sonnet-4-5)**: Focused on tool efficiency, using "Edit operations extensively" with occasional Read operations

### 3. **The Episode Consistency Problem**

The **most significant performance differentiator** was episode-to-episode consistency:

- **Claude-code**: "All three episodes achieved identical outcomes"
- **OpenCode (claude-sonnet-4-5)**: "Episode 2 was more focused... Episodes 1 and 3 included comprehensive UI refinements"

This suggests the evaluation system **penalizes scope variation** across episodes, even when individual episodes are successful. The 0.054 penalty for opencode/claude-sonnet-4-5 (vs 0.043 for others) directly correlates with its documented inconsistency.

### 4. **Technical Failure Analysis**

**Codex (gpt-5-codex)** encountered a critical infrastructure issue: "Body has already been used. It can only be read once." This is a **stream consumption error**, not an agent logic failure. Despite this, it achieved:
- Base score: 0.813 (only 7% below the leader)
- Same penalty structure: 0.043

This suggests the agent **completed the task successfully** but failed during post-processing/summary generation, indicating a **tooling issue** rather than capability gap.

## Performance Gaps Analysis

### Largest Delta: 0.057 (claude-code vs codex)
- **Primary factor**: Codex's base score (0.813 vs 0.870) — a 6.5% gap
- **Secondary factor**: Summary generation failure suggests incomplete observability
- **Implication**: The gap may be smaller than it appears if the technical issue masked successful work

### Smallest Delta: 0.000 (claude-code vs opencode/big-pickle)
- Both achieved identical final scores through different paths
- Big-pickle emphasized **verification** (linting, builds)
- Claude-code emphasized **planning** (todo tracking)
- **Implication**: Multiple valid strategies exist for this task type

## Agent Behavioral Tendencies

### Safety vs. Completeness Trade-offs

1. **Claude-code**: Prioritizes **reproducibility** — identical outcomes across episodes suggest conservative, proven approach
2. **OpenCode (big-pickle)**: Prioritizes **validation** — explicit mention of quality checks indicates defensive programming
3. **OpenCode (claude-sonnet-4-5)**: Prioritizes **efficiency** — variable scope suggests optimization for individual episode requirements

### Tool Usage Patterns

- **TodoWrite** (claude-code only): Explicit task management overhead, but may improve consistency
- **Verification commands** (big-pickle only): Quality assurance overhead, but catches errors early
- **Read-then-Edit** (claude-sonnet-4-5): More cautious file modification approach

## Recommendations

### 1. **For Evaluation System Improvement**
- **Clarify penalty criteria**: Document whether episode consistency is required or if adaptive scope is acceptable
- **Fix infrastructure issues**: Resolve stream consumption errors affecting summary generation
- **Add consistency metrics**: Explicitly measure and report episode-to-episode variance

### 2. **For Agent Development**

**To match claude-code's performance:**
- Implement explicit task tracking (TodoWrite or equivalent)
- Standardize episode workflows to minimize variation
- Front-load exploration to avoid mid-episode scope changes

**To improve on current leaders:**
- Combine claude-code's consistency with big-pickle's verification
- Add automated testing to catch regressions
- Implement episode planning phase to determine optimal scope upfront

### 3. **For Future Experiments**

**Test hypothesis**: Does episode consistency matter?
- Run controlled experiment with intentionally variable vs. fixed scope
- Measure penalty impact independently

**Investigate codex failure:**
- Reproduce stream consumption error
- Determine if it's model-specific or infrastructure-wide
- Assess true capability gap once technical issues resolved

**Benchmark verification strategies:**
- Compare outcomes with/without explicit linting steps
- Measure correlation between verification overhead and final scores

## Conclusion

This benchmark reveals a **mature evaluation environment** where multiple agents achieve near-identical results through different optimization strategies. The 2.3% spread between top performers suggests diminishing returns on further optimization without changing the task complexity.

The most actionable insight: **Episode consistency appears more valuable than per-episode optimization**. Claude-code's success stems from reproducible execution, not superior individual episode performance. Future development should prioritize workflow standardization over adaptive intelligence for this task class.
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
{
"eval": "DataDog/datadog-lambda-python",
"safe": "DataDog-datadog-lambda-python",
"url": "https://github.com/sst/opencode-bench/actions/runs/19057352801/job/54432051170#step:7:0"
}
160 changes: 160 additions & 0 deletions web/data/analysis-DataDog-datadog-lambda-python/analysis.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
# Cross-Run Analysis: Batch Item Failures Metric Implementation

## Executive Summary
All four runs successfully implemented the same feature—tracking Lambda batch item failures as an enhanced metric—but with significant variations in execution quality, test coverage, and efficiency. The performance gap between the top performer (opencode/claude-sonnet-4-5 at 0.416) and bottom performer (codex/gpt-5-codex at 0.127) reveals important patterns about agent behavior and evaluation criteria.

---

## 1. Systematic Performance Patterns

### Clear Tier Separation
- **Top Tier** (0.30-0.42): Both Claude Sonnet 4-5 agents (opencode and claude-code)
- **Bottom Tier** (0.13-0.17): Big-pickle and GPT-5-codex agents

The Claude Sonnet 4-5 model demonstrates **2-3x better performance** regardless of agent framework, suggesting model capability is the dominant factor in this benchmark.

### Penalty Analysis
Interestingly, the penalty structure reveals inverse correlation with base scores:
- **opencode/claude-sonnet-4-5**: Highest base (0.522) but also highest penalty (0.106) = 20% reduction
- **codex/gpt-5-codex**: Lowest base (0.144) but also lowest penalty (0.017) = 12% reduction

This suggests the evaluation system may penalize ambitious implementations that attempt more comprehensive solutions, or that higher-performing agents trigger more edge-case violations.

---

## 2. Implementation Quality Differences

### Code Placement & Integration
All agents placed the core function in `datadog_lambda/metric.py` and integrated it into `wrapper.py`, but with subtle differences:

**Consistent Winners (Claude Sonnet 4-5):**
- Precise line number documentation (e.g., "lines 231-271")
- Clear integration points specified (e.g., "lines 294-297 or 366-370")
- Explicit mention of `force_async=True` rationale

**Inconsistent Performers (Big-pickle, GPT-5-codex):**
- Vaguer location descriptions (e.g., "around line 367-386")
- Less detail on integration strategy
- GPT-5-codex used "lazy imports" to avoid circular dependencies—a defensive pattern that may indicate uncertainty

### Validation Logic Depth
All implementations included multi-layer validation, but descriptions vary:

**opencode/claude-sonnet-4-5** explicitly lists 4 validation layers:
1. Enhanced metrics enabled check
2. Response is dictionary
3. `batchItemFailures` exists and is a list
4. Only submit when failures present

**codex/gpt-5-codex** mentions similar checks but less systematically, focusing more on edge cases (None responses, non-dict responses).

---

## 3. Testing Strategy Divergence

### Test Coverage Spectrum

| Agent | Unit Tests | Integration Tests | Test File Strategy |
|-------|-----------|-------------------|-------------------|
| opencode/claude-sonnet-4-5 | 8-9 tests | 3-6 tests (Episode 3: extensive) | Existing files |
| claude-code | 8-10 tests | Episode 3 only | Existing files |
| opencode/big-pickle | 9 tests | 7-8 tests | **New file created** |
| codex/gpt-5-codex | Comprehensive | Not documented | Existing files |

**Critical Insight**: opencode/big-pickle created a **new test file** (`test_metric_with_batch_failures.py`) rather than extending existing test files. This deviation from codebase patterns likely contributed to its lower score, as it:
- Increases maintenance burden
- Fragments test organization
- Suggests misunderstanding of project structure

### Episode-by-Episode Consistency

**opencode/claude-sonnet-4-5**: Shows progressive improvement across episodes, with Episode 3 adding "most extensive integration tests, including specific SQS and Kinesis batch scenarios"—demonstrating learning and refinement.

**opencode/big-pickle**: Episodes 2 and 3 "found the implementation already complete"—suggesting the agent may have cached or reused Episode 1 work rather than re-implementing, which could explain lower scores if the evaluation penalizes this behavior.

---

## 4. Agent Behavioral Tendencies

### Exploration vs. Execution Trade-offs

**codex/gpt-5-codex** shows extreme variation in action counts:
- Episode 1: 74 actions
- Episode 2: **183 actions** (2.5x more)
- Episode 3: 73 actions

Episode 2's extensive exploration ("more extensive investigation of patching strategies") suggests the agent got stuck in analysis paralysis, potentially explaining its low score despite "strong pattern recognition."

**Tool Usage Patterns:**
- GPT-5-codex heavily relied on `bash` commands (`sed`, `grep`) for code inspection
- Claude agents appear to have more direct code understanding, requiring less exploratory tooling
- Missing `rg` (ripgrep) forced fallbacks in GPT-5-codex, indicating environment assumptions

### Safety vs. Completeness

**claude-code** summary notes: "The implementation was consistent across episodes, demonstrating a methodical approach"—prioritizing reliability.

**opencode/big-pickle** emphasizes: "All episodes concluded with successful test execution and code quality validation using pytest and flake8, confirming zero regressions"—prioritizing safety verification.

This defensive posture may explain why big-pickle scored lower: the evaluation may reward feature completeness over regression prevention.

---

## 5. Notable Contrasts & Anomalies

### The "Already Complete" Phenomenon
opencode/big-pickle's Episodes 2-3 finding work "already complete" is highly unusual and suggests:
1. **Caching issue**: Agent reused previous episode state
2. **Evaluation design**: Episodes may not properly isolate runs
3. **Agent confusion**: Misidentified existing code as its own work

This deserves investigation as it could indicate a fundamental evaluation flaw.

### Documentation Quality Paradox
The **most detailed summary** (opencode/claude-sonnet-4-5) correlates with the **highest score**, but also the **highest penalty**. This suggests:
- Detailed documentation may reveal more issues to evaluators
- Or, comprehensive implementations naturally have more edge cases to penalize
- The penalty system may need recalibration to avoid punishing thoroughness

### Model Dominance
The fact that **both Claude Sonnet 4-5 agents** occupy the top tier regardless of agent framework (opencode vs. claude-code) indicates:
- Model capability >> Agent framework design for this task
- The benchmark effectively measures model reasoning over agent orchestration
- Future improvements should focus on model selection over agent architecture

---

## 6. Recommendations

### For Evaluation System Improvements
1. **Investigate penalty calibration**: Why does higher base performance correlate with higher penalties? Consider whether this discourages comprehensive solutions.

2. **Episode isolation verification**: The "already complete" issue in big-pickle runs suggests episodes may not be properly isolated. Verify each episode starts from clean state.

3. **Clarify scoring criteria**: Document whether creating new test files vs. extending existing ones affects scores, and why.

4. **Tool availability standardization**: Ensure all agents have access to expected tools (e.g., `rg`) or document fallback strategies.

### For Agent Development
1. **Pattern recognition training**: GPT-5-codex showed "strong pattern recognition" but still scored lowest—investigate whether this capability translates to correct implementation decisions.

2. **Exploration efficiency**: Implement guardrails to prevent analysis paralysis (e.g., GPT-5-codex's 183-action Episode 2). Consider action budgets or progress metrics.

3. **Codebase structure understanding**: Train agents to extend existing test files rather than creating new ones, following project conventions.

4. **Integration point precision**: Top performers specified exact line numbers and integration rationale—this should be a training target.

### For Future Benchmarks
1. **Add complexity tiers**: This task may be too straightforward for Claude Sonnet 4-5, causing ceiling effects. Consider more challenging scenarios.

2. **Measure efficiency**: Track actions-per-episode and correlate with quality to identify optimal exploration/execution ratios.

3. **Test edge case handling**: The summaries mention validation logic but don't detail how agents handle malformed responses—add specific edge case scenarios to evaluation.

4. **Cross-episode learning**: Evaluate whether agents improve from Episode 1→3 or simply repeat the same approach (as most did here).

---

## Conclusion

This benchmark reveals a **model-dominated landscape** where Claude Sonnet 4-5's reasoning capabilities drive 70-80% of performance variance. However, the **penalty structure anomaly** and **episode isolation concerns** suggest evaluation system refinements could provide clearer signals. The most actionable insight: **comprehensive documentation and testing correlate with higher base scores but also higher penalties**—this relationship deserves deeper investigation to ensure the evaluation rewards rather than punishes thoroughness.
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
{
"eval": "prismicio-community/course-fizzi-next",
"safe": "prismicio-community-course-fizzi-next",
"url": "https://github.com/sst/opencode-bench/actions/runs/19057352801/job/54432051205#step:7:0"
}
Loading
Loading