
test: verify benchmark status check on PR#670

Open
RobotSail wants to merge 2 commits into akashgit:main from RobotSail:test-benchmark-status

Conversation

@RobotSail

Contributor

Trivial change to test the new benchmark-eval pending status.

@codecov

codecov Bot commented Jun 21, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.84%. Comparing base (c91f086) to head (06ba3bb).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #670   +/-   ##
=======================================
  Coverage   87.84%   87.84%           
=======================================
  Files          70       70           
  Lines       10636    10636           
=======================================
  Hits         9343     9343           
  Misses       1293     1293           


@github-actions

github-actions Bot commented Jun 21, 2026


Benchmark Results

swebench

| Field | Value |
| --- | --- |
| Benchmark | swebench |
| Instance | sympy__sympy-20590 |
| Result | ❌ NOT RESOLVED |
| Score | 0 |
| Duration | 3690s |
Full JSON
{
  "benchmark": "swebench",
  "instance_id": "sympy__sympy-20590",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 3690,
  "status": "success",
  "timestamp": "20260621T232543Z",
  "details": {
    "solver": "factory",
    "cost_usd": 24.93566595,
    "input_tokens": 2445,
    "output_tokens": 131800,
    "cache_read_tokens": 22194327,
    "cache_creation_tokens": 0
  }
}
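
The summary fields above can be read straight from this payload. A minimal sketch, assuming the JSON shape shown here; `summarize_result` is a hypothetical helper for illustration and is not taken from the repository's benchmark scripts:

```python
import json


def summarize_result(payload: str) -> dict:
    """Map a benchmark result payload to the rows of the summary table.

    Hypothetical helper: the key names come from the JSON above, while the
    function itself is an assumption made for illustration.
    """
    data = json.loads(payload)
    return {
        "Benchmark": data["benchmark"],
        "Instance": data["instance_id"],
        # "resolved" carries the outcome; "status" is a separate field.
        "Result": "RESOLVED" if data["resolved"] else "NOT RESOLVED",
        "Score": data["score"],
        "Duration": f'{data["duration_seconds"]}s',
    }


example = json.dumps({
    "benchmark": "swebench",
    "instance_id": "sympy__sympy-20590",
    "score": 0,
    "resolved": False,
    "duration_seconds": 3690,
    "status": "success",
})
summary = summarize_result(example)
```

Note that this run reports `"status": "success"` alongside `"resolved": false`, which suggests `status` tracks whether the run completed and not whether the instance was solved. That reading is an inference from this one payload.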

Comparison vs Baseline

| Benchmark | Solver | Score vs Baseline | Cost vs Baseline | Duration |
| --- | --- | --- | --- | --- |
| swebench | factory | 0 = 0.0000 | $24.94 = $0.00 | 3690s |

Baseline: latest main branch run. ▲ = improvement (higher score or lower cost), ▼ = regression.
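
The legend can be expressed as a small comparison rule. This is a sketch under the assumption that an unchanged value is shown as `=`, as in the row above; `delta_marker` is a hypothetical name and not the workflow's actual implementation:

```python
def delta_marker(current: float, baseline: float, higher_is_better: bool) -> str:
    """Return the marker for a metric compared against its baseline.

    Hypothetical helper: score is treated as higher-is-better and cost as
    lower-is-better, following the legend.
    """
    if current == baseline:
        return "="
    if higher_is_better:
        improved = current > baseline
    else:
        improved = current < baseline
    return "▲" if improved else "▼"


# Score: higher is better. Cost: lower is better.
score_marker = delta_marker(0, 0, higher_is_better=True)
cost_marker = delta_marker(24.94, 24.94, higher_is_better=False)
```

With the values from this run, both metrics match the baseline, so both markers come out as `=`.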

How these benchmarks run

Factory solver: Runs `factory ceo . --headless --no-github --prompt <task>`, the full factory loop (research → strategize → build → review). See `benchmarks/run-swebench.sh`.

Claude Code solver: Runs `claude -p <task> --model claude-opus-4-6[1m] --max-turns 200` as a single-shot solve. Uses the same script files as factory, switched via the `--solver` flag.

TerminalBench: Uses the Harbor framework. Factory runs via a custom `factory_harbor_agent.py`; Claude Code uses Harbor's built-in agent.

ProgramBench: Both solvers run inside a Docker cleanroom container. See `benchmarks/run-programbench.sh`.

Config: `claude-opus-4-6[1m]`, effort=XHIGH, thinking=128K tokens. See `benchmarks/lib.sh`.
