demo: test full benchmark-eval check requires all 4 factory benchmarks #675

Closed
RobotSail wants to merge 1 commit into akashgit:main from RobotSail:demo/benchmark-check

@RobotSail

Contributor

Test PR to verify that the benchmark-eval status check requires all 4 factory benchmarks to run before it flips to success.
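A minimal sketch of the gating behavior this PR is meant to exercise, not the repository's actual workflow code. It assumes the four required benchmarks are the ones reported in this PR (featurebench, swebench, terminalbench, programbench) and borrows the `pending` and `success` state names from GitHub commit statuses. The function name `benchmark_eval_state` is hypothetical.

```python
# Hypothetical sketch of the "all four must run" gate; names are assumptions.
REQUIRED_BENCHMARKS = frozenset(
    {"featurebench", "swebench", "terminalbench", "programbench"}
)


def benchmark_eval_state(reported):
    """Return a commit-status state for the benchmark-eval check.

    `reported` is an iterable of benchmark names that have finished running.
    The check stays pending until every required benchmark has reported.
    """
    missing = REQUIRED_BENCHMARKS - set(reported)
    return "pending" if missing else "success"


# Only two of the four benchmarks have reported: the check must not flip yet.
print(benchmark_eval_state(["swebench", "programbench"]))
# All four have reported: the check may flip to success.
print(benchmark_eval_state(REQUIRED_BENCHMARKS))
```

In this sketch the state depends only on whether each benchmark ran, not on whether it resolved its instance, which matches the wording of the PR description.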

@codecov

codecov Bot commented Jun 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.84%. Comparing base (297cd34) to head (c0739c2).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #675   +/-   ##
=======================================
  Coverage   87.84%   87.84%           
=======================================
  Files          70       70           
  Lines       10636    10636           
=======================================
  Hits         9343     9343           
  Misses       1293     1293           

@github-actions

Benchmark Results

featurebench

| Field | Value |
| --- | --- |
| Benchmark | featurebench |
| Instance | pypa__packaging.013f3b03.test_metadata.e00b5801.lv1 |
| Result | ❌ NOT RESOLVED |
| Score | 0 |
| Duration | 2710s |

Full JSON
{
  "benchmark": "featurebench",
  "instance_id": "pypa__packaging.013f3b03.test_metadata.e00b5801.lv1",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 2710,
  "status": "success",
  "timestamp": "20260622T010340Z",
  "details": {
    "pass_rate": 0,
    "solver": "factory",
    "cost_usd": 26.382671450000004,
    "input_tokens": 4524,
    "output_tokens": 134807,
    "cache_read_tokens": 28296214,
    "cache_creation_tokens": 0
  }
}
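The summary fields above appear to be derived directly from the JSON record. The following is a sketch of one way that mapping could work, assuming the Result label comes from `resolved` and the Duration from `duration_seconds`. The helper `summary_rows` is hypothetical and is not taken from the benchmark scripts.

```python
import json

# A subset of the featurebench record posted above.
record = json.loads("""
{
  "benchmark": "featurebench",
  "instance_id": "pypa__packaging.013f3b03.test_metadata.e00b5801.lv1",
  "score": 0,
  "resolved": false,
  "duration_seconds": 2710
}
""")


def summary_rows(rec):
    """Map a result record to (field, value) pairs for the summary table."""
    result = "✅ RESOLVED" if rec["resolved"] else "❌ NOT RESOLVED"
    return [
        ("Benchmark", rec["benchmark"]),
        ("Instance", rec["instance_id"]),
        ("Result", result),
        ("Score", str(rec["score"])),
        ("Duration", f"{rec['duration_seconds']}s"),
    ]


for field, value in summary_rows(record):
    print(f"{field}: {value}")
```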

swebench

| Field | Value |
| --- | --- |
| Benchmark | swebench |
| Instance | sympy__sympy-20590 |
| Result | ❌ NOT RESOLVED |
| Score | 0 |
| Duration | 1696s |

Full JSON
{
  "benchmark": "swebench",
  "instance_id": "sympy__sympy-20590",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 1696,
  "status": "success",
  "timestamp": "20260622T010340Z",
  "details": {
    "solver": "factory",
    "cost_usd": 14.787371750000002,
    "input_tokens": 1035,
    "output_tokens": 59611,
    "cache_read_tokens": 12602860,
    "cache_creation_tokens": 0
  }
}

terminalbench

| Field | Value |
| --- | --- |
| Benchmark | terminalbench |
| Instance | fix-git |
| Result | ❌ NOT RESOLVED |
| Score | 0 |
| Duration | 1708s |

Full JSON
{
  "benchmark": "terminalbench",
  "instance_id": "fix-git",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 1708,
  "status": "failed",
  "timestamp": "20260622T010344Z",
  "details": {
    "solver": "factory",
    "cost_usd": 0,
    "input_tokens": 0,
    "output_tokens": 0,
    "cache_read_tokens": 0,
    "cache_creation_tokens": 0
  }
}

programbench

| Field | Value |
| --- | --- |
| Benchmark | programbench |
| Instance | abishekvashok__cmatrix.5c082c6 |
| Result | ✅ RESOLVED |
| Score | 0.0013 |
| Duration | 1201s |

Full JSON
{
  "benchmark": "programbench",
  "instance_id": "abishekvashok__cmatrix.5c082c6",
  "solver": "factory",
  "passed": 1,
  "total": 769,
  "score": 0.0013,
  "resolved": true,
  "duration_seconds": 1201,
  "status": "success",
  "timestamp": "20260622T010345Z",
  "details": {
    "solver": "factory",
    "cost_usd": 10.304329,
    "input_tokens": 210,
    "output_tokens": 47047,
    "cache_read_tokens": 12545658,
    "cache_creation_tokens": 0
  }
}

Overall: 0.0% accuracy (= +0.0% vs main) | $17.16 avg cost | 1829s avg duration
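The overall figures can be reproduced from the four records above under a few assumptions that the comment does not state explicitly: accuracy is the mean of the per-benchmark `score` values, average cost skips runs that report `cost_usd` of 0 (here terminalbench, shown as N/A in the comparison), and average duration covers all four runs. The function `overall` is a hypothetical sketch, not the reporting script itself.

```python
# Values copied from the four JSON records above.
runs = [
    {"benchmark": "featurebench", "score": 0,
     "cost_usd": 26.382671450000004, "duration_seconds": 2710},
    {"benchmark": "swebench", "score": 0,
     "cost_usd": 14.787371750000002, "duration_seconds": 1696},
    {"benchmark": "terminalbench", "score": 0,
     "cost_usd": 0, "duration_seconds": 1708},
    {"benchmark": "programbench", "score": 0.0013,
     "cost_usd": 10.304329, "duration_seconds": 1201},
]


def overall(runs):
    """Aggregate runs into the one-line summary (assumed formulas)."""
    scores = [r["score"] for r in runs]
    # Assumption: a zero cost means "not recorded" and is excluded.
    costs = [r["cost_usd"] for r in runs if r["cost_usd"] > 0]
    durations = [r["duration_seconds"] for r in runs]
    accuracy = 100 * sum(scores) / len(scores)
    avg_cost = sum(costs) / len(costs) if costs else 0.0
    avg_duration = sum(durations) / len(durations)
    return (f"{accuracy:.1f}% accuracy | ${avg_cost:.2f} avg cost | "
            f"{avg_duration:.0f}s avg duration")


print(overall(runs))
# → 0.0% accuracy | $17.16 avg cost | 1829s avg duration
```

Under these assumptions the mean score is 0.0013 / 4 ≈ 0.03%, which rounds to the reported 0.0% even though one of the four instances is marked resolved.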

Comparison vs Main

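| Benchmark | Solver | Score | vs Main | Cost | vs Main | Duration | vs Main |
| --- | --- | --- | --- | --- | --- | --- | --- |
| featurebench | factory | 0 | = 0% | $26.38 | = $0.00 | 2710s | = 0s |
| swebench | factory | 0 | = 0% | $14.79 | = $0.00 | 1696s | = 0s |
| terminalbench | factory | 0 | = 0% | N/A | N/A | 1708s | = 0s |
| programbench | factory | 0.0013 | +0.0% = | $10.30 | = $0.00 | 1201s | = 0s |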

Baseline: latest main branch run per benchmark+solver. ▲ = improvement, ▼ = regression.
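The legend can be read as a three-way comparison against the baseline. The sketch below assumes a higher score counts as an improvement while a lower cost or duration counts as an improvement; the comment does not spell out this direction, and `delta_marker` is a hypothetical helper.

```python
def delta_marker(current, baseline, higher_is_better):
    """Return "=", "▲" (improvement), or "▼" (regression) vs the baseline."""
    if current == baseline:
        return "="
    improved = (current > baseline) == higher_is_better
    return "▲" if improved else "▼"


# Values from this PR: every metric matches main, so each marker is "=".
print(delta_marker(0.0013, 0.0013, higher_is_better=True))
print(delta_marker(1201, 1201, higher_is_better=False))

# Hypothetical baseline of 1300s (not from this PR): a shorter run improves.
print(delta_marker(1201, 1300, higher_is_better=False))
```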

How these benchmarks run

Factory solver: Runs `factory ceo . --headless --no-github --prompt <task>`, the full factory loop (research → strategize → build → review). See `benchmarks/run-swebench.sh`.

Claude Code solver: Runs `claude -p <task> --model claude-opus-4-6[1m] --max-turns 200` as a single-shot solve. It uses the same script files as the factory solver, switched via the `--solver` flag.

TerminalBench: Uses the Harbor framework. Factory runs via a custom `factory_harbor_agent.py`; Claude Code uses Harbor's built-in agent.

ProgramBench: Both solvers run inside a Docker cleanroom container. See `benchmarks/run-programbench.sh`.

Config: `claude-opus-4-6[1m]`, effort=XHIGH, thinking=128K tokens. See `benchmarks/lib.sh`.
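As an illustration of how one script can switch between the two solvers, here is a sketch that builds the argument list for each command quoted above. The real scripts are shell (`benchmarks/run-swebench.sh` and others) and are not shown in this PR, so the function `solver_command` and the solver key `"claude-code"` are assumptions. Only the command lines themselves come from the comment.

```python
import shlex


def solver_command(solver, task):
    """Build the argv list for a solver, per the commands described above."""
    if solver == "factory":
        # Full factory loop: research, strategize, build, review.
        return ["factory", "ceo", ".", "--headless", "--no-github",
                "--prompt", task]
    if solver == "claude-code":  # hypothetical key for the Claude Code solver
        # Single-shot solve with a fixed model and turn limit.
        return ["claude", "-p", task, "--model", "claude-opus-4-6[1m]",
                "--max-turns", "200"]
    raise ValueError(f"unknown solver: {solver!r}")


# Print shell-quoted command lines for a placeholder task (nothing is run).
for name in ("factory", "claude-code"):
    print(shlex.join(solver_command(name, "Fix the failing test")))
```

Passing the task as a single list element avoids shell-quoting problems when the prompt contains spaces or quotes.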

@osilkin98 closed this Jul 1, 2026