demo: test full benchmark-eval check requires all 4 factory benchmarks by RobotSail · Pull Request #675 · akashgit/remote-factory

RobotSail · 2026-06-22T01:03:17Z

Test PR to verify that the benchmark-eval status check requires all 4 factory benchmarks to run before it flips to success.
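The gating behaviour described above can be sketched as follows. This is a minimal illustration, not the repository's actual workflow logic: `REQUIRED_BENCHMARKS` and `benchmark_eval_state` are hypothetical names, and the four benchmark names are taken from the results reported later in this PR.

```python
# Hypothetical sketch: the real check lives in the repository's CI workflow,
# which is not shown in this PR. Names below are illustrative only.
REQUIRED_BENCHMARKS = {"featurebench", "swebench", "terminalbench", "programbench"}


def benchmark_eval_state(reported):
    """Return 'success' only once every required benchmark has reported."""
    missing = REQUIRED_BENCHMARKS - set(reported)
    return "success" if not missing else "pending"


# Only two of the four benchmarks have reported, so the check stays pending.
print(benchmark_eval_state(["featurebench", "swebench"]))
```

Under this assumption the check depends only on whether each benchmark ran, not on whether it resolved its task.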

codecov · 2026-06-22T01:08:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.84%. Comparing base (297cd34) to head (c0739c2).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #675   +/-   ##
=======================================
  Coverage   87.84%   87.84%           
=======================================
  Files          70       70           
  Lines       10636    10636           
=======================================
  Hits         9343     9343           
  Misses       1293     1293

github-actions · 2026-06-22T01:49:01Z

Benchmark Results

featurebench

| Field | Value |
| --- | --- |
| Benchmark | featurebench |
| Instance | `pypa__packaging.013f3b03.test_metadata.e00b5801.lv1` |
| Result | ❌ NOT RESOLVED |
| Score | 0 |
| Duration | 2710s |

Full JSON

{
  "benchmark": "featurebench",
  "instance_id": "pypa__packaging.013f3b03.test_metadata.e00b5801.lv1",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 2710,
  "status": "success",
  "timestamp": "20260622T010340Z",
  "details": {
    "pass_rate": 0,
    "solver": "factory",
    "cost_usd": 26.382671450000004,
    "input_tokens": 4524,
    "output_tokens": 134807,
    "cache_read_tokens": 28296214,
    "cache_creation_tokens": 0
  }
}

swebench

| Field | Value |
| --- | --- |
| Benchmark | swebench |
| Instance | `sympy__sympy-20590` |
| Result | ❌ NOT RESOLVED |
| Score | 0 |
| Duration | 1696s |

Full JSON

{
  "benchmark": "swebench",
  "instance_id": "sympy__sympy-20590",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 1696,
  "status": "success",
  "timestamp": "20260622T010340Z",
  "details": {
    "solver": "factory",
    "cost_usd": 14.787371750000002,
    "input_tokens": 1035,
    "output_tokens": 59611,
    "cache_read_tokens": 12602860,
    "cache_creation_tokens": 0
  }
}

terminalbench

| Field | Value |
| --- | --- |
| Benchmark | terminalbench |
| Instance | `fix-git` |
| Result | ❌ NOT RESOLVED |
| Score | 0 |
| Duration | 1708s |

Full JSON

{
  "benchmark": "terminalbench",
  "instance_id": "fix-git",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 1708,
  "status": "failed",
  "timestamp": "20260622T010344Z",
  "details": {
    "solver": "factory",
    "cost_usd": 0,
    "input_tokens": 0,
    "output_tokens": 0,
    "cache_read_tokens": 0,
    "cache_creation_tokens": 0
  }
}
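Unlike the other runs, this result reports `"status": "failed"` together with zero cost and zero token counts. A minimal sketch of telling the two kinds of non-resolution apart, assuming `status` describes whether the harness run completed and `resolved` describes whether the task was solved. That reading is inferred from the field names and is not confirmed by the source; `classify` and its return labels are hypothetical.

```python
import json


def classify(result):
    """Separate a run that did not complete from one that ran but did not solve the task.

    Assumption: "status" reflects the harness run, "resolved" the task outcome.
    """
    if result["status"] != "success":
        return "run_failed"
    return "resolved" if result["resolved"] else "not_resolved"


# Trimmed copies of two results from this comment (only the fields used here).
terminalbench = json.loads('{"benchmark": "terminalbench", "status": "failed", "resolved": false}')
swebench = json.loads('{"benchmark": "swebench", "status": "success", "resolved": false}')

print(classify(terminalbench))
print(classify(swebench))
```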

programbench

| Field | Value |
| --- | --- |
| Benchmark | programbench |
| Instance | `abishekvashok__cmatrix.5c082c6` |
| Result | ✅ RESOLVED |
| Score | 0.0013 |
| Duration | 1201s |

Full JSON

{
  "benchmark": "programbench",
  "instance_id": "abishekvashok__cmatrix.5c082c6",
  "solver": "factory",
  "passed": 1,
  "total": 769,
  "score": 0.0013,
  "resolved": true,
  "duration_seconds": 1201,
  "status": "success",
  "timestamp": "20260622T010345Z",
  "details": {
    "solver": "factory",
    "cost_usd": 10.304329,
    "input_tokens": 210,
    "output_tokens": 47047,
    "cache_read_tokens": 12545658,
    "cache_creation_tokens": 0
  }
}

Overall: 0.0% accuracy (= +0.0% vs main) | $17.16 avg cost | 1829s avg duration
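The overall line can be reproduced from the four results above. The sketch below is one way to arrive at the same figures, not the bot's actual code. It assumes accuracy is the mean score, average duration covers all four runs, and average cost skips runs with zero recorded cost (terminalbench, shown as N/A in the comparison table). The `summarize` function is hypothetical, and `cost_usd` is flattened out of `details` for brevity.

```python
# Values copied from the four "Full JSON" results in this comment.
results = [
    {"benchmark": "featurebench", "score": 0, "cost_usd": 26.382671450000004, "duration_seconds": 2710},
    {"benchmark": "swebench", "score": 0, "cost_usd": 14.787371750000002, "duration_seconds": 1696},
    {"benchmark": "terminalbench", "score": 0, "cost_usd": 0, "duration_seconds": 1708},
    {"benchmark": "programbench", "score": 0.0013, "cost_usd": 10.304329, "duration_seconds": 1201},
]


def summarize(results):
    """Aggregate per-benchmark results into a one-line summary (assumed formulas)."""
    mean_score = sum(r["score"] for r in results) / len(results)
    # Assumption: runs with no recorded cost are left out of the cost average.
    costs = [r["cost_usd"] for r in results if r["cost_usd"] > 0]
    avg_cost = sum(costs) / len(costs) if costs else 0.0
    avg_duration = sum(r["duration_seconds"] for r in results) / len(results)
    return f"{mean_score:.1%} accuracy | ${avg_cost:.2f} avg cost | {avg_duration:.0f}s avg duration"


print(summarize(results))  # 0.0% accuracy | $17.16 avg cost | 1829s avg duration
```

Under these assumptions the mean score is 0.0013 / 4, which rounds to 0.0%, even though programbench is marked resolved.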

Comparison vs Main

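| Benchmark | Solver | Score | Score vs Main | Cost | Cost vs Main | Duration | Duration vs Main |
| --- | --- | --- | --- | --- | --- | --- | --- |
| featurebench | factory | 0 | = 0% | $26.38 | = $0.00 | 2710s | = 0s |
| swebench | factory | 0 | = 0% | $14.79 | = $0.00 | 1696s | = 0s |
| terminalbench | factory | 0 | = 0% | N/A | N/A | 1708s | = 0s |
| programbench | factory | 0.0013 | +0.0% = | $10.30 | = $0.00 | 1201s | = 0s |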

Baseline: latest main branch run per benchmark+solver. ▲ = improvement, ▼ = regression.
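A minimal sketch of how the markers in the legend could be assigned. It assumes a higher score counts as an improvement while a lower cost or duration counts as an improvement; the source does not state the direction for each metric. `delta_marker` is a hypothetical helper, and the sample numbers in the usage lines are illustrative, not taken from this PR.

```python
def delta_marker(current, baseline, higher_is_better=True):
    """Return '=', '▲' (improvement) or '▼' (regression) for a metric vs. its baseline."""
    if current == baseline:
        return "="
    improved = current > baseline if higher_is_better else current < baseline
    return "▲" if improved else "▼"


# Illustrative values only.
print(delta_marker(0.5, 0.25))                            # score went up
print(delta_marker(1800, 1700, higher_is_better=False))   # duration went up
print(delta_marker(1201, 1201, higher_is_better=False))   # unchanged
```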

How these benchmarks run

Factory solver: Runs `factory ceo . --headless --no-github --prompt <task>`, the full factory loop (research → strategize → build → review). See `benchmarks/run-swebench.sh`.

Claude Code solver: Runs `claude -p <task> --model claude-opus-4-6[1m] --max-turns 200`, a single-shot solve. It uses the same script files as the factory solver, switched via the `--solver` flag.

TerminalBench: Uses the Harbor framework. Factory runs via a custom `factory_harbor_agent.py`; Claude Code uses Harbor's built-in agent.

ProgramBench: Both solvers run inside a Docker cleanroom container. See `benchmarks/run-programbench.sh`.

Config: `claude-opus-4-6[1m]`, effort=XHIGH, thinking=128K tokens. See `benchmarks/lib.sh`.
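The two solver invocations quoted above can be expressed as argument lists. This is an illustrative sketch only: the repository's scripts are shell, and `solver_command` is a hypothetical helper. The solver name `"claude-code"` is an assumption, since the source shows `"factory"` as a solver value but does not list the values accepted by `--solver`.

```python
def solver_command(solver, task):
    """Build the argv list for a solver, using the flags quoted in this comment."""
    if solver == "factory":
        return ["factory", "ceo", ".", "--headless", "--no-github", "--prompt", task]
    if solver == "claude-code":  # assumed name; not confirmed by the source
        return ["claude", "-p", task, "--model", "claude-opus-4-6[1m]", "--max-turns", "200"]
    raise ValueError(f"unknown solver: {solver}")


# The command is only built here, not executed.
print(solver_command("factory", "fix the failing test"))
```

Passing arguments as a list rather than a single string avoids shell quoting problems when the task prompt contains spaces or quotes.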

demo: test full benchmark check flow

c0739c2

osilkin98 closed this Jul 1, 2026

RobotSail wants to merge 1 commit into akashgit:main from RobotSail:demo/benchmark-check
