test: verify benchmark status check on PR by RobotSail · Pull Request #670 · akashgit/remote-factory

RobotSail · 2026-06-21T23:06:24Z

Trivial change to test the new benchmark-eval pending status.

codecov · 2026-06-21T23:08:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.84%. Comparing base (c91f086) to head (06ba3bb).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #670   +/-   ##
=======================================
  Coverage   87.84%   87.84%           
=======================================
  Files          70       70           
  Lines       10636    10636           
=======================================
  Hits         9343     9343           
  Misses       1293     1293
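
The figures in the diff can be sanity-checked by hand: hits plus misses should equal the line count, and hits divided by lines should give the reported percentage. A minimal sketch in Python, using only the numbers shown above (the function name `coverage_percent` is illustrative and not part of Codecov):

```python
def coverage_percent(hits: int, misses: int) -> float:
    """Return line coverage as a percentage, rounded to two decimals."""
    total = hits + misses
    if total == 0:
        raise ValueError("no coverable lines")
    return round(100.0 * hits / total, 2)


# Numbers copied from the coverage diff for this PR.
hits, misses, lines = 9343, 1293, 10636

# Hits and misses should account for every coverable line.
assert hits + misses == lines

print(coverage_percent(hits, misses))  # 87.84
```

Base and head report identical counts, which is consistent with a trivial change that touches no coverable lines.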


github-actions · 2026-06-21T23:10:18Z

Benchmark Results

swebench

| Field | Value |
|---|---|
| Benchmark | swebench |
| Instance | `sympy__sympy-20590` |
| Result | ❌ NOT RESOLVED |
| Score | 0 |
| Duration | 3690s |

Full JSON

{
  "benchmark": "swebench",
  "instance_id": "sympy__sympy-20590",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 3690,
  "status": "success",
  "timestamp": "20260621T232543Z",
  "details": {
    "solver": "factory",
    "cost_usd": 24.93566595,
    "input_tokens": 2445,
    "output_tokens": 131800,
    "cache_read_tokens": 22194327,
    "cache_creation_tokens": 0
  }
}
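
As a rough illustration of how the summary fields relate to this JSON, the sketch below parses the payload and derives the displayed values plus a token total. The `summarize` helper and its output keys are hypothetical and are not taken from the repository's scripts. The payload has `"status": "success"` alongside `"resolved": false`; the sketch assumes `status` reports whether the run completed and `resolved` reports the outcome, which the source does not confirm.

```python
import json

# Payload copied from the benchmark comment above.
RAW = """
{
  "benchmark": "swebench",
  "instance_id": "sympy__sympy-20590",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 3690,
  "status": "success",
  "timestamp": "20260621T232543Z",
  "details": {
    "solver": "factory",
    "cost_usd": 24.93566595,
    "input_tokens": 2445,
    "output_tokens": 131800,
    "cache_read_tokens": 22194327,
    "cache_creation_tokens": 0
  }
}
"""


def summarize(result: dict) -> dict:
    """Derive display fields from one result (hypothetical helper)."""
    details = result["details"]
    total_tokens = sum(
        details[key]
        for key in (
            "input_tokens",
            "output_tokens",
            "cache_read_tokens",
            "cache_creation_tokens",
        )
    )
    return {
        "Benchmark": result["benchmark"],
        "Instance": result["instance_id"],
        "Result": "RESOLVED" if result["resolved"] else "NOT RESOLVED",
        "Score": result["score"],
        "Duration": f'{result["duration_seconds"]}s',
        "Cost": f'${details["cost_usd"]:.2f}',
        "Total tokens": total_tokens,
    }


for field, value in summarize(json.loads(RAW)).items():
    print(f"{field}\t{value}")
```

Cache reads account for nearly all of the tokens in this run (22,194,327 of 22,328,572).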

Comparison vs Baseline

| Benchmark | Solver | Score | Score vs Baseline | Cost | Cost vs Baseline | Duration |
|---|---|---|---|---|---|---|
| swebench | factory | 0 | = 0.0000 | $24.94 | = $0.00 | 3690s |

Baseline: latest main branch run. ▲ = improvement (higher score or lower cost), ▼ = regression.
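
The legend implies a simple rule: a higher score or a lower cost counts as an improvement, and an unchanged value is shown with `=`. The sketch below is one possible way to produce cells in the same format; it is an assumption about the logic, not the workflow's actual code, and the names `delta_marker`, `score_delta`, and `cost_delta` are hypothetical.

```python
def delta_marker(diff: float, higher_is_better: bool) -> str:
    """Return '=', '▲' (improvement), or '▼' (regression) for a difference."""
    if diff == 0:
        return "="
    improved = diff > 0 if higher_is_better else diff < 0
    return "▲" if improved else "▼"


def score_delta(current: float, baseline: float) -> str:
    # Scores are compared at four decimals, matching the '0.0000' in the table.
    diff = round(current - baseline, 4)
    return f"{delta_marker(diff, higher_is_better=True)} {abs(diff):.4f}"


def cost_delta(current: float, baseline: float) -> str:
    # Costs are compared to the cent, so sub-cent noise still reads as '='.
    diff = round(current - baseline, 2)
    return f"{delta_marker(diff, higher_is_better=False)} ${abs(diff):.2f}"


print(score_delta(0, 0))                    # = 0.0000
print(cost_delta(24.93566595, 24.93566595)) # = $0.00
```

The baseline values are not shown in the comment; the second call simply assumes a baseline cost equal to the current run, which is what `= $0.00` suggests.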

How these benchmarks run

Factory solver: Runs `factory ceo . --headless --no-github --prompt <task>`, the full factory loop (research → strategize → build → review). See `benchmarks/run-swebench.sh`.

Claude Code solver: Runs `claude -p <task> --model claude-opus-4-6[1m] --max-turns 200` as a single-shot solve. It uses the same script files as the factory solver, selected via the `--solver` flag.

TerminalBench: Uses the Harbor framework. Factory runs via a custom `factory_harbor_agent.py`; Claude Code uses Harbor's built-in agent.

ProgramBench: Both solvers run inside a Docker cleanroom container. See `benchmarks/run-programbench.sh`.

Config: `claude-opus-4-6[1m]`, effort=XHIGH, thinking=128K tokens. See `benchmarks/lib.sh`.
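
To make the solver switch concrete, here is a hedged sketch of how one script might map a solver name to the two command lines quoted above. The real logic lives in the shell scripts under `benchmarks/` and may differ. The function `solver_command` and the solver key `"claude-code"` are hypothetical; only `"factory"` appears as a solver name in the results.

```python
import shlex

# Model string taken from the Config line above.
MODEL = "claude-opus-4-6[1m]"


def solver_command(solver: str, task: str) -> list[str]:
    """Build the argv for a solver (hypothetical helper, not the repo's code)."""
    if solver == "factory":
        # Full factory loop, as quoted in the comment.
        return ["factory", "ceo", ".", "--headless", "--no-github",
                "--prompt", task]
    if solver == "claude-code":  # assumed key; the source does not name it
        # Single-shot solve, as quoted in the comment.
        return ["claude", "-p", task, "--model", MODEL, "--max-turns", "200"]
    raise ValueError(f"unknown solver: {solver!r}")


# shlex.join quotes arguments so the printed line is safe to paste in a shell.
print(shlex.join(solver_command("factory", "fix the failing test")))
```

Building the command as a list rather than a single string avoids quoting problems when the task prompt contains spaces or shell metacharacters.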

test: trigger benchmark status check

05f9c87

test: re-trigger benchmark status

06ba3bb

RobotSail wants to merge 2 commits into akashgit:main from RobotSail:test-benchmark-status
