demo: test full benchmark-eval check requires all 4 factory benchmarks #675
RobotSail wants to merge 1 commit into
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main     #675   +/-   ##
=======================================
  Coverage   87.84%   87.84%
=======================================
  Files          70       70
  Lines       10636    10636
=======================================
  Hits         9343     9343
  Misses       1293     1293
Benchmark Results

featurebench
Full JSON
{
"benchmark": "featurebench",
"instance_id": "pypa__packaging.013f3b03.test_metadata.e00b5801.lv1",
"solver": "factory",
"passed": 0,
"total": 1,
"score": 0,
"resolved": false,
"duration_seconds": 2710,
"status": "success",
"timestamp": "20260622T010340Z",
"details": {
"pass_rate": 0,
"solver": "factory",
"cost_usd": 26.382671450000004,
"input_tokens": 4524,
"output_tokens": 134807,
"cache_read_tokens": 28296214,
"cache_creation_tokens": 0
}
}

swebench
Full JSON
{
"benchmark": "swebench",
"instance_id": "sympy__sympy-20590",
"solver": "factory",
"passed": 0,
"total": 1,
"score": 0,
"resolved": false,
"duration_seconds": 1696,
"status": "success",
"timestamp": "20260622T010340Z",
"details": {
"solver": "factory",
"cost_usd": 14.787371750000002,
"input_tokens": 1035,
"output_tokens": 59611,
"cache_read_tokens": 12602860,
"cache_creation_tokens": 0
}
}

terminalbench
Full JSON
{
"benchmark": "terminalbench",
"instance_id": "fix-git",
"solver": "factory",
"passed": 0,
"total": 1,
"score": 0,
"resolved": false,
"duration_seconds": 1708,
"status": "failed",
"timestamp": "20260622T010344Z",
"details": {
"solver": "factory",
"cost_usd": 0,
"input_tokens": 0,
"output_tokens": 0,
"cache_read_tokens": 0,
"cache_creation_tokens": 0
}
}

programbench
Full JSON
{
"benchmark": "programbench",
"instance_id": "abishekvashok__cmatrix.5c082c6",
"solver": "factory",
"passed": 1,
"total": 769,
"score": 0.0013,
"resolved": true,
"duration_seconds": 1201,
"status": "success",
"timestamp": "20260622T010345Z",
"details": {
"solver": "factory",
"cost_usd": 10.304329,
"input_tokens": 210,
"output_tokens": 47047,
"cache_read_tokens": 12545658,
"cache_creation_tokens": 0
}
}

Overall: 0.0% accuracy (= +0.0% vs main) | $17.16 avg cost | 1829s avg duration

Comparison vs Main
Baseline: latest main branch run per benchmark+solver. ▲ = improvement, ▼ = regression.

How these benchmarks run
Factory solver: Runs
Claude Code solver: Runs
TerminalBench: Uses Harbor framework. Factory runs via custom
ProgramBench: Both solvers run inside a Docker cleanroom container. See
Config:
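
For anyone checking the "Overall" line against the four JSON payloads: the sketch below reproduces it under one reading of the aggregation. This is an assumption, not something the bot comment states: accuracy is taken as the mean of `score`, average cost is taken over runs that reported a nonzero `cost_usd` (which excludes the failed terminalbench run), and average duration is taken over all four runs. The function name `summarize` is hypothetical.

```python
# Fields copied from the four JSON payloads in this comment.
runs = [
    {"benchmark": "featurebench", "score": 0, "duration_seconds": 2710,
     "cost_usd": 26.382671450000004},
    {"benchmark": "swebench", "score": 0, "duration_seconds": 1696,
     "cost_usd": 14.787371750000002},
    {"benchmark": "terminalbench", "score": 0, "duration_seconds": 1708,
     "cost_usd": 0},
    {"benchmark": "programbench", "score": 0.0013, "duration_seconds": 1201,
     "cost_usd": 10.304329},
]


def summarize(runs):
    """Aggregate per-benchmark records into a one-line summary.

    Assumed rules (not confirmed by the source):
      - accuracy = mean of `score` over all runs
      - avg cost = mean of `cost_usd` over runs with nonzero cost
      - avg duration = mean of `duration_seconds` over all runs
    """
    accuracy = sum(r["score"] for r in runs) / len(runs)
    costs = [r["cost_usd"] for r in runs if r["cost_usd"] > 0]
    avg_cost = sum(costs) / len(costs) if costs else 0.0
    avg_duration = sum(r["duration_seconds"] for r in runs) / len(runs)
    return (
        f"Overall: {accuracy:.1%} accuracy | "
        f"${avg_cost:.2f} avg cost | {avg_duration:.0f}s avg duration"
    )


print(summarize(runs))
```

Averaging cost over all four runs instead would give about $12.87, so the zero-cost failed run appears to be left out of that figure.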
Test PR to verify the benchmark-eval status check requires all 4 factory benchmarks to run before flipping to success.
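
A minimal sketch of the gating rule this PR is meant to exercise, for illustration only. The function name `benchmark_eval_state`, the state strings, and the choice to count a result as "run" regardless of its `status` field are assumptions; the actual workflow logic is not shown in this PR.

```python
# The four benchmarks that appear in the results comment.
REQUIRED_BENCHMARKS = {"featurebench", "swebench", "terminalbench", "programbench"}


def benchmark_eval_state(results, solver="factory"):
    """Return (state, missing) for the benchmark-eval status check.

    Hypothetical rule: the check stays "pending" until every required
    benchmark has reported a result for the given solver, then flips
    to "success". A reported result counts even if its own status is
    "failed" (an assumption).
    """
    reported = {r["benchmark"] for r in results if r.get("solver") == solver}
    missing = sorted(REQUIRED_BENCHMARKS - reported)
    return ("success", []) if not missing else ("pending", missing)


# Three of four results in: the check should not flip yet.
partial = [
    {"benchmark": "featurebench", "solver": "factory", "status": "success"},
    {"benchmark": "swebench", "solver": "factory", "status": "success"},
    {"benchmark": "programbench", "solver": "factory", "status": "success"},
]
print(benchmark_eval_state(partial))

# All four reported, including the failed terminalbench run.
complete = partial + [
    {"benchmark": "terminalbench", "solver": "factory", "status": "failed"},
]
print(benchmark_eval_state(complete))
```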