test: verify benchmark status check on PR#670
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #670   +/-   ##
=======================================
  Coverage   87.84%   87.84%
=======================================
  Files          70       70
  Lines       10636    10636
=======================================
  Hits         9343     9343
  Misses       1293     1293
☔ View full report in Codecov by Harness.
Benchmark Results
swebench
Full JSON
{
  "benchmark": "swebench",
  "instance_id": "sympy__sympy-20590",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 3690,
  "status": "success",
  "timestamp": "20260621T232543Z",
  "details": {
    "solver": "factory",
    "cost_usd": 24.93566595,
    "input_tokens": 2445,
    "output_tokens": 131800,
    "cache_read_tokens": 22194327,
    "cache_creation_tokens": 0
  }
}
Comparison vs Baseline
Baseline: latest main branch run. ▲ = improvement (higher score or lower cost), ▼ = regression.
How these benchmarks run
Factory solver: Runs
Claude Code solver: Runs
TerminalBench: Uses Harbor framework. Factory runs via custom
ProgramBench: Both solvers run inside a Docker cleanroom container. See
Config:
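The ▲/▼ legend above can be illustrated with a minimal sketch. This is not the repository's actual comparison code: the function name `trend_marker`, the empty marker for unchanged or missing baselines, and the baseline numbers are all assumptions made for illustration. Only the current-run values come from the JSON above.

```python
from typing import Optional


def trend_marker(current: float, baseline: Optional[float], higher_is_better: bool) -> str:
    """Return "▲" for an improvement, "▼" for a regression.

    Assumption: an unchanged value or a missing baseline yields an empty
    string; the legend above does not say how those cases are rendered.
    """
    if baseline is None or current == baseline:
        return ""
    if higher_is_better:
        improved = current > baseline
    else:
        improved = current < baseline
    return "▲" if improved else "▼"


# Current values are taken from the benchmark JSON in this comment.
current = {"score": 0, "cost_usd": 24.93566595}
# Hypothetical baseline, invented purely to exercise the function.
baseline = {"score": 1, "cost_usd": 30.0}

markers = {
    # Higher score is better.
    "score": trend_marker(current["score"], baseline["score"], higher_is_better=True),
    # Lower cost is better.
    "cost_usd": trend_marker(current["cost_usd"], baseline["cost_usd"], higher_is_better=False),
}
print(markers)
```

With the hypothetical baseline shown, the score would be marked as a regression and the cost as an improvement, which matches the legend's "higher score or lower cost" rule.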
Trivial change to test the new benchmark-eval pending status.