
Commit 6783638

LoCoBench Bot and claude committed
feat: [US-012] - Generate per-task LLM judge context files
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 0edff0c commit 6783638

File tree

4 files changed: +516 -2 lines changed


ralph/prd.json

Lines changed: 2 additions & 2 deletions
@@ -192,7 +192,7 @@
     "Test by running on existing official runs and verifying REPORT.md is generated with LoCoBench data"
   ],
   "priority": 11,
-  "passes": false,
+  "passes": true,
   "notes": "This is the deterministic-only report. LLM judge evaluation is a separate step that consumes context files generated by US-012. CLI invocation: python3 scripts/generate_eval_report.py --runs-dir ~/evals/custom_agents/agents/claudecode/runs/official/ --output-dir ./eval_reports/. No external dependencies — stdlib only (json, dataclasses, pathlib, statistics, csv, datetime, argparse). The script must be runnable from the CodeContextBench repo root with plain python3."
 },
 {
@@ -209,7 +209,7 @@
     "Also callable as CLI: python3 -m scripts.ccb_metrics.judge_context --runs-dir <path> --benchmarks-dir ./benchmarks/ --output-dir ./judge_contexts/"
   ],
   "priority": 12,
-  "passes": false,
+  "passes": true,
   "notes": "CLI invocation: python3 -m scripts.ccb_metrics.judge_context --runs-dir ~/evals/custom_agents/agents/claudecode/runs/official/ --benchmarks-dir ./benchmarks/ --output-dir ./judge_contexts/. Also importable: from scripts.ccb_metrics.judge_context import generate_judge_contexts. Key mapping: task_id in run dirs maps to benchmark dirs by prefix (e.g., 'pkg-doc-001' matches benchmarks/kubernetes_docs/pkg-doc-001/). For LoCoBench, task dirs under benchmarks/locobench_agent/tasks/ match by full task name. For SWE-bench, instruction comes from Harbor's dataset, not local files."
 },
 {
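The task_id-to-benchmark mapping described in the notes above can be sketched as follows. This is a hypothetical reading of that mapping, not the committed implementation: the helper name `resolve_benchmark_dir` and the exact directory-name matching (plus the deeper `tasks/` layout for LoCoBench) are assumptions drawn from the notes.

```python
# Hypothetical sketch of the task_id -> benchmark-dir mapping from the notes
# above. resolve_benchmark_dir and its matching rules are assumptions, not
# the actual judge_context.py implementation.
from pathlib import Path
from typing import Optional

def resolve_benchmark_dir(task_id: str, benchmarks_dir: Path) -> Optional[Path]:
    """Map a run's task_id to its benchmark task directory.

    K8s Docs / BigCode tasks match a directory named exactly task_id under a
    benchmark suite (e.g. 'pkg-doc-001' -> benchmarks/kubernetes_docs/pkg-doc-001/).
    LoCoBench task dirs sit one level deeper, under <suite>/tasks/<task_id>/.
    """
    for candidate in benchmarks_dir.glob("*/" + task_id):
        if candidate.is_dir():
            return candidate
    for candidate in benchmarks_dir.glob("*/tasks/" + task_id):
        if candidate.is_dir():
            return candidate
    # SWE-bench Pro has no local task dir: instructions come from Harbor.
    return None
```

A `None` result signals the caller to fall back to the Harbor dataset for instructions, matching the SWE-bench note above.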

ralph/progress.txt

Lines changed: 40 additions & 0 deletions
@@ -183,3 +183,43 @@ Started: 2026-02-01
 - SWE-bench tasks all have reward=0.0 but partial_score varies (0.0 to 0.9) — partial_score is essential for meaningful comparison
 - Trajectory files may not exist in all runs (missing in BigCode); transcript fallback is critical for tool usage
 ---
+
+## 2026-02-01 - US-011
+- Created `scripts/generate_eval_report.py` — CLI entry point for deterministic evaluation report generation
+- Accepts `--runs-dir`, `--output-dir`, `--csv`/`--no-csv` arguments; `--help` prints full usage
+- Calls `discover_runs()` and wraps results in `EvalReport`, writes `eval_report.json`
+- Generates `REPORT.md` with 6 tables: Run Inventory, Aggregate Performance, Per-Benchmark Breakdown, Efficiency, Tool Utilization, SWE-Bench Pro Partial Scores
+- Writes CSV files (one per table) for downstream analysis
+- Prints summary to stdout: benchmarks, configs, total tasks, pass rates per config
+- Tested on real data: discovered 3 benchmarks, 4 configs, 205 tasks, 7 runs
+- Files changed: `scripts/generate_eval_report.py` (new)
+- **Learnings for future iterations:**
+  - The `eval_reports/` directory is generated output — consider adding to .gitignore
+  - SWE-bench partial scores table is conditional — only generated when SWE-bench data exists
+  - deepsearch_hybrid and sourcegraph_hybrid have very few tasks (1 and 4) in current data — aggregates are unreliable for these
+  - The `_safe_mean()` helper in models.py is also needed in the report generator — duplicated as local helper to avoid modifying models.py
+---
+
+## 2026-02-01 - US-012
+- Created `scripts/ccb_metrics/judge_context.py` with `generate_judge_contexts()` function and CLI entry point
+- Created `scripts/__init__.py` to enable `python3 -m scripts.ccb_metrics.judge_context` invocation
+- For each task in each run, generates a JSON file at `output_dir/<benchmark>/<config>/<task_id>_judge_context.json`
+- Each context file contains: task_id, benchmark, config_name, model, reward, partial_score, task_instructions (from benchmark instruction.md), agent_transcript_summary (first 200 + last 100 lines), agent_output (solution.md or last assistant message), ground_truth, tool_usage_summary (with top 5 tools), code_changes (from Edit/Write tool calls), verifier_output, run_metadata
+- Generates `judge_contexts_index.json` listing all generated context files
+- CLI: `python3 -m scripts.ccb_metrics.judge_context --runs-dir <path> --benchmarks-dir ./benchmarks/ --output-dir ./judge_contexts/`
+- Also importable: `from scripts.ccb_metrics.judge_context import generate_judge_contexts`
+- Handles missing files gracefully — fields are null when source data unavailable
+- Deduplicates tasks across multiple batches (keeps latest)
+- Files changed: `scripts/ccb_metrics/judge_context.py` (new), `scripts/__init__.py` (new)
+- **Tested on real data:** 205 judge context files generated across 3 benchmarks and 4 configs
+- **Learnings for future iterations:**
+  - LoCoBench task_ids map directly to `benchmarks/locobench_agent/tasks/<task_id>/`
+  - BigCode task_ids map directly to `benchmarks/big_code_mcp/<task_id>/`
+  - K8s Docs task_ids map to `benchmarks/kubernetes_docs/<task_id>/`
+  - SWE-bench Pro has no local instruction.md — instructions come from Harbor dataset
+  - LoCoBench "ground truth" is in `solution/` subdirectory (not `ground_truth/`)
+  - BigCode also uses `solution/` for ground truth
+  - K8s Docs uses `ground_truth/` for ground truth
+  - Agent output is in `agent/solution.md` when available, else last assistant text from transcript
+  - `scripts/__init__.py` is needed for `python3 -m scripts.ccb_metrics.judge_context` to work
+---
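The "deduplicates tasks across multiple batches (keeps latest)" step above can be sketched as a last-write-wins merge. This is a hypothetical illustration: the `dedupe_tasks` name, the dict-shaped task records, and the assumption that batches arrive ordered oldest-to-newest are not taken from the committed code.

```python
# Hypothetical sketch of the keep-latest dedup step noted above; record shape
# and batch ordering (oldest -> newest) are assumptions.
from typing import Dict, Iterable, List

def dedupe_tasks(batches: Iterable[List[dict]]) -> List[dict]:
    """Merge task records from successive run batches, keeping only the
    latest occurrence of each task_id (later batches overwrite earlier)."""
    latest: Dict[str, dict] = {}
    for batch in batches:
        for task in batch:
            latest[task["task_id"]] = task
    return list(latest.values())
```

Keying by `task_id` makes the overwrite explicit and keeps the merge O(total tasks), which matters little at 205 tasks but keeps re-runs idempotent.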

ralph/scripts/__init__.py

Whitespace-only changes.
