
Commit 6783638

LoCoBench Bot and claude committed
feat: [US-012] - Generate per-task LLM judge context files
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 0edff0c commit 6783638

File tree

4 files changed: +516 -2 lines changed


ralph/prd.json

Lines changed: 2 additions & 2 deletions
@@ -192,7 +192,7 @@
     "Test by running on existing official runs and verifying REPORT.md is generated with LoCoBench data"
   ],
   "priority": 11,
-  "passes": false,
+  "passes": true,
   "notes": "This is the deterministic-only report. LLM judge evaluation is a separate step that consumes context files generated by US-012. CLI invocation: python3 scripts/generate_eval_report.py --runs-dir ~/evals/custom_agents/agents/claudecode/runs/official/ --output-dir ./eval_reports/. No external dependencies — stdlib only (json, dataclasses, pathlib, statistics, csv, datetime, argparse). The script must be runnable from the CodeContextBench repo root with plain python3."
 },
 {
@@ -209,7 +209,7 @@
     "Also callable as CLI: python3 -m scripts.ccb_metrics.judge_context --runs-dir <path> --benchmarks-dir ./benchmarks/ --output-dir ./judge_contexts/"
   ],
   "priority": 12,
-  "passes": false,
+  "passes": true,
   "notes": "CLI invocation: python3 -m scripts.ccb_metrics.judge_context --runs-dir ~/evals/custom_agents/agents/claudecode/runs/official/ --benchmarks-dir ./benchmarks/ --output-dir ./judge_contexts/. Also importable: from scripts.ccb_metrics.judge_context import generate_judge_contexts. Key mapping: task_id in run dirs maps to benchmark dirs by prefix (e.g., 'pkg-doc-001' matches benchmarks/kubernetes_docs/pkg-doc-001/). For LoCoBench, task dirs under benchmarks/locobench_agent/tasks/ match by full task name. For SWE-bench, instruction comes from Harbor's dataset, not local files."
 },
 {
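The task_id-to-benchmark mapping described in the notes above can be sketched as follows. This is a hypothetical reading of that mapping, not the committed implementation: the helper name `resolve_benchmark_dir` and the exact directory-name matching (plus the deeper `tasks/` layout for LoCoBench) are assumptions drawn from the notes.

```python
# Hypothetical sketch of the task_id -> benchmark-dir mapping from the notes
# above. resolve_benchmark_dir and its matching rules are assumptions, not
# the actual judge_context.py implementation.
from pathlib import Path
from typing import Optional

def resolve_benchmark_dir(task_id: str, benchmarks_dir: Path) -> Optional[Path]:
    """Map a run's task_id to its benchmark task directory.

    K8s Docs / BigCode tasks match a directory named exactly task_id under a
    benchmark suite (e.g. 'pkg-doc-001' -> benchmarks/kubernetes_docs/pkg-doc-001/).
    LoCoBench task dirs sit one level deeper, under <suite>/tasks/<task_id>/.
    """
    for candidate in benchmarks_dir.glob("*/" + task_id):
        if candidate.is_dir():
            return candidate
    for candidate in benchmarks_dir.glob("*/tasks/" + task_id):
        if candidate.is_dir():
            return candidate
    # SWE-bench Pro has no local task dir: instructions come from Harbor.
    return None
```

A `None` result signals the caller to fall back to the Harbor dataset for instructions, matching the SWE-bench note above.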

ralph/progress.txt

Lines changed: 40 additions & 0 deletions
@@ -183,3 +183,43 @@ Started: 2026-02-01
 - SWE-bench tasks all have reward=0.0 but partial_score varies (0.0 to 0.9) — partial_score is essential for meaningful comparison
 - Trajectory files may not exist in all runs (missing in BigCode); transcript fallback is critical for tool usage
 ---
+
+## 2026-02-01 - US-011
+- Created `scripts/generate_eval_report.py` — CLI entry point for deterministic evaluation report generation
+- Accepts `--runs-dir`, `--output-dir`, `--csv`/`--no-csv` arguments; `--help` prints full usage
+- Calls `discover_runs()` and wraps results in `EvalReport`, writes `eval_report.json`
+- Generates `REPORT.md` with 6 tables: Run Inventory, Aggregate Performance, Per-Benchmark Breakdown, Efficiency, Tool Utilization, SWE-Bench Pro Partial Scores
+- Writes CSV files (one per table) for downstream analysis
+- Prints summary to stdout: benchmarks, configs, total tasks, pass rates per config
+- Tested on real data: discovered 3 benchmarks, 4 configs, 205 tasks, 7 runs
+- Files changed: `scripts/generate_eval_report.py` (new)
+- **Learnings for future iterations:**
+  - The `eval_reports/` directory is generated output — consider adding to .gitignore
+  - SWE-bench partial scores table is conditional — only generated when SWE-bench data exists
+  - deepsearch_hybrid and sourcegraph_hybrid have very few tasks (1 and 4) in current data — aggregates are unreliable for these
+  - The `_safe_mean()` helper in models.py is also needed in the report generator — duplicated as local helper to avoid modifying models.py
+---
+
+## 2026-02-01 - US-012
+- Created `scripts/ccb_metrics/judge_context.py` with `generate_judge_contexts()` function and CLI entry point
+- Created `scripts/__init__.py` to enable `python3 -m scripts.ccb_metrics.judge_context` invocation
+- For each task in each run, generates a JSON file at `output_dir/<benchmark>/<config>/<task_id>_judge_context.json`
+- Each context file contains: task_id, benchmark, config_name, model, reward, partial_score, task_instructions (from benchmark instruction.md), agent_transcript_summary (first 200 + last 100 lines), agent_output (solution.md or last assistant message), ground_truth, tool_usage_summary (with top 5 tools), code_changes (from Edit/Write tool calls), verifier_output, run_metadata
+- Generates `judge_contexts_index.json` listing all generated context files
+- CLI: `python3 -m scripts.ccb_metrics.judge_context --runs-dir <path> --benchmarks-dir ./benchmarks/ --output-dir ./judge_contexts/`
+- Also importable: `from scripts.ccb_metrics.judge_context import generate_judge_contexts`
+- Handles missing files gracefully — fields are null when source data unavailable
+- Deduplicates tasks across multiple batches (keeps latest)
+- Files changed: `scripts/ccb_metrics/judge_context.py` (new), `scripts/__init__.py` (new)
+- **Tested on real data:** 205 judge context files generated across 3 benchmarks and 4 configs
+- **Learnings for future iterations:**
+  - LoCoBench task_ids map directly to `benchmarks/locobench_agent/tasks/<task_id>/`
+  - BigCode task_ids map directly to `benchmarks/big_code_mcp/<task_id>/`
+  - K8s Docs task_ids map to `benchmarks/kubernetes_docs/<task_id>/`
+  - SWE-bench Pro has no local instruction.md — instructions come from Harbor dataset
+  - LoCoBench "ground truth" is in `solution/` subdirectory (not `ground_truth/`)
+  - BigCode also uses `solution/` for ground truth
+  - K8s Docs uses `ground_truth/` for ground truth
+  - Agent output is in `agent/solution.md` when available, else last assistant text from transcript
+  - `scripts/__init__.py` is needed for `python3 -m scripts.ccb_metrics.judge_context` to work
+---
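The "deduplicates tasks across multiple batches (keeps latest)" step above can be sketched as a last-write-wins merge. This is a hypothetical illustration: the `dedupe_tasks` name, the dict-shaped task records, and the assumption that batches arrive ordered oldest-to-newest are not taken from the committed code.

```python
# Hypothetical sketch of the keep-latest dedup step noted above; record shape
# and batch ordering (oldest -> newest) are assumptions.
from typing import Dict, Iterable, List

def dedupe_tasks(batches: Iterable[List[dict]]) -> List[dict]:
    """Merge task records from successive run batches, keeping only the
    latest occurrence of each task_id (later batches overwrite earlier)."""
    latest: Dict[str, dict] = {}
    for batch in batches:
        for task in batch:
            latest[task["task_id"]] = task
    return list(latest.values())
```

Keying by `task_id` makes the overwrite explicit and keeps the merge O(total tasks), which matters little at 205 tasks but keeps re-runs idempotent.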

ralph/scripts/__init__.py

Whitespace-only changes.
