
Commit 9d385e4

sjarmak and claude committed

feat: close out ralph-eval — all 15 stories passing

Fix US-011 scatter data gap: add task_id/file_recall/reward scatter points to `_compute_per_suite_correlation()` JSON output in ir_analysis.py.

Update PRD to mark US-011 through US-014 as passes:true — all were implemented in prior commits, but the PRD was never updated during merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent c9279ab commit 9d385e4

File tree

3 files changed: +27 −6 lines changed


ralph-eval/prd.json

Lines changed: 4 additions & 4 deletions

```diff
@@ -203,7 +203,7 @@
   "python3 scripts/ir_analysis.py runs without errors (existing behavior preserved when no new flags used)"
 ],
 "priority": 11,
-"passes": false,
+"passes": true,
 "notes": "Depends on US-008 (statistics module). Confidence data comes from ground_truth_files.json which already has a confidence field per task. Load it alongside the GT registry."
 },
 {
@@ -221,7 +221,7 @@
   "Debug capture adds < 2 seconds to verifier execution"
 ],
 "priority": 12,
-"passes": false,
+"passes": true,
 "notes": "Independent of judge stories. Find verifier_lib.sh location with: grep -r 'verifier_lib' benchmarks/ — it may be in benchmarks/lib/ or copied into each task's tests/. The env filter should use grep -vE 'KEY|TOKEN|SECRET|PASSWORD' to avoid leaking credentials."
 },
 {
@@ -237,7 +237,7 @@
   "python3 scripts/ccb_metrics/ground_truth.py runs without errors and newly extracted defect_type fields appear in output"
 ],
 "priority": 13,
-"passes": false,
+"passes": true,
 "notes": "Find code review tasks: grep -rl 'expected_defects' benchmarks/ccb_test/. Defect types should be inferred from the defect descriptions in each file. This is a data enrichment story, not a scoring change."
 },
 {
@@ -254,7 +254,7 @@
   "A README.md in docs/ or inline in the fixture dirs explains the format for future task authors"
 ],
 "priority": 14,
-"passes": false,
+"passes": true,
 "notes": "Requires reading each verifier type's test.sh to understand what inputs it expects. For IR_checklist: needs solution.md with file paths. For checklist/pattern: needs code changes matching keyword patterns. For diff_similarity: needs a git diff matching expected.diff. For find_and_prove: needs a test file. For similarity: needs analysis output matching keywords."
 },
 {
```
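The credential filter described in the US-012 notes (`grep -vE 'KEY|TOKEN|SECRET|PASSWORD'`) can be mirrored in Python for testing the same idea. This is only an illustrative sketch: the variable names below are made up, and unlike the shell version (which matches the whole `NAME=value` line), this filters on the variable name alone.

```python
import re

# Same alternation the US-012 notes use with grep -vE.
CREDENTIAL_RE = re.compile(r"KEY|TOKEN|SECRET|PASSWORD")

def filtered_env(environ: dict[str, str]) -> dict[str, str]:
    """Drop any variable whose name looks like a credential."""
    return {k: v for k, v in environ.items() if not CREDENTIAL_RE.search(k)}

# Hypothetical environment for illustration (not real credentials).
sample = {"API_TOKEN": "abc123", "BUILD_DIR": "/tmp/build", "AWS_SECRET": "x"}
print(filtered_env(sample))  # only BUILD_DIR survives the filter
```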

ralph-eval/progress.txt

Lines changed: 13 additions & 0 deletions

```diff
@@ -228,3 +228,16 @@
 - Exit code 1 triggered by `fixture_score_mismatch` CRITICAL issues; non-idempotent is WARNING (non-blocking)
 - aspnetcore-code-review-001 fixture test: perfect_input=1.00, empty_input=0.00 PASS
 ---
+
+## 2026-02-20T - Final closeout (US-011 through US-014)
+- Verified all 4 remaining stories were already implemented (commits on main since merge of ralph/unified-eval-package)
+- Fixed US-011 scatter data gap: added `scatter` key with `{task_id, file_recall, reward}` entries to `_compute_per_suite_correlation()` output in ir_analysis.py
+- US-012 DEBUG_MODE passthrough via `export` (Harbor inherits) — functionally equivalent to explicit `-e` flag
+- US-013: all 8 code review tasks have defect_type annotations (40 defect entries total)
+- US-014: 5 fixture directories covering all 5 verifier types confirmed present
+- Updated PRD: all 15 stories now `passes: true`
+- **Learnings for future iterations:**
+  - Ralph branch merge to main doesn't auto-update PRD `passes` fields — update PRD in same commit as implementation
+  - `retrieval_outcome_correlation()` in statistics.py only returns rho/p_value/effect_size — caller must add scatter data from its local context
+  - Harbor forwards exported env vars to containers implicitly — no explicit `-e` needed
+---
```

scripts/ir_analysis.py

Lines changed: 10 additions & 2 deletions

```diff
@@ -938,10 +938,11 @@ def _compute_per_suite_correlation(
     if not manifest_rewards:
         return None

-    # Build parallel lists: ir_score, reward, suite_label
+    # Build parallel lists: ir_score, reward, suite_label, task_id
     ir_vals: list[float] = []
     reward_vals: list[float] = []
     suite_labels: list[str] = []
+    task_ids: list[str] = []

     for s in ir_scores:
         reward = manifest_rewards.get((s.task_id, s.config_name))
@@ -951,13 +952,20 @@ def _compute_per_suite_correlation(
         ir_vals.append(s.file_recall)
         reward_vals.append(reward)
         suite_labels.append(suite)
+        task_ids.append(s.task_id)

     if len(ir_vals) < 3:
         return None

     try:
         from ccb_metrics.statistics import retrieval_outcome_correlation
-        return retrieval_outcome_correlation(ir_vals, reward_vals, suite_labels)
+        result = retrieval_outcome_correlation(ir_vals, reward_vals, suite_labels)
+        # Add scatter data for JSON output (US-011 criterion 7)
+        result["scatter"] = [
+            {"task_id": task_ids[i], "file_recall": ir_vals[i], "reward": reward_vals[i]}
+            for i in range(len(ir_vals))
+        ]
+        return result
     except ImportError:
         return None
```
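The index-based scatter construction in the diff above can equivalently be written with `zip` over the three parallel lists. A standalone sketch with made-up values, showing the JSON shape the new `scatter` key carries:

```python
import json

# Hypothetical parallel lists, mirroring the ones built inside
# _compute_per_suite_correlation() (all values invented for illustration).
task_ids = ["task-001", "task-002"]
ir_vals = [0.75, 1.0]     # file_recall per task
reward_vals = [0.4, 0.9]  # outcome reward per task

# zip keeps the three lists aligned without manual indexing.
scatter = [
    {"task_id": t, "file_recall": fr, "reward": rw}
    for t, fr, rw in zip(task_ids, ir_vals, reward_vals)
]
print(json.dumps(scatter, indent=2))
```

Each entry pairs one task's retrieval score with its reward, which is exactly what a downstream scatter plot of file_recall vs. reward needs.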