
Commit 9d385e4

sjarmak and claude committed

feat: close out ralph-eval — all 15 stories passing

Fix US-011 scatter data gap: add task_id/file_recall/reward scatter points to `_compute_per_suite_correlation()` JSON output in ir_analysis.py.

Update PRD to mark US-011 through US-014 as passes:true — all were implemented in prior commits, but the PRD was never updated during merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent c9279ab commit 9d385e4

File tree

3 files changed: +27 −6 lines changed


ralph-eval/prd.json

Lines changed: 4 additions & 4 deletions

```diff
@@ -203,7 +203,7 @@
   "python3 scripts/ir_analysis.py runs without errors (existing behavior preserved when no new flags used)"
 ],
 "priority": 11,
-"passes": false,
+"passes": true,
 "notes": "Depends on US-008 (statistics module). Confidence data comes from ground_truth_files.json which already has a confidence field per task. Load it alongside the GT registry."
 },
 {
@@ -221,7 +221,7 @@
   "Debug capture adds < 2 seconds to verifier execution"
 ],
 "priority": 12,
-"passes": false,
+"passes": true,
 "notes": "Independent of judge stories. Find verifier_lib.sh location with: grep -r 'verifier_lib' benchmarks/ — it may be in benchmarks/lib/ or copied into each task's tests/. The env filter should use grep -vE 'KEY|TOKEN|SECRET|PASSWORD' to avoid leaking credentials."
 },
 {
@@ -237,7 +237,7 @@
   "python3 scripts/ccb_metrics/ground_truth.py runs without errors and newly extracted defect_type fields appear in output"
 ],
 "priority": 13,
-"passes": false,
+"passes": true,
 "notes": "Find code review tasks: grep -rl 'expected_defects' benchmarks/ccb_test/. Defect types should be inferred from the defect descriptions in each file. This is a data enrichment story, not a scoring change."
 },
 {
@@ -254,7 +254,7 @@
   "A README.md in docs/ or inline in the fixture dirs explains the format for future task authors"
 ],
 "priority": 14,
-"passes": false,
+"passes": true,
 "notes": "Requires reading each verifier type's test.sh to understand what inputs it expects. For IR_checklist: needs solution.md with file paths. For checklist/pattern: needs code changes matching keyword patterns. For diff_similarity: needs a git diff matching expected.diff. For find_and_prove: needs a test file. For similarity: needs analysis output matching keywords."
 },
 {
```
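The credential filter described in the US-012 notes (`grep -vE 'KEY|TOKEN|SECRET|PASSWORD'`) can be mirrored in Python for testing the same idea. This is only an illustrative sketch: the variable names below are made up, and unlike the shell version (which matches the whole `NAME=value` line), this filters on the variable name alone.

```python
import re

# Same alternation the US-012 notes use with grep -vE.
CREDENTIAL_RE = re.compile(r"KEY|TOKEN|SECRET|PASSWORD")

def filtered_env(environ: dict[str, str]) -> dict[str, str]:
    """Drop any variable whose name looks like a credential."""
    return {k: v for k, v in environ.items() if not CREDENTIAL_RE.search(k)}

# Hypothetical environment for illustration (not real credentials).
sample = {"API_TOKEN": "abc123", "BUILD_DIR": "/tmp/build", "AWS_SECRET": "x"}
print(filtered_env(sample))  # only BUILD_DIR survives the filter
```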

ralph-eval/progress.txt

Lines changed: 13 additions & 0 deletions

```diff
@@ -228,3 +228,16 @@
 - Exit code 1 triggered by `fixture_score_mismatch` CRITICAL issues; non-idempotent is WARNING (non-blocking)
 - aspnetcore-code-review-001 fixture test: perfect_input=1.00, empty_input=0.00 PASS
 ---
+
+## 2026-02-20T - Final closeout (US-011 through US-014)
+- Verified all 4 remaining stories were already implemented (commits on main since merge of ralph/unified-eval-package)
+- Fixed US-011 scatter data gap: added `scatter` key with `{task_id, file_recall, reward}` entries to `_compute_per_suite_correlation()` output in ir_analysis.py
+- US-012 DEBUG_MODE passthrough via `export` (Harbor inherits) — functionally equivalent to explicit `-e` flag
+- US-013: all 8 code review tasks have defect_type annotations (40 defect entries total)
+- US-014: 5 fixture directories covering all 5 verifier types confirmed present
+- Updated PRD: all 15 stories now `passes: true`
+- **Learnings for future iterations:**
+  - Ralph branch merge to main doesn't auto-update PRD `passes` fields — update PRD in same commit as implementation
+  - `retrieval_outcome_correlation()` in statistics.py only returns rho/p_value/effect_size — caller must add scatter data from its local context
+  - Harbor forwards exported env vars to containers implicitly — no explicit `-e` needed
+---
```

scripts/ir_analysis.py

Lines changed: 10 additions & 2 deletions

```diff
@@ -938,10 +938,11 @@ def _compute_per_suite_correlation(
     if not manifest_rewards:
         return None

-    # Build parallel lists: ir_score, reward, suite_label
+    # Build parallel lists: ir_score, reward, suite_label, task_id
     ir_vals: list[float] = []
     reward_vals: list[float] = []
     suite_labels: list[str] = []
+    task_ids: list[str] = []

     for s in ir_scores:
         reward = manifest_rewards.get((s.task_id, s.config_name))
@@ -951,13 +952,20 @@ def _compute_per_suite_correlation(
         ir_vals.append(s.file_recall)
         reward_vals.append(reward)
         suite_labels.append(suite)
+        task_ids.append(s.task_id)

     if len(ir_vals) < 3:
         return None

     try:
         from ccb_metrics.statistics import retrieval_outcome_correlation
-        return retrieval_outcome_correlation(ir_vals, reward_vals, suite_labels)
+        result = retrieval_outcome_correlation(ir_vals, reward_vals, suite_labels)
+        # Add scatter data for JSON output (US-011 criterion 7)
+        result["scatter"] = [
+            {"task_id": task_ids[i], "file_recall": ir_vals[i], "reward": reward_vals[i]}
+            for i in range(len(ir_vals))
+        ]
+        return result
     except ImportError:
         return None
```
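The index-based scatter construction in the diff above can equivalently be written with `zip` over the three parallel lists. A standalone sketch with made-up values, showing the JSON shape the new `scatter` key carries:

```python
import json

# Hypothetical parallel lists, mirroring the ones built inside
# _compute_per_suite_correlation() (all values invented for illustration).
task_ids = ["task-001", "task-002"]
ir_vals = [0.75, 1.0]     # file_recall per task
reward_vals = [0.4, 0.9]  # outcome reward per task

# zip keeps the three lists aligned without manual indexing.
scatter = [
    {"task_id": t, "file_recall": fr, "reward": rw}
    for t, fr, rw in zip(task_ids, ir_vals, reward_vals)
]
print(json.dumps(scatter, indent=2))
```

Each entry pairs one task's retrieval score with its reward, which is exactly what a downstream scatter plot of file_recall vs. reward needs.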