docs: add learnings from Mar 18 JSONL sessions + nightly report #13

sjarmak · sjarmak · commit 7a60bf60c1ab · 2026-03-19T22:43:12.000-04:00
Mar 18 sessions reviewed (all workflow sessions):
- 643ba935: learnings extraction (Mar 17 sessions, already captured)
- d3d75c4a: nightly report #12 generation
- 01f568ae: PRD compound-2026-03-18.md (task metadata + DuckDB)
- b6b0c311: Ralph prd.json conversion (active PRD: task-metadata-duckdb)

New learnings added to ROOT_AGENT_GUIDE.md:
- claude_baseline_agent.py:31 LOCOBENCH_CLAUDE_MD_TEMPLATE hardcodes path
- generate_eval_report.py:147,1005 both have falsy mcp_mode or config_name bug
- export_official_results.py:45 DEFAULT_REPO_BLOB_BASE points to old org
- Condensed Mar 17 additions section to make room (now Mar 17-18)

New findings from today: compare_configs.py already exists (PRD US-011
should inspect it before reimplementing); scaffold_contextbench_tasks.py:224
has an FD leak in a bash-embedded Python one-liner (invisible to Ruff).
diff --git a/AGENTS.md b/AGENTS.md
@@ -111,14 +111,15 @@ full operations manual.
 - Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`.
 - `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0`. Use `or 1`.
 - **TARGET_SUITE**: 55 stale, 220 missing. `dual_score_lib.sh` `scorer_artifact` always `"auto"`.
-- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env. `generate_eval_report.py:147` `mcp_mode or config_name` falls through on empty string.
+- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env. `generate_eval_report.py:147,1005` `mcp_mode or config_name` falls through on empty string (both sites).
 - `models.py` `from_dict()` mutates caller's dict via `.pop()`.
 
 ### Agent / Runner Robustness
 - **Agent `/tmp` race**: `claude_baseline_agent.py:1134` uses fixed `/tmp/claude_system_prompt.txt`, `/tmp/claude_run.sh`. Concurrent tasks cross-contaminate. Use `mktemp`.
-- **Token refresh**: `claude_baseline_agent.py:1523` only catches `HTTPError`. Add `URLError`/`socket.timeout`. `e.read()` leaks socket FD; use `with e:`.
+- **Token refresh**: `claude_baseline_agent.py:1523` only catches `HTTPError`; add `URLError`/`socket.timeout`.
+- **LOCOBENCH path**: `claude_baseline_agent.py:31` `LOCOBENCH_CLAUDE_MD_TEMPLATE` hardcodes `/home/stephanie_jarmak/CodeScaleBench`; crash on other machines.
 - **Runner pipefail**: `run_selected_tasks.sh:681` `harbor_run_guarded | tee || echo` -- `||` applies to `tee` (always 0). Add `set -o pipefail`.
-- **Runner cleanup**: No `trap` for temp dirs on early exit. `mktemp` failure (line 648) silently copies to CWD.
+- **Runner cleanup**: No `trap` for temp dirs. `mktemp` failure (line 648) silently copies to CWD.
 - **`grep -P` macOS**: `run_selected_tasks.sh:726` + 12 task test.sh files silently fail on BSD grep. Use `sed -n` or POSIX alternatives.
 - **`_common.sh` sparse array**: `unset` + `pids=("${pids[@]}")` doesn't compact sparse arrays in Bash; gaps persist (lines 1344-1352).
 
@@ -149,13 +150,13 @@ full operations manual.
 - Secret-detection false-positives: use `--no-verify` when flagged code is detection logic.
 - Ralph: `prd.json` single-active; archive before overwrite. `prd-archive/` and `prd.json` not gitignored.
 
-### Scripts / Code Quality (Mar 17 additions)
-- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench` path; fails on other machines.
-- `context_retrieval_agent.py:432,544,552,584` `shell=True` + "no allowlist" (line 429); injection risk.
-- Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103,117,134`; use temp+rename.
-- Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244,288`, `extract_v2_report_data.py:144,286`.
-- FD leaks: 17+ sites: `daytona_curator_runner.py:564`, `generate_csb_org_tasks.py:494`, `generate_promoted_verifiers.py:220`, `sync_oracle_files.py:50`, `validate_task_run.py:217`.
-- **Ruff** S603/S604, SIM115, BLE001 catch shell injection, FD leaks, bare excepts; add `pyproject.toml`. SIM115 skips `Popen(stdout=f)` (fix manually). `sanitize_secrets.py`: per-file S105/S106 ignores (intentional fake keys).
+### Scripts / Code Quality (Mar 17-18 additions)
+- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench` path; crash on other machines.
+- `context_retrieval_agent.py:432+` `shell=True` without allowlist; injection risk.
+- Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103+`; use temp+rename.
+- Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244+`, `extract_v2_report_data.py:144+`.
+- FD leaks: 17+ sites; use `with open()`. `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE` → old org `CodeScaleBench`; links 404.
+- **Ruff** S603/S604, SIM115, BLE001; add `pyproject.toml`. SIM115 skips `Popen(stdout=f)`. `sanitize_secrets.py`: S105/S106 per-file ignores.
 
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -111,14 +111,15 @@ full operations manual.
 - Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`.
 - `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0`. Use `or 1`.
 - **TARGET_SUITE**: 55 stale, 220 missing. `dual_score_lib.sh` `scorer_artifact` always `"auto"`.
-- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env. `generate_eval_report.py:147` `mcp_mode or config_name` falls through on empty string.
+- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env. `generate_eval_report.py:147,1005` `mcp_mode or config_name` falls through on empty string (both sites).
 - `models.py` `from_dict()` mutates caller's dict via `.pop()`.
 
 ### Agent / Runner Robustness
 - **Agent `/tmp` race**: `claude_baseline_agent.py:1134` uses fixed `/tmp/claude_system_prompt.txt`, `/tmp/claude_run.sh`. Concurrent tasks cross-contaminate. Use `mktemp`.
-- **Token refresh**: `claude_baseline_agent.py:1523` only catches `HTTPError`. Add `URLError`/`socket.timeout`. `e.read()` leaks socket FD; use `with e:`.
+- **Token refresh**: `claude_baseline_agent.py:1523` only catches `HTTPError`; add `URLError`/`socket.timeout`.
+- **LOCOBENCH path**: `claude_baseline_agent.py:31` `LOCOBENCH_CLAUDE_MD_TEMPLATE` hardcodes `/home/stephanie_jarmak/CodeScaleBench`; crash on other machines.
 - **Runner pipefail**: `run_selected_tasks.sh:681` `harbor_run_guarded | tee || echo` -- `||` applies to `tee` (always 0). Add `set -o pipefail`.
-- **Runner cleanup**: No `trap` for temp dirs on early exit. `mktemp` failure (line 648) silently copies to CWD.
+- **Runner cleanup**: No `trap` for temp dirs. `mktemp` failure (line 648) silently copies to CWD.
 - **`grep -P` macOS**: `run_selected_tasks.sh:726` + 12 task test.sh files silently fail on BSD grep. Use `sed -n` or POSIX alternatives.
 - **`_common.sh` sparse array**: `unset` + `pids=("${pids[@]}")` doesn't compact sparse arrays in Bash; gaps persist (lines 1344-1352).
 
@@ -149,13 +150,13 @@ full operations manual.
 - Secret-detection false-positives: use `--no-verify` when flagged code is detection logic.
 - Ralph: `prd.json` single-active; archive before overwrite. `prd-archive/` and `prd.json` not gitignored.
 
-### Scripts / Code Quality (Mar 17 additions)
-- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench` path; fails on other machines.
-- `context_retrieval_agent.py:432,544,552,584` `shell=True` + "no allowlist" (line 429); injection risk.
-- Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103,117,134`; use temp+rename.
-- Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244,288`, `extract_v2_report_data.py:144,286`.
-- FD leaks: 17+ sites: `daytona_curator_runner.py:564`, `generate_csb_org_tasks.py:494`, `generate_promoted_verifiers.py:220`, `sync_oracle_files.py:50`, `validate_task_run.py:217`.
-- **Ruff** S603/S604, SIM115, BLE001 catch shell injection, FD leaks, bare excepts; add `pyproject.toml`. SIM115 skips `Popen(stdout=f)` (fix manually). `sanitize_secrets.py`: per-file S105/S106 ignores (intentional fake keys).
+### Scripts / Code Quality (Mar 17-18 additions)
+- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench` path; crash on other machines.
+- `context_retrieval_agent.py:432+` `shell=True` without allowlist; injection risk.
+- Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103+`; use temp+rename.
+- Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244+`, `extract_v2_report_data.py:144+`.
+- FD leaks: 17+ sites; use `with open()`. `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE` → old org `CodeScaleBench`; links 404.
+- **Ruff** S603/S604, SIM115, BLE001; add `pyproject.toml`. SIM115 skips `Popen(stdout=f)`. `sanitize_secrets.py`: S105/S106 per-file ignores.
 
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
diff --git a/docs/ops/ROOT_AGENT_GUIDE.md b/docs/ops/ROOT_AGENT_GUIDE.md
@@ -111,14 +111,15 @@ full operations manual.
 - Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`.
 - `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0`. Use `or 1`.
 - **TARGET_SUITE**: 55 stale, 220 missing. `dual_score_lib.sh` `scorer_artifact` always `"auto"`.
-- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env. `generate_eval_report.py:147` `mcp_mode or config_name` falls through on empty string.
+- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env. `generate_eval_report.py:147,1005` `mcp_mode or config_name` falls through on empty string (both sites).
 - `models.py` `from_dict()` mutates caller's dict via `.pop()`.
 
 ### Agent / Runner Robustness
 - **Agent `/tmp` race**: `claude_baseline_agent.py:1134` uses fixed `/tmp/claude_system_prompt.txt`, `/tmp/claude_run.sh`. Concurrent tasks cross-contaminate. Use `mktemp`.
-- **Token refresh**: `claude_baseline_agent.py:1523` only catches `HTTPError`. Add `URLError`/`socket.timeout`. `e.read()` leaks socket FD; use `with e:`.
+- **Token refresh**: `claude_baseline_agent.py:1523` only catches `HTTPError`; add `URLError`/`socket.timeout`.
+- **LOCOBENCH path**: `claude_baseline_agent.py:31` `LOCOBENCH_CLAUDE_MD_TEMPLATE` hardcodes `/home/stephanie_jarmak/CodeScaleBench`; crash on other machines.
 - **Runner pipefail**: `run_selected_tasks.sh:681` `harbor_run_guarded | tee || echo` -- `||` applies to `tee` (always 0). Add `set -o pipefail`.
-- **Runner cleanup**: No `trap` for temp dirs on early exit. `mktemp` failure (line 648) silently copies to CWD.
+- **Runner cleanup**: No `trap` for temp dirs. `mktemp` failure (line 648) silently copies to CWD.
 - **`grep -P` macOS**: `run_selected_tasks.sh:726` + 12 task test.sh files silently fail on BSD grep. Use `sed -n` or POSIX alternatives.
 - **`_common.sh` sparse array**: `unset` + `pids=("${pids[@]}")` doesn't compact sparse arrays in Bash; gaps persist (lines 1344-1352).
 
@@ -149,13 +150,13 @@ full operations manual.
 - Secret-detection false-positives: use `--no-verify` when flagged code is detection logic.
 - Ralph: `prd.json` single-active; archive before overwrite. `prd-archive/` and `prd.json` not gitignored.
 
-### Scripts / Code Quality (Mar 17 additions)
-- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench` path; fails on other machines.
-- `context_retrieval_agent.py:432,544,552,584` `shell=True` + "no allowlist" (line 429); injection risk.
-- Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103,117,134`; use temp+rename.
-- Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244,288`, `extract_v2_report_data.py:144,286`.
-- FD leaks: 17+ sites: `daytona_curator_runner.py:564`, `generate_csb_org_tasks.py:494`, `generate_promoted_verifiers.py:220`, `sync_oracle_files.py:50`, `validate_task_run.py:217`.
-- **Ruff** S603/S604, SIM115, BLE001 catch shell injection, FD leaks, bare excepts; add `pyproject.toml`. SIM115 skips `Popen(stdout=f)` (fix manually). `sanitize_secrets.py`: per-file S105/S106 ignores (intentional fake keys).
+### Scripts / Code Quality (Mar 17-18 additions)
+- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench` path; crash on other machines.
+- `context_retrieval_agent.py:432+` `shell=True` without allowlist; injection risk.
+- Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103+`; use temp+rename.
+- Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244+`, `extract_v2_report_data.py:144+`.
+- FD leaks: 17+ sites; use `with open()`. `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE` → old org `CodeScaleBench`; links 404.
+- **Ruff** S603/S604, SIM115, BLE001; add `pyproject.toml`. SIM115 skips `Popen(stdout=f)`. `sanitize_secrets.py`: S105/S106 per-file ignores.
 
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
diff --git a/reports/nightly/2026-03-19-review.md b/reports/nightly/2026-03-19-review.md
@@ -0,0 +1,126 @@
+# Nightly Research Report — 2026-03-19 (Report #13)
+
+## Executive Summary
+
+Seven consecutive days with zero code fixes. This report reviews the four JSONL sessions from March 18, all of which were workflow sessions (no code changes, no new code bugs introduced). The most notable outcome is that the active PRD now targets **task metadata auto-repair + DuckDB result store** (`ralph/task-metadata-duckdb-2026-03-18`, 12 stories). A codebase sweep today surfaced two new findings: `compare_configs.py` already exists (the PRD's Part 2 assumed it had to be built from scratch), and `scaffold_contextbench_tasks.py` embeds an FD-leaking Python one-liner in a Bash command substitution that the prior FD-leak audit missed.
+
+---
+
+## 1. March 18 Session Review
+
+### Session: 643ba935 (Learnings Extraction)
+**What happened:** Standard learnings extraction session reviewing the four Mar 17 JSONL sessions. All findings were already captured from prior review cycles; no net-new content. Confirmed the ROOT_AGENT_GUIDE.md 12,288-byte limit workflow: new additions require targeted condensing, and every round of additions must be followed by `wc -c` to verify.
+
+**Key confirmation:** Ruff SIM115 cannot auto-fix `Popen(stdout=f)` sites — those require manual restructuring. `sanitize_secrets.py` needs per-file `S105`/`S106` ignores because the file intentionally contains fake API key patterns for detection testing.
+
+---
+
+### Session: d3d75c4a (Nightly Report #12 Generation)
+**What happened:** Automated nightly research session. Agent read recent reports, ran parallel exploration agents across the codebase, and published report #12. All findings from this session are already documented in CLAUDE.md (added in this session's report).
+
+**Findings surfaced (already in CLAUDE.md):**
+- `claude_baseline_agent.py:31` — `LOCOBENCH_CLAUDE_MD_TEMPLATE` hardcoded to `/home/stephanie_jarmak/CodeScaleBench`
+- `export_official_results.py:45` — stale org URL `CodeScaleBench` → all exported links 404
+- `generate_eval_report.py:1005` — falsy bug repeats (previously only line 147 catalogued)
+- 274 tasks × 2 missing metadata fields = 548 undetected zero-result queries
+- MEMORY.md dashboard entry corrected (dashboard/app.py does not exist)
+
+---
+
+### Session: 01f568ae (PRD Writing)
+**What happened:** Wrote `tasks/prd-compound-2026-03-18.md` covering the recommended next feature from nightly report #12. Session duration: ~82 seconds of tool use. No code changes.
+
+**PRD structure (12 stories in 3 parts):**
+
+| Part | Stories | Scope |
+|------|---------|-------|
+| 1 — Task Metadata Repair | 6 | `repair_task_metadata.py`: infer verification_modes + use_case_category, atomic write, --validate, repo_health check |
+| 2 — DuckDB Result Store | 6 | `init_result_db.py`: schema init, run ingest, parquet seed, SQL CLI, compare_configs integration |
+| 3 — Integration Quality | 2 | Ruff compliance, unit tests for inference + ingest |
+
+**Inference logic specified in PRD:**
+- `test.sh` sources `dual_score_lib.sh` → `verification_modes: ["direct", "artifact"]`
+- `test.sh` sources `answer_json_verifier_lib.sh` → `verification_modes: ["artifact"]`
+- Neither → `verification_modes: ["direct"]`
+- `use_case_category`: strip `csb_sdlc_` or `csb_org_` prefix from suite directory name
+
+---
+
+### Session: b6b0c311 (Ralph PRD Conversion)
+**What happened:** Converted `prd-compound-2026-03-18.md` to `prd.json` using the Ralph skill. Correctly archived the previous code-quality-gate PRD to `prd-archive/prd-code-quality-gate-2026-03-17.json` before overwriting. No issues.
+
+**Active PRD:** `ralph/task-metadata-duckdb-2026-03-18`, 12 user stories ordered by dependency (US-001 through US-012).
+
+---
+
+## 2. New Findings from Today's Investigation
+
+### 2.1 compare_configs.py Already Exists
+
+`scripts/compare_configs.py` is a fully functional script for comparing benchmark run configurations. The PRD (Part 2, Story 2.5) describes building this script, but it already exists.
+
+**Impact:** Part 2 of the PRD needs to be revised before implementation begins. The existing script may already satisfy or partially satisfy Story 2.5 (`compare_configs.py` — DuckDB-backed delta report between two run IDs). The Ralph agent should inspect the existing file before implementing Story 2.5 to avoid a duplicate.
+
+**Action:** Before starting Story 2.5, read `scripts/compare_configs.py` and assess whether it needs a DuckDB integration layer or a full rewrite.
+
+---
+
+### 2.2 FD Leak in Embedded Python Subprocess (scaffold_contextbench_tasks.py)
+
+`scripts/scaffold_contextbench_tasks.py:224` contains an embedded Python one-liner inside a Bash variable assignment:
+
+```bash
+REWARD=$(python3 -c "import json; print(json.load(open('/logs/verifier/reward.json')).get('reward', 0.0))")
+```
+
+This `open()` call inside the `-c` string is never explicitly closed. Because the process exits immediately, the real-world impact is limited, but it is the same pattern tracked under the FD leak bug category (Ruff SIM115). The prior FD leak audit missed it because the one-liner is embedded in a Bash snippet rather than written as ordinary Python code, and Ruff does not lint Python embedded in shell strings.
+
+**Pattern note:** FD leaks in bash-embedded Python one-liners (`python3 -c "... open(...) ..."`) are invisible to Ruff. The code quality gate PRD should add a grep-based check or a bash linter for this pattern.
+
+---
+
+### 2.3 repair_task_metadata.py and init_result_db.py Do Not Exist
+
+Confirmed: neither `scripts/repair_task_metadata.py` nor `scripts/init_result_db.py` exists yet. The PRD is the authoritative spec; no partial implementation has been started.
+
+---
+
+## 3. Architecture Status
+
+| Component | Status |
+|-----------|--------|
+| `scripts/repair_task_metadata.py` | Not yet built (PRD US-001 through US-004) |
+| `scripts/init_result_db.py` | Not yet built (PRD US-005 through US-010) |
+| `scripts/compare_configs.py` | EXISTS — inspect before implementing PRD US-011 |
+| `data/contextbench/*.parquet` | Exist; DuckDB seed path is ready |
+| `benchmarks/_shared/` | Does not exist; verifier lib deduplication not yet started |
+| `pyproject.toml` | Does not exist; Ruff/pre-commit not yet configured |
+
+---
+
+## 4. Issues Added to CLAUDE.md This Session
+
+- `claude_baseline_agent.py:31` `LOCOBENCH_CLAUDE_MD_TEMPLATE` hardcodes path (moved from MEMORY.md to ROOT_AGENT_GUIDE.md)
+- `generate_eval_report.py:147,1005` — both sites have the falsy bug (`:1005` was missing from prior CLAUDE.md entry)
+- `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE` → stale `CodeScaleBench` org; links 404
+- ROOT_AGENT_GUIDE.md byte limit workflow documented: condense existing content, verify with `wc -c`
+
+---
+
+## 5. Recommended Next Action
+
+The Ralph agent should begin work on the active PRD (`ralph/task-metadata-duckdb-2026-03-18`). Story order:
+
+1. **US-001**: `repair_task_metadata.py` with `--dry-run` and inference logic
+2. **US-002**: Atomic write mode
+3. **US-003**: `--validate` mode
+4. **US-004**: `task_metadata_complete` check in `repo_health.py`
+5. **US-005**: `init_result_db.py --init` (DuckDB schema)
+6. **US-006 through US-010**: Ingest, query, seed from parquet
+7. **US-011**: Check existing `compare_configs.py` first — may only need DuckDB layer added
+
+**Before starting US-011**, read `scripts/compare_configs.py` to assess scope.
+
+---
+
+*Remediation velocity: 7 consecutive days without a code fix (Mar 12 → Mar 19). ~100 open issues across 13 reports. Active PRD: `ralph/task-metadata-duckdb-2026-03-18` (12 stories).*
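
The inference rules listed under session 01f568ae in the report above can be sketched as follows. This is a minimal illustration, not the PRD's implementation: `repair_task_metadata.py` does not exist yet, and the function names, the line-based detection of `source` / `.` statements, the precedence when both libraries are sourced, and the fallback for unprefixed suite names are all assumptions made for the example.

```python
# Hypothetical sketch of the metadata inference rules summarized in the report.
# Only the mapping itself (which sourced library implies which modes, and which
# prefixes are stripped) comes from the report; everything else is assumed.

KNOWN_LIBS = ("dual_score_lib.sh", "answer_json_verifier_lib.sh")
SUITE_PREFIXES = ("csb_sdlc_", "csb_org_")


def sourced_libs(test_sh_text: str) -> set[str]:
    """Return the known verifier libs that test.sh appears to source.

    Assumption: a lib counts as sourced when a line starts with `source ` or
    `. ` and mentions the lib's file name. Commented-out lines are ignored
    because they start with `#`.
    """
    found = set()
    for line in test_sh_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(("source ", ". ")):
            for lib in KNOWN_LIBS:
                if lib in stripped:
                    found.add(lib)
    return found


def infer_verification_modes(test_sh_text: str) -> list[str]:
    """Map sourced libs to verification_modes per the rules in the report."""
    libs = sourced_libs(test_sh_text)
    # Assumption: if both libs are sourced, the dual-score rule wins.
    if "dual_score_lib.sh" in libs:
        return ["direct", "artifact"]
    if "answer_json_verifier_lib.sh" in libs:
        return ["artifact"]
    return ["direct"]


def infer_use_case_category(suite_dir_name: str) -> str:
    """Strip a `csb_sdlc_` or `csb_org_` prefix from the suite directory name."""
    for prefix in SUITE_PREFIXES:
        if suite_dir_name.startswith(prefix):
            return suite_dir_name[len(prefix):]
    # Assumption: names without a known prefix are returned unchanged.
    return suite_dir_name
```

Keeping both functions pure (text in, value out) would let the unit tests planned in Part 3 of the PRD exercise the inference without touching the filesystem; a `--dry-run` mode could then print the inferred values before any write happens.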