docs: add 7 new gotchas from 18 session archive reviews

sjarmak · sjarmak · commit 1f03f980d93d · 2026-03-08T22:36:15.000-04:00
New learnings extracted from unreviewed claude-archive sessions:
- Gitignore: unanchored patterns match any directory level
- MCP: correct Sourcegraph env var names
- Harbor: token data in trajectory.json, reward.txt contract
- Validation: LoCoBench task ID parsing with structural anchors
- Dashboard: process handle persistence, st.dataframe preference,
  metric precision (4+ decimals)
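
A minimal sketch of the metric-precision point, assuming rewards are plain floats; the helper name `format_metric` and the two sample values are hypothetical and not taken from the dashboard code. It shows two distinct rewards collapsing to the same string at 2 decimals and staying distinguishable at 4.

```python
def format_metric(value: float, places: int = 4) -> str:
    """Format a reward or duration with a fixed number of decimal places.

    Hypothetical helper for illustration; the default of 4 places follows
    the "4+ decimals" guidance in this commit.
    """
    return f"{value:.{places}f}"


# Two made-up rewards that differ only in the third decimal place.
reward_a = 0.8312
reward_b = 0.8349

# At 2 decimals both render identically, so the difference is hidden.
two_places = (format_metric(reward_a, 2), format_metric(reward_b, 2))

# At 4 decimals the two values remain distinguishable.
four_places = (format_metric(reward_a), format_metric(reward_b))

print(two_places)
print(four_places)
```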

Also regenerates SCRIPT_INDEX and registry.json.
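
The LoCoBench parsing gotcha can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's validator: the `ParsedTaskId` type, the `parse_task_id` function, and the sample ID are hypothetical, and the only structural fact relied on is that the ID contains a 3-digit task number between underscore-separated fields. Everything before the number is kept as one prefix and everything after as one suffix, so multi-word fields such as `game_engine` or `cross_file_refactoring` are never split incorrectly.

```python
from typing import NamedTuple


class ParsedTaskId(NamedTuple):
    """Hypothetical container: fields before and after the 3-digit anchor."""
    prefix: str
    number: int
    suffix: str


def parse_task_id(task_id: str) -> ParsedTaskId:
    """Split a task ID around its first 3-digit token.

    The 3-digit token acts as a positional anchor, so no assumption is
    made about how many words each surrounding field contains.
    """
    tokens = task_id.split("_")
    for i, token in enumerate(tokens):
        is_anchor = len(token) == 3 and token.isascii() and token.isdigit()
        # The anchor must have at least one field on each side.
        if is_anchor and 0 < i < len(tokens) - 1:
            return ParsedTaskId(
                prefix="_".join(tokens[:i]),
                number=int(token),
                suffix="_".join(tokens[i + 1:]),
            )
    raise ValueError(f"no 3-digit task number found in {task_id!r}")


# Made-up ID for illustration only; real LoCoBench IDs may differ.
example = parse_task_id("c_game_engine_expert_079_cross_file_refactoring_hard")
print(example)
```

Splitting the prefix and suffix further (language, domain, difficulty, and so on) would depend on the actual ID schema, which this commit does not spell out.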
diff --git a/AGENTS.md b/AGENTS.md
@@ -76,17 +76,24 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
 - Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
 - Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
+- Sourcegraph MCP env vars are `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN`. Do NOT use `SOURCEGRAPH_ENDPOINT` or `SOURCEGRAPH_TOKEN` -- those are wrong variable names.
 
 ### Harbor Result Format
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
 - `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
 - SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
+- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
+- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
 
 ### Validation / Scoring
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
 - Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
 - Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
 - Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
+- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
+
+### Gitignore
+- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
 
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
@@ -106,6 +113,9 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
 - Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
 - Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
+- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
+- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
+- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
 
 ### LLM Judge
 - Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -76,17 +76,24 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
 - Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
 - Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
+- Sourcegraph MCP env vars are `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN`. Do NOT use `SOURCEGRAPH_ENDPOINT` or `SOURCEGRAPH_TOKEN` -- those are wrong variable names.
 
 ### Harbor Result Format
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
 - `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
 - SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
+- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
+- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
 
 ### Validation / Scoring
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
 - Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
 - Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
 - Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
+- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
+
+### Gitignore
+- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
 
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
@@ -106,6 +113,9 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
 - Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
 - Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
+- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
+- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
+- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
 
 ### LLM Judge
 - Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
diff --git a/docs/ops/ROOT_AGENT_GUIDE.md b/docs/ops/ROOT_AGENT_GUIDE.md
@@ -76,17 +76,24 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
 - Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
 - Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
+- Sourcegraph MCP env vars are `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN`. Do NOT use `SOURCEGRAPH_ENDPOINT` or `SOURCEGRAPH_TOKEN` -- those are wrong variable names.
 
 ### Harbor Result Format
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
 - `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
 - SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
+- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
+- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
 
 ### Validation / Scoring
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
 - Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
 - Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
 - Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
+- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
+
+### Gitignore
+- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
 
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
@@ -106,6 +113,9 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
 - Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
 - Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
+- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
+- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
+- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
 
 ### LLM Judge
 - Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
diff --git a/docs/ops/SCRIPT_INDEX.md b/docs/ops/SCRIPT_INDEX.md
@@ -32,13 +32,9 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Analysis & Comparison
 
-- `scripts/analyze_harness_design.py` - Analysis/comparison script for analyze harness design.
 - `scripts/analyze_mcp_unique_haiku.py` - Analysis/comparison script for analyze mcp unique haiku.
-- `scripts/analyze_minimum_subset.py` - Analysis/comparison script for analyze minimum subset.
 - `scripts/analyze_paired_cost_official_raw.py` - Analysis/comparison script for analyze paired cost official raw.
-- `scripts/analyze_rq_power.py` - Analysis/comparison script for analyze rq power.
 - `scripts/analyze_run_coverage.py` - Analysis/comparison script for analyze run coverage.
-- `scripts/analyze_size_effects.py` - Analysis/comparison script for analyze size effects.
 - `scripts/audit_traces.py` - Analysis/comparison script for audit traces.
 - `scripts/compare_configs.py` - Compares benchmark outcomes across configs on matched task sets.
 - `scripts/comprehensive_analysis.py` - Analysis/comparison script for comprehensive analysis.
@@ -115,7 +111,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Infra & Mirrors
 
-- `scripts/build_conversation_db.py` - Infrastructure or mirror management script for build conversation db.
 - `scripts/build_core_manifest.py` - Infrastructure or mirror management script for build core manifest.
 - `scripts/build_daytona_registry.py` - Infrastructure or mirror management script for build daytona registry.
 - `scripts/build_linux_base_images.sh` - Infrastructure or mirror management script for build linux base images.
@@ -176,6 +171,7 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Misc
 
+- `scripts/account_health.py` - Utility script for account health.
 - `scripts/add_verification_metadata.py` - Utility script for add verification metadata.
 - `scripts/audit_gt_coverage.py` - Utility script for audit gt coverage.
 - `scripts/audit_official_scores.py` - Utility script for audit official scores.
@@ -188,8 +184,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/check_harness_readiness.py` - Utility script for check harness readiness.
 - `scripts/collect_repo_cloc.py` - Utility script for collect repo cloc.
 - `scripts/compare_contextbench_results.py` - Utility script for compare contextbench results.
-- `scripts/compare_old_new_ground_truth.py` - Utility script for compare old new ground truth.
-- `scripts/compute_analysis_ir_metrics.py` - Utility script for compute analysis ir metrics.
 - `scripts/compute_bootstrap_cis.py` - Utility script for compute bootstrap cis.
 - `scripts/context_retrieval_agent.py` - Utility script for context retrieval agent.
 - `scripts/control_plane.py` - Utility script for control plane.
@@ -200,7 +194,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/daytona_curator_runner.py` - Utility script for daytona curator runner.
 - `scripts/daytona_poc_runner.py` - Utility script for daytona poc runner.
 - `scripts/daytona_runner.py` - Utility script for daytona runner.
-- `scripts/daytona_snapshot_cleanup.py` - Utility script for daytona snapshot cleanup.
 - `scripts/dependeval_eval_dr.py` - Utility script for dependeval eval dr.
 - `scripts/dependeval_eval_me.py` - Utility script for dependeval eval me.
 - `scripts/derive_n_repos.py` - Utility script for derive n repos.
@@ -209,8 +202,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/doe_select_tasks.py` - Utility script for doe select tasks.
 - `scripts/ds_hybrid_retrieval.py` - Utility script for ds hybrid retrieval.
 - `scripts/ds_wrapper.sh` - Utility script for ds wrapper.
-- `scripts/export_conversation_blog_assets.py` - Utility script for export conversation blog assets.
-- `scripts/export_engineering_diary_assets.py` - Utility script for export engineering diary assets.
 - `scripts/export_official_results.py` - Utility script for export official results.
 - `scripts/extract_analysis_metrics.py` - Utility script for extract analysis metrics.
 - `scripts/extract_build_diary.py` - Utility script for extract build diary.
@@ -235,8 +226,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/plot_build_diary.py` - Utility script for plot build diary.
 - `scripts/plot_build_diary_supplementary.py` - Utility script for plot build diary supplementary.
 - `scripts/plot_build_narrative.py` - Utility script for plot build narrative.
-- `scripts/plot_conversation_blog_svgs.py` - Utility script for plot conversation blog svgs.
-- `scripts/plot_csb_mcp_blog_figures.py` - Utility script for plot csb mcp blog figures.
 - `scripts/prepare_analysis_runs.py` - Utility script for prepare analysis runs.
 - `scripts/promote_agent_oracles.py` - Utility script for promote agent oracles.
 - `scripts/promote_blocked.py` - Utility script for promote blocked.
@@ -257,8 +246,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/run_judge.py` - Utility script for run judge.
 - `scripts/run_missing_oracles.sh` - Utility script for run missing oracles.
 - `scripts/run_scaling_gap_oracles.sh` - Utility script for run scaling gap oracles.
-- `scripts/run_sg_local.sh` - Utility script for run sg local.
-- `scripts/run_sg_validation.py` - Utility script for run sg validation.
 - `scripts/scaffold_contextbench_tasks.py` - Utility script for scaffold contextbench tasks.
 - `scripts/scaffold_feature_tasks.py` - Utility script for scaffold feature tasks.
 - `scripts/scaffold_refactor_tasks.py` - Utility script for scaffold refactor tasks.
diff --git a/scripts/registry.json b/scripts/registry.json