Commit 1f03f98

docs: add 7 new gotchas from 18 session archive reviews
New learnings extracted from unreviewed claude-archive sessions:
- Gitignore: unanchored patterns match any directory level
- MCP: correct Sourcegraph env var names
- Harbor: token data in trajectory.json, reward.txt contract
- Validation: LoCoBench task ID parsing with structural anchors
- Dashboard: process handle persistence, st.dataframe preference, metric precision (4+ decimals)

Also regenerates SCRIPT_INDEX and registry.json.
1 parent 611775f commit 1f03f98

5 files changed: +42 -129 lines changed

AGENTS.md

Lines changed: 10 additions & 0 deletions
@@ -76,17 +76,24 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
 - Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
 - Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
+- Sourcegraph MCP env vars are `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN`. Do NOT use `SOURCEGRAPH_ENDPOINT` or `SOURCEGRAPH_TOKEN` -- those are wrong variable names.
 
 ### Harbor Result Format
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
 - `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
 - SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
+- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
+- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
 
 ### Validation / Scoring
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
 - Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
 - Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
 - Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
+- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
+
+### Gitignore
+- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
 
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
@@ -106,6 +113,9 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
 - Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
 - Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
+- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
+- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
+- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
 
 ### LLM Judge
 - Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.

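The LoCoBench parsing note added in this diff can be illustrated with a small sketch. It assumes only what the note states: fields are underscore-separated, some fields span several words, and a 3-digit task number sits somewhere in the middle. The function name `parse_task_id`, the `ParsedTaskId` type, and the choice to return the fields on either side as raw strings are hypothetical and not taken from the repository.

```python
import re
from typing import NamedTuple, Optional


class ParsedTaskId(NamedTuple):
    prefix_fields: str  # everything before the task number (may contain multi-word fields)
    task_number: str    # the 3-digit anchor, kept as a string to preserve zero padding
    suffix_fields: str  # everything after the task number (may be empty)


# Assumption: the first underscore-delimited group of exactly three digits is the task number.
_ANCHOR = re.compile(r"(?:^|_)(\d{3})(?:_|$)")


def parse_task_id(task_id: str) -> Optional[ParsedTaskId]:
    """Split an ID around its 3-digit task number instead of counting fields."""
    match = _ANCHOR.search(task_id)
    if match is None:
        return None
    return ParsedTaskId(
        prefix_fields=task_id[: match.start()],
        task_number=match.group(1),
        suffix_fields=task_id[match.end():],
    )


if __name__ == "__main__":
    # Truncated trial directory name quoted in the diff above.
    print(parse_task_id("c_api_graphql_expert_079_archite__pm9xcPn"))
```

Because the split is anchored on the number, a multi-word field such as `game_engine` on either side does not shift the result, which is the failure mode the note attributes to rigid per-field regexes.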
CLAUDE.md

Lines changed: 10 additions & 0 deletions
@@ -76,17 +76,24 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
 - Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
 - Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
+- Sourcegraph MCP env vars are `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN`. Do NOT use `SOURCEGRAPH_ENDPOINT` or `SOURCEGRAPH_TOKEN` -- those are wrong variable names.
 
 ### Harbor Result Format
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
 - `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
 - SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
+- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
+- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
 
 ### Validation / Scoring
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
 - Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
 - Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
 - Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
+- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
+
+### Gitignore
+- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
 
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
@@ -106,6 +113,9 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
 - Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
 - Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
+- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
+- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
+- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
 
 ### LLM Judge
 - Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.

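The Validation / Scoring context in this diff advises checking that a binary exists and is executable rather than trusting an "INSTALL_SUCCESS" message. A minimal sketch of such a check, using only the standard library, might look like the following. The helper name `resolve_executable` is hypothetical, and the repository's own verification may work differently.

```python
import os
import shutil
from typing import Optional


def resolve_executable(name_or_path: str) -> Optional[str]:
    """Return the absolute path of an executable, or None if it is missing or not executable.

    shutil.which accepts either a bare command name (looked up on PATH) or an
    explicit path, and by default requires that the file exists and is executable.
    """
    found = shutil.which(name_or_path)
    return os.path.abspath(found) if found else None


if __name__ == "__main__":
    # "claude" is only an illustrative command name here.
    path = resolve_executable("claude")
    print("installed at", path if path else "<not found>")
```

A check like this only confirms that the file is present and executable; it does not prove the agent ran, which is why the diff pairs it with the separate "<2 seconds" smoke-test heuristic.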
docs/ops/ROOT_AGENT_GUIDE.md

Lines changed: 10 additions & 0 deletions
@@ -76,17 +76,24 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
 - Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
 - Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
+- Sourcegraph MCP env vars are `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN`. Do NOT use `SOURCEGRAPH_ENDPOINT` or `SOURCEGRAPH_TOKEN` -- those are wrong variable names.
 
 ### Harbor Result Format
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
 - `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
 - SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
+- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
+- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
 
 ### Validation / Scoring
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
 - Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
 - Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
 - Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
+- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
+
+### Gitignore
+- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
 
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
@@ -106,6 +113,9 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
 - Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
 - Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
+- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
+- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
+- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
 
 ### LLM Judge
 - Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.

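The dashboard note on process handles suggests file-based tracking under `.dashboard_runs/`. The sketch below shows one way that could look. The directory name comes from the note; the helper names `record_run` and `load_runs`, the one-file-per-run layout, and the `{"run_id", "pid"}` JSON schema are assumptions made for illustration only.

```python
import json
from pathlib import Path
from typing import Dict

RUNS_DIR = Path(".dashboard_runs")  # directory name taken from the note in the diff


def record_run(run_id: str, pid: int, runs_dir: Path = RUNS_DIR) -> Path:
    """Persist a background process's PID so it survives a browser refresh."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    path = runs_dir / f"{run_id}.json"
    path.write_text(json.dumps({"run_id": run_id, "pid": pid}))
    return path


def load_runs(runs_dir: Path = RUNS_DIR) -> Dict[str, int]:
    """Rebuild the run_id -> pid mapping from disk at the start of a session."""
    runs: Dict[str, int] = {}
    if not runs_dir.is_dir():
        return runs
    for path in sorted(runs_dir.glob("*.json")):
        data = json.loads(path.read_text())
        runs[data["run_id"]] = data["pid"]
    return runs
```

In a Streamlit app, `load_runs()` would typically be called near the top of the script so that each rerun or fresh session sees the same set of runs, instead of relying on handles kept in `st.session_state`.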
docs/ops/SCRIPT_INDEX.md

Lines changed: 1 addition & 14 deletions
@@ -32,13 +32,9 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Analysis & Comparison
 
-- `scripts/analyze_harness_design.py` - Analysis/comparison script for analyze harness design.
 - `scripts/analyze_mcp_unique_haiku.py` - Analysis/comparison script for analyze mcp unique haiku.
-- `scripts/analyze_minimum_subset.py` - Analysis/comparison script for analyze minimum subset.
 - `scripts/analyze_paired_cost_official_raw.py` - Analysis/comparison script for analyze paired cost official raw.
-- `scripts/analyze_rq_power.py` - Analysis/comparison script for analyze rq power.
 - `scripts/analyze_run_coverage.py` - Analysis/comparison script for analyze run coverage.
-- `scripts/analyze_size_effects.py` - Analysis/comparison script for analyze size effects.
 - `scripts/audit_traces.py` - Analysis/comparison script for audit traces.
 - `scripts/compare_configs.py` - Compares benchmark outcomes across configs on matched task sets.
 - `scripts/comprehensive_analysis.py` - Analysis/comparison script for comprehensive analysis.
@@ -115,7 +111,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Infra & Mirrors
 
-- `scripts/build_conversation_db.py` - Infrastructure or mirror management script for build conversation db.
 - `scripts/build_core_manifest.py` - Infrastructure or mirror management script for build core manifest.
 - `scripts/build_daytona_registry.py` - Infrastructure or mirror management script for build daytona registry.
 - `scripts/build_linux_base_images.sh` - Infrastructure or mirror management script for build linux base images.
@@ -176,6 +171,7 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Misc
 
+- `scripts/account_health.py` - Utility script for account health.
 - `scripts/add_verification_metadata.py` - Utility script for add verification metadata.
 - `scripts/audit_gt_coverage.py` - Utility script for audit gt coverage.
 - `scripts/audit_official_scores.py` - Utility script for audit official scores.
@@ -188,8 +184,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/check_harness_readiness.py` - Utility script for check harness readiness.
 - `scripts/collect_repo_cloc.py` - Utility script for collect repo cloc.
 - `scripts/compare_contextbench_results.py` - Utility script for compare contextbench results.
-- `scripts/compare_old_new_ground_truth.py` - Utility script for compare old new ground truth.
-- `scripts/compute_analysis_ir_metrics.py` - Utility script for compute analysis ir metrics.
 - `scripts/compute_bootstrap_cis.py` - Utility script for compute bootstrap cis.
 - `scripts/context_retrieval_agent.py` - Utility script for context retrieval agent.
 - `scripts/control_plane.py` - Utility script for control plane.
@@ -200,7 +194,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/daytona_curator_runner.py` - Utility script for daytona curator runner.
 - `scripts/daytona_poc_runner.py` - Utility script for daytona poc runner.
 - `scripts/daytona_runner.py` - Utility script for daytona runner.
-- `scripts/daytona_snapshot_cleanup.py` - Utility script for daytona snapshot cleanup.
 - `scripts/dependeval_eval_dr.py` - Utility script for dependeval eval dr.
 - `scripts/dependeval_eval_me.py` - Utility script for dependeval eval me.
 - `scripts/derive_n_repos.py` - Utility script for derive n repos.
@@ -209,8 +202,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/doe_select_tasks.py` - Utility script for doe select tasks.
 - `scripts/ds_hybrid_retrieval.py` - Utility script for ds hybrid retrieval.
 - `scripts/ds_wrapper.sh` - Utility script for ds wrapper.
-- `scripts/export_conversation_blog_assets.py` - Utility script for export conversation blog assets.
-- `scripts/export_engineering_diary_assets.py` - Utility script for export engineering diary assets.
 - `scripts/export_official_results.py` - Utility script for export official results.
 - `scripts/extract_analysis_metrics.py` - Utility script for extract analysis metrics.
 - `scripts/extract_build_diary.py` - Utility script for extract build diary.
@@ -235,8 +226,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/plot_build_diary.py` - Utility script for plot build diary.
 - `scripts/plot_build_diary_supplementary.py` - Utility script for plot build diary supplementary.
 - `scripts/plot_build_narrative.py` - Utility script for plot build narrative.
-- `scripts/plot_conversation_blog_svgs.py` - Utility script for plot conversation blog svgs.
-- `scripts/plot_csb_mcp_blog_figures.py` - Utility script for plot csb mcp blog figures.
 - `scripts/prepare_analysis_runs.py` - Utility script for prepare analysis runs.
 - `scripts/promote_agent_oracles.py` - Utility script for promote agent oracles.
 - `scripts/promote_blocked.py` - Utility script for promote blocked.
@@ -257,8 +246,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/run_judge.py` - Utility script for run judge.
 - `scripts/run_missing_oracles.sh` - Utility script for run missing oracles.
 - `scripts/run_scaling_gap_oracles.sh` - Utility script for run scaling gap oracles.
-- `scripts/run_sg_local.sh` - Utility script for run sg local.
-- `scripts/run_sg_validation.py` - Utility script for run sg validation.
 - `scripts/scaffold_contextbench_tasks.py` - Utility script for scaffold contextbench tasks.
 - `scripts/scaffold_feature_tasks.py` - Utility script for scaffold feature tasks.
 - `scripts/scaffold_refactor_tasks.py` - Utility script for scaffold refactor tasks.
