You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
docs: add 7 new gotchas from 18 session archive reviews
New learnings extracted from unreviewed claude-archive sessions:
- Gitignore: unanchored patterns match any directory level
- MCP: correct Sourcegraph env var names
- Harbor: token data in trajectory.json, reward.txt contract
- Validation: LoCoBench task ID parsing with structural anchors
- Dashboard: process handle persistence, st.dataframe preference,
metric precision (4+ decimals)
Also regenerates SCRIPT_INDEX and registry.json.
- Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
77
77
- Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
78
78
- Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
79
+
- Sourcegraph MCP env vars are `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN`. Do NOT use `SOURCEGRAPH_ENDPOINT` or `SOURCEGRAPH_TOKEN` -- those are wrong variable names.
79
80
80
81
### Harbor Result Format
81
82
- Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
82
83
-`trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
83
84
- SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
85
+
- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
86
+
- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
84
87
85
88
### Validation / Scoring
86
89
-`validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
87
90
- Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
88
91
- Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
89
92
- Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
93
+
- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
94
+
95
+
### Gitignore
96
+
- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
90
97
91
98
### Git / Auth
92
99
-`gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
-`st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
107
114
- Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
108
115
- Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
116
+
- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
117
+
- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
118
+
- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
109
119
110
120
### LLM Judge
111
121
- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
- Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
77
77
- Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
78
78
- Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
79
+
- Sourcegraph MCP env vars are `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN`. Do NOT use `SOURCEGRAPH_ENDPOINT` or `SOURCEGRAPH_TOKEN` -- those are wrong variable names.
79
80
80
81
### Harbor Result Format
81
82
- Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
82
83
-`trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
83
84
- SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
85
+
- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
86
+
- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
84
87
85
88
### Validation / Scoring
86
89
-`validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
87
90
- Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
88
91
- Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
89
92
- Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
93
+
- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
94
+
95
+
### Gitignore
96
+
- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
90
97
91
98
### Git / Auth
92
99
-`gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
-`st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
107
114
- Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
108
115
- Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
116
+
- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
117
+
- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
118
+
- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
109
119
110
120
### LLM Judge
111
121
- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
- Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
77
77
- Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP endpoints. HTTP 405 from the endpoint means it exists but requires stdio.
78
78
- Sourcegraph skills installed via `npx -y skills add` show empty `"skills": []` in headless/containerized mode. Embed skill prompt content directly in the task's CLAUDE.md instead.
79
+
- Sourcegraph MCP env vars are `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN`. Do NOT use `SOURCEGRAPH_ENDPOINT` or `SOURCEGRAPH_TOKEN` -- those are wrong variable names.
79
80
80
81
### Harbor Result Format
81
82
- Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
82
83
-`trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
83
84
- SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
85
+
- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
86
+
- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
84
87
85
88
### Validation / Scoring
86
89
-`validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
87
90
- Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
88
91
- Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
89
92
- Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
93
+
- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
94
+
95
+
### Gitignore
96
+
- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
90
97
91
98
### Git / Auth
92
99
-`gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
-`st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
107
114
- Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
108
115
- Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
116
+
- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
117
+
- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
118
+
- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
109
119
110
120
### LLM Judge
111
121
- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
0 commit comments