CLAUDE.md: 2 additions & 2 deletions
@@ -107,7 +107,7 @@ The script generates an OAuth URL — open it in your local browser, log in, pas
 python3 scripts/generate_manifest.py
 
 # Generate evaluation report
-python3 scripts/generate_report.py
+python3 scripts/generate_eval_report.py
 
 # Select benchmark tasks
 python3 scripts/select_benchmark_tasks.py
@@ -163,7 +163,7 @@ MAINTENANCE
 |-------|--------|---------|
 | `/compare-configs` | `scripts/compare_configs.py` | Show divergent tasks across baseline/SG_full, "MCP helps" vs "MCP hurts". Now includes optional MCP-conditioned analysis. |
 | `/cost-report` | `scripts/cost_report.py` | Token usage and estimated cost by suite/config, most expensive tasks |
-| `/generate-report` | `scripts/generate_report.py` | Aggregate CCB evaluation report from completed runs |
+| `/generate-report` | `scripts/generate_eval_report.py` | Aggregate CCB evaluation report from completed runs |
 | `/evaluate-traces` | `scripts/audit_traces.py` | Comprehensive trace evaluation: data integrity, output quality, efficiency analysis. Includes zero-MCP vs used-MCP classification. |
 | `/mcp-audit` | `scripts/mcp_audit.py` | MCP usage patterns: used vs zero-MCP, intensity buckets, reward/time deltas conditioned on actual MCP adoption |

 All runners support `--baseline-only` and `--full-only` flags.
 
 **LinuxFLBench note:** Docker image build is slow (~10 min) due to Linux kernel partial clone (~2GB). Pre-build images before running to save time.
 
-**DependEval note:** DependEval tasks use `--path` mode with local task directories. There is no unified `dependeval_3config.sh` yet; tasks are tracked via `configs/dependeval_selected_instances.json`.
+**DependEval note:** DependEval runs use local task directories and are handled by `configs/dependeval_2config.sh`.
 
 Requires [Harbor](https://github.com/laude-institute/harbor/tree/main) installed and configured with a Claude API key.
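A rename like `scripts/generate_report.py` to `scripts/generate_eval_report.py` is easy to leave half-applied across docs. As a minimal sketch, the helper below reports which lines of a text still mention the old path. The function name `find_stale_references` and the sample text are hypothetical and not part of the repository; only the two script paths come from the diff.

```python
from typing import List

# Old path removed by this diff; the new path is scripts/generate_eval_report.py.
OLD_PATH = "scripts/generate_report.py"


def find_stale_references(text: str, old_path: str = OLD_PATH) -> List[int]:
    """Return 1-based line numbers in `text` that still mention `old_path`."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if old_path in line
    ]


if __name__ == "__main__":
    # Hypothetical sample: line 2 still points at the old script.
    sample = "\n".join([
        "python3 scripts/generate_manifest.py",
        "python3 scripts/generate_report.py",
        "python3 scripts/generate_eval_report.py",
    ])
    print(find_stale_references(sample))
```

A plain substring check is enough here because the new path does not contain the old one as a substring, so updated lines are not flagged.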