Update website: enhanced bench suite, new trial findings

bkataru · claude · bkataru · commit 3c00eae9c956 · 2026-03-07T00:33:23.000Z
- Architecture: bench now measures 6 dimensions, supports JSON output + raw llama-bench metrics
- Projects/gilgamesh: updated bench tool usage examples
- Roadmap: mark 4B Q4_K_M trial and bench enhancements as complete

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/content/architecture/_index.md b/content/architecture/_index.md
@@ -142,13 +142,18 @@ POST /api/chat            → SSE stream of agent events:
 
 ## Benchmarking Infrastructure
 
-Gilgamesh includes a pure Go benchmark suite (`cmd/bench/`) for trialing local models. It measures five dimensions:
+Gilgamesh includes a pure Go benchmark suite (`cmd/bench/`) for trialing local models. It loads profiles from `gilgamesh.json`, integrates with llama-bench for raw inference metrics, and supports JSON output for historical tracking.
+
+It measures six dimensions:
 
 1. **Health check** &mdash; endpoint latency
-2. **Minimal prompt** &mdash; TTFT + generation speed (tok/s)
-3. **Tool call** &mdash; can the model emit valid tool calls?
-4. **One-shot** &mdash; end-to-end gilgamesh `run` response
-5. **Edit task** &mdash; full agent loop: create file + edit it
+2. **Raw inference** &mdash; llama-bench pp/tg tok/s (auto-detects binaries in `local-ai/bin/`)
+3. **Minimal prompt** &mdash; TTFT + generation speed via API
+4. **Tool call** &mdash; can the model emit valid tool calls?
+5. **One-shot** &mdash; end-to-end gilgamesh `run` response
+6. **Edit task** &mdash; full agent loop: create file + edit it
+
+The suite supports four flags: `-all` (compare all profiles), `-raw` (include raw llama-bench metrics), `-json` (machine-readable output), and `-save` (append results to a JSON log).
 
 Results and ongoing findings are tracked in [`TRIALS.md`](https://github.com/godsfromthemachine/gilgamesh/blob/main/TRIALS.md). The quest: find the optimal model + quantization + inference parameters for a responsive, reliable, tool-calling agent running entirely on CPU.
 
diff --git a/content/projects/gilgamesh.md b/content/projects/gilgamesh.md
@@ -111,15 +111,17 @@ At 181 tok/s prompt processing (Qwen3.5-2B Q4_K_M, 16 threads), the first respon
 
 ## Benchmarking &amp; Model Trials
 
-Gilgamesh includes a pure Go benchmark tool for trialing local models:
+Gilgamesh includes a pure Go benchmark suite for trialing local models. It loads profiles from config, integrates with llama-bench, and supports JSON output for historical tracking:
 
 ```bash
-go run ./cmd/bench              # benchmark default endpoint
-go run ./cmd/bench -all         # benchmark all reachable endpoints
-go run ./cmd/bench -model heavy # benchmark specific profile
+go run ./cmd/bench              # benchmark active profile from config
+go run ./cmd/bench -all         # benchmark all profiles + summary table
+go run ./cmd/bench -raw         # include raw llama-bench pp/tg metrics
+go run ./cmd/bench -json        # JSON output for scripting
+go run ./cmd/bench -save r.json # append to JSON log for tracking
 ```
 
-Measures health latency, prompt speed (TTFT + tok/s), tool call parsing, one-shot agent response, and full edit task quality. Results are tracked in [`TRIALS.md`](https://github.com/godsfromthemachine/gilgamesh/blob/main/TRIALS.md).
+It measures six dimensions: health latency, raw inference (pp/tg tok/s), minimal prompt, tool call parsing, one-shot agent response, and full edit task. Results are tracked in [`TRIALS.md`](https://github.com/godsfromthemachine/gilgamesh/blob/main/TRIALS.md).
 
 ### Key Findings
 
diff --git a/content/roadmap/_index.md b/content/roadmap/_index.md
@@ -95,7 +95,10 @@ description: "Project milestones, phases, and future plans"
     <li class="done">Go benchmark tool (cmd/bench/main.go)</li>
     <li class="done">Baseline benchmarks: Qwen3.5 2B Q4_K_M, 4B Q8_0</li>
     <li class="done">Key findings documented (2B sweet spot, 0.8B rejected)</li>
-    <li class="todo">Trial Qwen3.5-4B Q4_K_M &mdash; faster 4B option</li>
+    <li class="done">Enhanced bench suite: config loading, raw llama-bench, JSON output, result persistence</li>
+    <li class="done">4B Q4_K_M raw trial &mdash; same speed as Q8_0, saves 1.6GB disk</li>
+    <li class="done">Full -all comparison run: 2B vs 4B with summary table</li>
+    <li class="todo">4B Q4_K_M agent benchmarks &mdash; tool calling reliability vs Q8_0</li>
     <li class="todo">Trial IQ4_XS / IQ3_M quants &mdash; smaller memory footprint</li>
     <li class="todo">Context length and thread count tuning</li>
     <li class="todo">New model families &mdash; Phi-4, Gemma 3, others</li>