Commit 3c00eae

bkataru and claude committed
Update website: enhanced bench suite, new trial findings

- Architecture: bench now measures 6 dimensions, supports JSON + raw llama-bench
- Projects/gilgamesh: updated bench tool usage examples
- Roadmap: mark 4B Q4_K_M trial and bench enhancements as complete

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a9de20d commit 3c00eae

3 files changed

Lines changed: 21 additions & 11 deletions

content/architecture/_index.md

Lines changed: 10 additions & 5 deletions
@@ -142,13 +142,18 @@ POST /api/chat → SSE stream of agent events:
 
 ## Benchmarking Infrastructure
 
-Gilgamesh includes a pure Go benchmark suite (`cmd/bench/`) for trialing local models. It measures five dimensions:
+Gilgamesh includes a pure Go benchmark suite (`cmd/bench/`) for trialing local models. It loads profiles from `gilgamesh.json`, integrates with llama-bench for raw inference metrics, and supports JSON output for historical tracking.
+
+It measures six dimensions:
 
 1. **Health check** &mdash; endpoint latency
-2. **Minimal prompt** &mdash; TTFT + generation speed (tok/s)
-3. **Tool call** &mdash; can the model emit valid tool calls?
-4. **One-shot** &mdash; end-to-end gilgamesh `run` response
-5. **Edit task** &mdash; full agent loop: create file + edit it
+2. **Raw inference** &mdash; llama-bench pp/tg tok/s (auto-detects binaries in `local-ai/bin/`)
+3. **Minimal prompt** &mdash; TTFT + generation speed via API
+4. **Tool call** &mdash; can the model emit valid tool calls?
+5. **One-shot** &mdash; end-to-end gilgamesh `run` response
+6. **Edit task** &mdash; full agent loop: create file + edit it
+
+Supports `-all` (compare all profiles), `-raw` (raw llama-bench), `-json` (machine-readable), `-save` (append to JSON log).
 
 Results and ongoing findings are tracked in [`TRIALS.md`](https://github.com/godsfromthemachine/gilgamesh/blob/main/TRIALS.md). The quest: find the optimal model + quantization + inference parameters for a responsive, reliable, tool-calling agent running entirely on CPU.
 

content/projects/gilgamesh.md

Lines changed: 7 additions & 5 deletions
@@ -111,15 +111,17 @@ At 181 tok/s prompt processing (Qwen3.5-2B Q4_K_M, 16 threads), the first respon
 
 ## Benchmarking &amp; Model Trials
 
-Gilgamesh includes a pure Go benchmark tool for trialing local models:
+Gilgamesh includes a pure Go benchmark suite for trialing local models. It loads profiles from config, integrates with llama-bench, and supports JSON output for historical tracking:
 
 ```bash
-go run ./cmd/bench              # benchmark default endpoint
-go run ./cmd/bench -all         # benchmark all reachable endpoints
-go run ./cmd/bench -model heavy # benchmark specific profile
+go run ./cmd/bench              # benchmark active profile from config
+go run ./cmd/bench -all         # benchmark all profiles + summary table
+go run ./cmd/bench -raw         # include raw llama-bench pp/tg metrics
+go run ./cmd/bench -json        # JSON output for scripting
+go run ./cmd/bench -save r.json # append to JSON log for tracking
 ```
 
-Measures health latency, prompt speed (TTFT + tok/s), tool call parsing, one-shot agent response, and full edit task quality. Results are tracked in [`TRIALS.md`](https://github.com/godsfromthemachine/gilgamesh/blob/main/TRIALS.md).
+Measures 6 dimensions: health, raw inference (pp/tg tok/s), minimal prompt, tool call parsing, one-shot agent, and full edit task. Results are tracked in [`TRIALS.md`](https://github.com/godsfromthemachine/gilgamesh/blob/main/TRIALS.md).
 
 ### Key Findings
 
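The four flags in the usage block above map naturally onto Go's standard `flag` package. The following is a minimal sketch of how such a command line could be parsed, not the actual `cmd/bench` source: the `benchOptions` struct and `parseBenchFlags` function are hypothetical names, and only the flag names (`-all`, `-raw`, `-json`, `-save`) come from the documentation.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// benchOptions mirrors the four documented flags.
// Hypothetical struct: the real cmd/bench may organize this differently.
type benchOptions struct {
	All      bool   // -all: benchmark every profile
	Raw      bool   // -raw: include raw llama-bench metrics
	JSON     bool   // -json: machine-readable output
	SavePath string // -save: JSON log file to append to
}

// parseBenchFlags parses args (without the program name) into benchOptions.
func parseBenchFlags(args []string) (benchOptions, error) {
	var o benchOptions
	fs := flag.NewFlagSet("bench", flag.ContinueOnError)
	fs.BoolVar(&o.All, "all", false, "benchmark all profiles")
	fs.BoolVar(&o.Raw, "raw", false, "include raw llama-bench pp/tg metrics")
	fs.BoolVar(&o.JSON, "json", false, "emit JSON output")
	fs.StringVar(&o.SavePath, "save", "", "append results to this JSON log")
	if err := fs.Parse(args); err != nil {
		return benchOptions{}, err
	}
	return o, nil
}

func main() {
	opts, err := parseBenchFlags(os.Args[1:])
	if err != nil {
		os.Exit(2) // the flag package has already printed the error
	}
	fmt.Printf("%+v\n", opts)
}
```

Using a dedicated `flag.FlagSet` with `flag.ContinueOnError` instead of the package-level flags keeps the parser callable from tests without exiting the process.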

content/roadmap/_index.md

Lines changed: 4 additions & 1 deletion
@@ -95,7 +95,10 @@ description: "Project milestones, phases, and future plans"
 <li class="done">Go benchmark tool (cmd/bench/main.go)</li>
 <li class="done">Baseline benchmarks: Qwen3.5 2B Q4_K_M, 4B Q8_0</li>
 <li class="done">Key findings documented (2B sweet spot, 0.8B rejected)</li>
-<li class="todo">Trial Qwen3.5-4B Q4_K_M &mdash; faster 4B option</li>
+<li class="done">Enhanced bench suite: config loading, raw llama-bench, JSON output, result persistence</li>
+<li class="done">4B Q4_K_M raw trial &mdash; same speed as Q8_0, saves 1.6GB disk</li>
+<li class="done">Full -all comparison run: 2B vs 4B with summary table</li>
+<li class="todo">4B Q4_K_M agent benchmarks &mdash; tool calling reliability vs Q8_0</li>
 <li class="todo">Trial IQ4_XS / IQ3_M quants &mdash; smaller memory footprint</li>
 <li class="todo">Context length and thread count tuning</li>
 <li class="todo">New model families &mdash; Phi-4, Gemma 3, others</li>
