bench: record Phase 0 run for task-001-tags (both apps PASS) by suleimansh · Pull Request #82 · gemstack-land/gemstack

suleimansh · 2026-06-28T13:34:26Z

First end-to-end Phase 0 run of the "our AI vs Next.js" benchmark (#75, #78).

What this is

The actual measurement run on top of the merged Phase 0 harness. Not new product code, just recorded results plus the agents' solution patches. The baseline apps stay pristine (the Phase 0 starting point); only benchmarks/runs/ is added.

Method

- One autonomous AI coding agent per app: same agent type and model on both sides (fair-harness rule), same task spec verbatim, no human in the loop.
- Harness validated first: both committed baselines fail the gate identically (same 5 checks), so the gap is real and symmetric.
- After the agents reported PASS, the results were independently re-verified by hand against fresh DBs and restarted servers.
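
The "both baselines fail the gate identically" validation can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the `GateReport` shape, the function names, and the check names in the sample data are all assumptions made up for the example.

```typescript
// Hypothetical shape of one gate run: check name -> whether it passed.
type GateReport = Record<string, boolean>;

// Names of the checks that failed, sorted so two reports compare order-independently.
function failingChecks(report: GateReport): string[] {
  return Object.entries(report)
    .filter(([, passed]) => !passed)
    .map(([name]) => name)
    .sort();
}

// True when both baselines fail at least one check and fail exactly the same set.
function isSymmetricGap(a: GateReport, b: GateReport): boolean {
  const fa = failingChecks(a);
  const fb = failingChecks(b);
  return fa.length > 0 && fa.length === fb.length && fa.every((name, i) => name === fb[i]);
}

// Illustrative data only: these check names are invented, not the real contract checks.
const nextBaseline: GateReport = { "tags-create": false, "tags-list": false, "health": true };
const gemstackBaseline: GateReport = { "tags-list": false, "tags-create": false, "health": true };

console.log(isSymmetricGap(nextBaseline, gemstackBaseline)); // true
```

Comparing the sets of failing check names, rather than only the failure counts, is what makes the gap symmetric: two baselines that each fail 5 different checks would not be starting from the same point.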

Results

| App | Status | Time | Interventions |
| --- | --- | --- | --- |
| bench-app-next (vanilla Next.js) | PASS | 139s | 0 |
| bench-app-gemstack (Vike + @gemstack/ai-sdk) | PASS | 145s | 0 |

All 14 contract checks pass on both. Diffs comparable (~130-150 lines each).

Reading

The rubric runs end to end and the gate is objective and reproducible, so Phase 0's goal is met. A plain CRUD-extension task does not exercise the orchestration layer, so it does not differentiate here (both fast, zero interventions). The intervention metric will only show signal on AI-integration / multi-step / refactor tasks. Phase 1 (#79) should weight the task set toward those.

See benchmarks/runs/2026-06-28-task-001-tags/results.md for the full log and reproduce steps.

First end-to-end Phase 0 run of the AI benchmark (#75). Both baselines fail the gate identically; both agents reach PASS autonomously with zero interventions (Next 139s, GemStack 145s). Independently re-verified against fresh DBs. Records results.md plus the agents' solution patches; the baseline apps stay pristine.

suleimansh · 2026-06-28T14:03:21Z

Not merging. The benchmark is relocating to its real home, suleimansh/vike-data, where the extension family it now measures (vike-auth, vike-data, vike-notifications, vike-stripe) actually lives. This Phase 0 run proved the method (harness + rubric work; a plain CRUD task doesn't differentiate) — that learning carries over to the vike-data benchmark. Closing here rather than landing an AI-only run log on a repo the benchmark is leaving.

suleimansh added the enhancement and priority: medium labels Jun 28, 2026

suleimansh self-assigned this Jun 28, 2026

suleimansh closed this Jun 28, 2026

suleimansh wants to merge 1 commit into main from bench/phase-0-task-001-run
