bench: record Phase 0 run for task-001-tags (both apps PASS)#82

Closed
suleimansh wants to merge 1 commit into main from bench/phase-0-task-001-run

@suleimansh

Member

First end-to-end Phase 0 run of the "our AI vs Next.js" benchmark (#75, #78).

What this is

The actual measurement run on top of the merged Phase 0 harness. Not new product code, just recorded results plus the agents' solution patches. The baseline apps stay pristine (the Phase 0 starting point); only benchmarks/runs/ is added.

Method

  • One autonomous AI coding agent per app, same agent type and model both sides (fair-harness rule), same task spec verbatim, no human in the loop.
  • Harness validated first: both committed baselines fail the gate identically (same 5 checks) -> a real symmetric gap.
  • Independently re-verified by hand against fresh DBs + restarted servers after the agents reported PASS.
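The harness-validation step above (both baselines failing the same checks) could be expressed as a small comparison. This is only a sketch: the `CheckResult` shape and both function names are hypothetical, since the PR does not show the harness's actual output format.

```typescript
// Hypothetical shape of one gate check result; the real harness format is not shown in this PR.
type CheckResult = { id: string; passed: boolean };

// Return the sorted ids of the checks that failed.
function failedChecks(results: CheckResult[]): string[] {
  return results
    .filter((r) => !r.passed)
    .map((r) => r.id)
    .sort();
}

// The gap is symmetric when both baselines fail at least one check
// and the two sets of failed check ids are identical.
function isSymmetricGap(a: CheckResult[], b: CheckResult[]): boolean {
  const fa = failedChecks(a);
  const fb = failedChecks(b);
  return (
    fa.length > 0 &&
    fa.length === fb.length &&
    fa.every((id, i) => id === fb[i])
  );
}
```

Comparing failed check ids rather than just failure counts matters here: two baselines that each fail 5 checks, but different ones, would not give both agents the same starting gap.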

Results

| App | Status | Time | Interventions |
| --- | --- | --- | --- |
| bench-app-next (vanilla Next.js) | PASS | 139s | 0 |
| bench-app-gemstack (Vike + @gemstack/ai-sdk) | PASS | 145s | 0 |

All 14 contract checks pass on both. Diffs comparable (~130-150 lines each).
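To put the two timings in proportion, here is a minimal sketch of the arithmetic. The `RunRecord` interface and its field names are hypothetical and not taken from results.md; the values are the ones reported above.

```typescript
// Hypothetical record shape; field names are illustrative, not taken from results.md.
interface RunRecord {
  app: string;
  status: "PASS" | "FAIL";
  seconds: number;
  interventions: number;
}

// Values copied from the results reported in this PR.
const nextRun: RunRecord = { app: "bench-app-next", status: "PASS", seconds: 139, interventions: 0 };
const gemstackRun: RunRecord = { app: "bench-app-gemstack", status: "PASS", seconds: 145, interventions: 0 };

// Time difference of `other` relative to `base`, as a percentage of `base`.
function relativeDeltaPct(base: RunRecord, other: RunRecord): number {
  return ((other.seconds - base.seconds) / base.seconds) * 100;
}

const deltaPct = relativeDeltaPct(nextRun, gemstackRun);
console.log(`${gemstackRun.app} vs ${nextRun.app}: +${deltaPct.toFixed(1)}%`);
// prints "bench-app-gemstack vs bench-app-next: +4.3%"
```

The gap is 6 seconds on a 139-second base. With a single run per app, this sketch says nothing about run-to-run variance.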

Reading

The rubric runs end to end and the gate is objective and reproducible, so Phase 0's goal is met. A plain CRUD-extension task does not exercise the orchestration layer, so it does not differentiate here (both fast, zero interventions). The intervention metric will only show signal on AI-integration / multi-step / refactor tasks. Phase 1 (#79) should weight the task set toward those.

See benchmarks/runs/2026-06-28-task-001-tags/results.md for the full log and reproduce steps.

First end-to-end Phase 0 run of the AI benchmark (#75). Both baselines
fail the gate identically; both agents reach PASS autonomously with zero
interventions (Next 139s, GemStack 145s). Independently re-verified against
fresh DBs. Records results.md plus the agents' solution patches; the baseline
apps stay pristine.
@suleimansh added the enhancement (New feature or request) and priority: medium (Worth doing, not urgent) labels Jun 28, 2026
@suleimansh self-assigned this Jun 28, 2026
@suleimansh

Member Author

Not merging. The benchmark is relocating to its real home, suleimansh/vike-data, where the extension family it now measures (vike-auth, vike-data, vike-notifications, vike-stripe) actually lives. This Phase 0 run proved the method (the harness and rubric work; a plain CRUD task doesn't differentiate), and that learning carries over to the vike-data benchmark. Closing here rather than landing an AI-only run log on a repo the benchmark is leaving.

@suleimansh suleimansh closed this Jun 28, 2026