bench: record Phase 0 run for task-001-tags (both apps PASS) #82
Closed
suleimansh wants to merge 1 commit into
First end-to-end Phase 0 run of the AI benchmark (#75). Both baselines fail the gate identically; both agents reach PASS autonomously with zero interventions (Next 139s, GemStack 145s). Independently re-verified against fresh DBs. Records results.md plus the agents' solution patches; the baseline apps stay pristine.
Not merging. The benchmark is relocating to its real home, suleimansh/vike-data, where the extension family it now measures (vike-auth, vike-data, vike-notifications, vike-stripe) actually lives. This Phase 0 run proved the method: the harness and rubric work, and a plain CRUD task doesn't differentiate. That learning carries over to the vike-data benchmark. Closing here rather than landing an AI-only run log on a repo the benchmark is leaving.
First end-to-end Phase 0 run of the "our AI vs Next.js" benchmark (#75, #78).
What this is
The actual measurement run on top of the merged Phase 0 harness. Not new product code, just recorded results plus the agents' solution patches. The baseline apps stay pristine (the Phase 0 starting point); only benchmarks/runs/ is added.
Method
Results
All 14 contract checks pass on both apps. The diffs are comparable in size (~130-150 lines each).
Reading
The rubric runs end to end and the gate is objective and reproducible, so Phase 0's goal is met. A plain CRUD-extension task does not exercise the orchestration layer, so it does not differentiate the two apps here (both were fast, with zero interventions). The intervention metric will only show signal on AI-integration, multi-step, or refactor tasks. Phase 1 (#79) should weight the task set toward those.
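To make the "does not differentiate" reading concrete, here is a minimal sketch of how a run record could be gated and compared. The `RunRecord` shape and the `gate` and `differentiates` helpers are hypothetical names for illustration, not the harness's actual schema or API; the numbers are the ones reported in this PR (14 contract checks, zero interventions, 139s and 145s).

```typescript
// Hypothetical record shape for illustration; not the harness's real schema.
interface RunRecord {
  app: string;
  checksPassed: number;
  checksTotal: number;
  interventions: number;
  seconds: number;
}

// Gate (assumed rule): PASS only when every contract check passes.
function gate(run: RunRecord): "PASS" | "FAIL" {
  return run.checksPassed === run.checksTotal ? "PASS" : "FAIL";
}

// A task separates two runs only if the gate result or the
// intervention count differs; wall-clock time is ignored here.
function differentiates(a: RunRecord, b: RunRecord): boolean {
  return gate(a) !== gate(b) || a.interventions !== b.interventions;
}

// Values as reported in this PR.
const next: RunRecord = {
  app: "Next", checksPassed: 14, checksTotal: 14, interventions: 0, seconds: 139,
};
const gemstack: RunRecord = {
  app: "GemStack", checksPassed: 14, checksTotal: 14, interventions: 0, seconds: 145,
};

console.log(gate(next), gate(gemstack));      // PASS PASS
console.log(differentiates(next, gemstack));  // false
```

Under these assumptions, both runs pass the gate with identical intervention counts, so the comparison returns `false`: the task yields no signal beyond a small timing difference.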
See benchmarks/runs/2026-06-28-task-001-tags/results.md for the full log and reproduction steps.