## Objective

Integrate high-value patterns identified from researching affaan-m/everything-claude-code and openai/codex-plugin-cc into HiveSpec's delivery lifecycle skills, with evals proving each change improves agent behavior.

Full research: agentevals-research HIVESPEC_INTEGRATION.md

## Design Latitude

Documentation-only changes (.md files) — no code, no infrastructure, no new dependencies.

## Changes
### 1. Structured review output schema

**Files to edit**:
- skills/hs-implement/references/spec-reviewer-prompt.md
- skills/hs-implement/references/code-quality-reviewer-prompt.md

**What to add**: A required output-format section instructing reviewers to return findings as a structured table:
```markdown
| # | Severity | File:Line | Finding | Recommendation |
|---|----------|-----------|---------|----------------|
| 1 | high | src/foo.ts:42 | Missing null check on user input | Add guard clause |

**Verdict**: approve | needs-attention
**Next steps**: [specific actions if needs-attention]
```
**Why**: Reviewers currently return prose, making it hard for the parent agent to systematically track whether all findings were addressed before proceeding to hs-ship. Structured output makes findings enumerable and trackable.

**Also update**: skills/hs-verify/SKILL.md Step 4 — add a note that the parent agent should confirm every "needs-attention" finding was resolved before moving to Step 5.
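As a sketch of why this format is trackable: a parent agent can recover the findings and verdict with a few lines of parsing. The `parse_review` helper and `Finding` shape below are illustrative, not part of any HiveSpec API.

```python
import re
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str
    location: str        # file:line
    finding: str
    recommendation: str

def parse_review(markdown: str) -> tuple[list[Finding], str]:
    """Extract table rows and the final verdict from a reviewer response."""
    findings = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have five cells and start with a numeric index,
        # which skips the header and separator rows.
        if len(cells) == 5 and cells[0].isdigit():
            findings.append(Finding(cells[1], cells[2], cells[3], cells[4]))
    m = re.search(r"\*\*Verdict\*\*:\s*(approve|needs-attention)", markdown)
    verdict = m.group(1) if m else "needs-attention"  # fail closed if absent
    return findings, verdict
```

With this, "every needs-attention finding was resolved" becomes an iteration over `findings` rather than a re-read of prose.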
### 2. Subagent output contract

**Files to edit**:
- skills/hs-implement/references/implementer-prompt.md

**What to add**: A required output contract section:

```markdown
## Output Contract

Every response must end with:

**Status**: DONE | DONE_WITH_CONCERNS | NEEDS_CONTEXT | BLOCKED

**Files changed**:
- `path/to/file.ts` — one-line description

**Tests added/modified**:
- `path/to/test.ts` — what it covers

**Unresolved concerns**:
- Any shared types/interfaces modified that other subagents might also touch
- Anything noticed but out of scope for this task
```
**Also update**: skills/hs-implement/SKILL.md Subagent Review Protocol — add an integration safety check: after collecting outputs from parallel subagents, verify that no two subagents modified the same shared type/interface. If they did, reconcile before committing.

**Why**: hs-implement currently defines subagent status codes (DONE, DONE_WITH_CONCERNS, etc.) but doesn't require structured output listing files and concerns. The parent agent has to infer what changed from prose, making integration conflicts between parallel subagents invisible until tests break.
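The integration safety check above reduces to a mechanical pass over each subagent's "Files changed" list: flag any path reported by more than one agent. The `find_conflicts` name and the report shape are illustrative, not a HiveSpec API.

```python
from collections import defaultdict

def find_conflicts(reports: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each file touched by 2+ subagents to the subagents that touched it.

    `reports` maps a subagent name to the paths listed under its
    "Files changed" section.
    """
    touched: dict[str, list[str]] = defaultdict(list)
    for agent, files in reports.items():
        for path in files:
            touched[path].append(agent)
    # Only paths claimed by more than one subagent need reconciliation.
    return {path: agents for path, agents in touched.items() if len(agents) > 1}
```

An empty result means the parallel outputs can be committed as-is; a non-empty one names exactly which files to reconcile first.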
### 3. Confidence-based retro accumulation

**Files to edit**:
- skills/hs-retro/SKILL.md

**What to add**: A new Step 3b "Check recurrence" between the current Step 3 (Classify) and Step 4 (Apply fixes):
```markdown
### Step 3b: Check recurrence

Before applying a fix, assess whether this is a one-off or a pattern:

1. **First occurrence**: Apply the fix only if it's clearly a systematic gap (not a user preference or one-off situation). Otherwise, note it as a candidate for future confirmation.
2. **Recurs across 2+ sessions**: Apply with high confidence — this is a systematic gap.
3. **Recurs across 3+ sessions AND across different repos**: Consider promoting the fix from project-scoped to HiveSpec core skills.

Single-session over-fitting is how skills accumulate noise. Require recurrence before structural changes.
```
**Why**: hs-retro currently treats every session independently — a one-time user preference gets the same treatment as a pattern that appears in every session. This over-fits skills to single-session noise.
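The three recurrence tiers above collapse into a small decision function. A sketch — the name, signature, and return labels are illustrative, not part of hs-retro:

```python
def recurrence_action(sessions: int, repos: int, clearly_systematic: bool) -> str:
    """Decide what hs-retro should do with an observed issue.

    sessions: how many distinct sessions the issue has appeared in
    repos:    how many distinct repos it has appeared in
    clearly_systematic: first-occurrence override for obvious gaps
    """
    if sessions >= 3 and repos >= 2:
        # Recurs across 3+ sessions AND different repos: promote to core.
        return "promote-to-core"
    if sessions >= 2:
        # Recurs across 2+ sessions: apply with high confidence.
        return "apply-fix"
    # First occurrence: apply only if clearly systematic, else defer.
    return "apply-fix" if clearly_systematic else "mark-candidate"
```

The point of encoding it this way is that the default path for a first occurrence is "mark-candidate", not "edit the skill".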
### 4. Reliability check for non-deterministic features

**Files to edit**:
- skills/hs-verify/SKILL.md

**What to add**: A new Step 2b after Step 2 (E2E red/green protocol):
```markdown
### Step 2b: Reliability check (non-deterministic features only)

If the feature involves LLM calls, external APIs, or historically flaky tests:

1. Run the relevant test suite **3 times** (pass^3 protocol)
2. All 3 must pass — any failure means the feature is unreliable, not "flaky"
3. For LLM-dependent features: use at least 2 different representative inputs

Skip this step for purely deterministic code paths (config parsing, data transformations, etc.) — one green run is sufficient.
```
**Why**: A single passing test run doesn't prove reliability for non-deterministic features. pass^3 catches intermittent failures that a single run misses, while the deterministic carve-out avoids wasting time re-running stable code.
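The pass^3 protocol is simple to automate as a subprocess loop that stops at the first failure. `pass_n` is a hypothetical helper for illustration, not something HiveSpec ships:

```python
import subprocess

def pass_n(test_cmd: list[str], runs: int = 3) -> bool:
    """Run a test command `runs` times; succeed only if every run passes.

    Any non-zero exit on any run means the feature is unreliable —
    there is no "flaky but acceptable" outcome.
    """
    for attempt in range(1, runs + 1):
        result = subprocess.run(test_cmd, capture_output=True)
        if result.returncode != 0:
            print(f"run {attempt}/{runs} failed — feature is unreliable")
            return False
    return True
```

For deterministic paths the carve-out is just `pass_n(cmd, runs=1)`.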
## Evals

Each change needs A/B evals in hivespec-evals proving that the updated skill produces measurably better agent behavior than the current skill. Use AgentV EVAL.yaml format.
### Eval 1: Structured review output

| Test ID | What It Checks | Key Assertions |
|---------|----------------|----------------|
| review-output-is-structured | Reviewer returns tabular findings with severity, file:line, verdict | contains: "\| Severity \|", contains-any: ["approve", "needs-attention"], rubrics: findings have file refs + are actionable |
| review-findings-trackable | Parent agent can enumerate findings and check resolution status per finding | rubrics: all findings enumerated, each gets resolved/unresolved status |
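The deterministic assertion kinds in these tables (contains, contains-any, contains-all) are plain string checks; rubric assertions would be scored by an LLM judge and are out of scope here. A sketch assuming nothing about the actual AgentV implementation:

```python
# Deterministic assertion kinds from the eval tables. Function names
# mirror the table vocabulary, not any real AgentV API.
def contains(output: str, needle: str) -> bool:
    return needle in output

def contains_any(output: str, needles: list[str]) -> bool:
    return any(n in output for n in needles)

def contains_all(output: str, needles: list[str]) -> bool:
    return all(n in output for n in needles)

def eval_review_output(output: str) -> bool:
    """Deterministic half of the review-output-is-structured test."""
    return contains(output, "| Severity |") and contains_any(
        output, ["approve", "needs-attention"]
    )
```

Keeping the string checks separate from the rubrics makes the cheap, reproducible part of each eval runnable without a judge model.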
### Eval 2: Subagent output contract

| Test ID | Key Assertions |
|---------|----------------|
| subagent-returns-contract | contains-all: ["Status:", "Files changed:", "Tests added"], rubrics: status is valid enum, concerns surfaced |
| integration-conflict-detected | rubrics: conflict identified, resolution proposed |

### Eval 3: Reliability check

| Test ID | Key Assertions |
|---------|----------------|
| nondeterministic-gets-multiple-runs | rubrics: multiple runs, recognizes nondeterminism |
| deterministic-gets-single-run | rubrics: single run sufficient, no unnecessary reruns |

### Eval 4: Retro accumulation

| Test ID | Key Assertions |
|---------|----------------|
| first-occurrence-marked-candidate | rubrics: candidate marked, no premature skill edit |
| recurring-pattern-applied | rubrics: recurrence recognized, fix applied citing recurrence |

### Running evals
**Key metrics**: `pass_rate` delta, `token_usage` delta, per-rubric score improvements.

## Acceptance Signals

## Non-Goals