fix: replace ghost agent roles (evaluator/reviewer) #802

Merged
gx-ai-architect merged 4 commits into main from fix/ghost-agent-roles
Jun 26, 2026

@gx-ai-architect gx-ai-architect commented Jun 26, 2026

Collaborator

Summary

Fixes 4 issues found during expected-behavior doc work — all prompt-level changes, no code.

Fixes: #794, #795, #797, #799

Changes

  1. Replace factory agent evaluator → factory eval "$PROJECT_PATH" in 5 SKILL.md files (Build, Design, Improve, Research, Review)

    • The evaluator role does not exist in AgentRole — CLI crashes with error: invalid choice: 'evaluator'
    • Trace evidence (trace bc24771b): CEO already runs factory eval directly, never invokes evaluator
  2. Replace factory agent reviewer → factory agent qa in Refine SKILL.md

    • The reviewer role does not exist — same crash
    • Task description is clearly meant for the QA agent (3-section pipeline)

Evidence

  • CLI crash: factory agent evaluator --task "test" --project /tmp → exit code 2
  • Code: AgentRole = Literal["researcher", "strategist", "builder", "qa", "archivist", "ceo", "failure_analyst", "refiner", "profiler"] — no evaluator, no reviewer
  • Trace bc24771b: CEO invoked factory agent qa 3 times, factory eval 3 times, factory agent evaluator 0 times
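
The crash signature above (an `invalid choice` message and exit code 2) matches how Python's argparse rejects a value outside a `choices` list. A minimal sketch, assuming (not confirmed in this PR) that the `factory agent` subcommand derives its choices from `AgentRole`; the parser layout below is hypothetical:

```python
import argparse
from typing import Literal, get_args

# Copied from the Evidence section; note there is no "evaluator" or "reviewer".
AgentRole = Literal[
    "researcher", "strategist", "builder", "qa", "archivist",
    "ceo", "failure_analyst", "refiner", "profiler",
]


def build_parser():
    """Hypothetical stand-in for the real CLI parser."""
    parser = argparse.ArgumentParser(prog="factory")
    sub = parser.add_subparsers(dest="command", required=True)
    agent = sub.add_parser("agent")
    agent.add_argument("role", choices=get_args(AgentRole))
    agent.add_argument("--task", required=True)
    agent.add_argument("--project", required=True)
    return parser


def exit_code_for(argv):
    """Return the exit status argparse would produce for argv (0 on success)."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        return exc.code
    return 0
```

Under these assumptions, `exit_code_for(["agent", "evaluator", "--task", "test", "--project", "/tmp"])` reports the same status as the CLI crash, while a valid role such as `qa` parses cleanly.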

Test plan

  • grep -r "factory agent evaluator" skills/ returns no results
  • grep -r "factory agent reviewer" skills/ returns no results
  • grep -r "archivist" skills/workflow-discover/SKILL.md returns the new archival step
  • grep -r "archivist" skills/workflow-review/SKILL.md returns the new archival step

🤖 Generated with Claude Code

…rchival

- Replace factory agent evaluator → factory eval in 5 SKILL.md files
- Replace factory agent reviewer → factory agent qa in Refine workflow
- Add archivist to Discover and Review workflows (Sacred Rule 7)

Fixes #794, #795, #797, #799

github-actions Bot commented Jun 26, 2026

Sentrux Quality Report

Absolute

Scanning ....
[scan] git ls-files: 311 total, 299 kept, 12 dropped (ext:12, meta:0, big:0)
[build_project_map] 299 files, 53 unique dirs, 49 cache misses, 2.3ms
[resolve] 439 resolved, 770 unresolved (of 1209 total specs)
[resolve_imports] project_map 2.4ms, suffix_idx 0.7ms, suffix_resolve 9.8ms, total 12.9ms
[build_graphs] 299 files | maps 0.8ms, imports 13.0ms, calls+inherit 3.5ms, total 17.3ms | 438 import, 4376 call, 0 inherit edges
sentrux check — 2 rules checked

Quality: 4702

✗ [Error] max_cc: 5 function(s) exceed max cyclomatic complexity of 30
    factory/cli.py:cmd_ceo (cc=79)
    factory/study.py:study_project_local (cc=43)
    factory/cli.py:_welcome_wizard (cc=39)
    factory/cli.py:cmd_run (cc=37)
    factory/workflow/validation.py:validate_workflow (cc=31)

✗ 1 violation(s) found

Diff (vs base branch)

Scanning ....
[scan] git ls-files: 311 total, 299 kept, 12 dropped (ext:12, meta:0, big:0)
[build_project_map] 299 files, 53 unique dirs, 49 cache misses, 2.4ms
[resolve] 439 resolved, 770 unresolved (of 1209 total specs)
[resolve_imports] project_map 2.5ms, suffix_idx 0.7ms, suffix_resolve 10.0ms, total 13.2ms
[build_graphs] 299 files | maps 1.2ms, imports 13.2ms, calls+inherit 3.3ms, total 17.6ms | 438 import, 4376 call, 0 inherit edges
sentrux gate — structural regression check

Quality:      4702 -> 4702
Coupling:     0.75 → 0.75
Cycles:       4 → 4
God files:    0 → 0

Distance from Main Sequence: 0.35

✓ No degradation detected

@gx-ai-architect gx-ai-architect self-assigned this Jun 26, 2026
@gx-ai-architect

Collaborator Author

Trace-Based Test Results

Test 1: Interactive Build (calculator CLI)

Trace ID: 6dd2610198eb72920cc2b9212ca3fe3a
Langfuse: http://localhost:3000 → trace 6dd261...
Branch: fix/ghost-agent-roles
Duration: 9m 35s

Agent call sequence:

1. factory agent researcher   (parallel x3 — similar, techstack, pitfalls)
2. factory agent strategist   (phased build plan)
3. factory agent builder      (implemented calculator)
4. factory agent qa           (14 adversarial checks, all passed)

Ghost role check:

  • factory agent evaluator calls: 0
  • factory agent reviewer calls: 0

Result: Build completed successfully. QA passed all 14 checks (basic ops, division by zero, invalid input, float arithmetic, negative numbers).


Test 2: Headless Build (greeting CLI)

Trace ID: dc3b9aeab025e0b63440a435598ef174 (archivist final)
Branch: fix/ghost-agent-roles with --headless --no-github
Duration: ~28 min (includes 3 CEO respawns + improve cycle)

Full agent event sequence from .factory/events.jsonl:

17:42:55  agent.started   ceo
17:43:35  agent.started   researcher (x3 parallel)
17:45:54  agent.completed researcher (all 3)
17:46:34  agent.started   strategist
17:47:49  agent.completed strategist
17:48:14  agent.started   archivist (fire-and-forget)
17:48:27  agent.started   builder
17:50:04  agent.completed builder
17:50:45  agent.started   qa
17:51:21  agent.completed qa
17:51:58  agent.started   archivist
17:52:37  agent.completed ceo (session 1)
17:52:38  agent.started   ceo (respawn)
17:53:44  experiment.begin
17:53:50  experiment.finalize
17:54:30  agent.started   archivist
17:55:28  agent.completed ceo (session 2)
17:55:55  agent.started   ceo (respawn 2)
17:58:09  experiment.begin
17:58:20  agent.started   builder
17:59:03  agent.completed builder
17:59:39  agent.started   qa
18:00:20  agent.completed qa
18:00:28  experiment.finalize
18:00:38  experiment.begin
18:00:49  agent.started   builder
18:01:27  agent.completed builder
18:01:41  agent.started   qa
18:02:36  agent.completed qa
18:02:52  experiment.finalize
18:03:06  agent.started   qa (E2E)
18:04:51  Build complete   (2/2 phases KEEP, all evals 1.0)
18:11:18  agent.completed ceo (final)

Ghost role check across ALL 5 CEO sessions:

  • factory agent evaluator calls: 0
  • factory agent reviewer calls: 0
  • All agent invocations use valid roles: researcher, strategist, builder, qa, archivist
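
The ghost role check above amounts to comparing every role in the event sequence against `AgentRole`. A minimal sketch, assuming the summarized `time  event  role` lines shown above as input; the actual schema of .factory/events.jsonl is not shown in this thread, so this does not parse that file directly, and the helper names are hypothetical:

```python
# Role names copied from the AgentRole definition quoted in the PR description.
VALID_ROLES = {
    "researcher", "strategist", "builder", "qa", "archivist",
    "ceo", "failure_analyst", "refiner", "profiler",
}


def roles_in_summary(summary):
    """Collect the role from each 'agent.started' / 'agent.completed' line."""
    roles = []
    for line in summary.splitlines():
        parts = line.split()
        # Expected shape: "<HH:MM:SS> agent.started <role> [notes...]"
        if len(parts) >= 3 and parts[1] in ("agent.started", "agent.completed"):
            roles.append(parts[2])
    return roles


def invalid_roles(summary):
    """Return the sorted set of roles that are not in VALID_ROLES."""
    return sorted(set(roles_in_summary(summary)) - VALID_ROLES)
```

Lines without a role, such as `experiment.begin`, are skipped; an empty result means every agent event used a valid role.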

Result: Headless build completed. 2 phases built and kept. E2E verified. Archival completed.

@gx-ai-architect

Collaborator Author

@ceo-review

@gx-ai-architect

Collaborator Author

I'm not sure this archivist is needed here:

`Add archivist to Discover and Review workflows (Sacred Rule 7)

These were the only 2 workflows without archival steps
Both generate artifacts worth preserving (eval profile, factory.md, baseline scores)
`

@gx-ai-architect

Collaborator Author

Self-Correction: Test Relevance

Honest assessment — my Build mode tests don't properly validate this PR's changes:

| Change | Test coverage | Gap |
|---|---|---|
| evaluator → factory eval (5 SKILL.md files) | ⚠️ Build mode only | CEO already compensated pre-fix: "0 evaluator calls" was true before AND after. Need Improve/Research mode to see if the SKILL.md eval instruction is actually followed differently |
| reviewer → qa (Refine SKILL.md) | ❌ Not tested | Need a Refine mode test (factory ceo --refine "...") |
| Add archivist to Discover/Review | ❌ Not tested | Need Discover + Review mode tests |

On the archivist question

@gx-ai-architect You're right to question this. Sacred Rule 7 says "Archivist must fire after each verdict" — but Discover and Review modes don't have verdicts. They're one-shot setup operations (generate eval profile, create factory.md), not experiment cycles. Adding archivist here is arguably scope creep.

Proposal: I'll split the archivist addition (Change 3) out of this PR. This PR should only contain the ghost role fixes (Changes 1 & 2), which are unambiguous bugfixes. The archivist question can be a separate discussion.

What I should test instead

  1. Refine mode: run factory ceo --refine "add a docstring to the main function" on an existing project → verify factory agent qa is called (not reviewer)
  2. Improve mode: run an improve cycle → verify the eval step uses factory eval (not factory agent evaluator)
  3. Discover mode: run on a project without factory setup → verify factory discover works (no archivist change needed)

Will run these targeted tests and update.

… role fixes only

Split out the archivist additions per review feedback. Discover and Review
modes don't have verdicts, so Sacred Rule 7 ("after each verdict") doesn't
strictly apply. The archivist question is a separate discussion.

This PR now only contains:
- evaluator → factory eval in Build/Design/Improve/Research/Review
- reviewer → qa in Refine
@gx-ai-architect

Collaborator Author

ok, not split into another PR, but archivist should not be in the discover mode nor review mode

@gx-ai-architect

Collaborator Author

wait, there's a verdict in the QA agent, right? those are reviews and e2e experiment runs. archivist should probably fire after qa agent?

@gx-ai-architect

Collaborator Author

Re: Archivist and QA Verdicts

Good question — tracing through the verdict flow:

Workflows with QA → archivist already fires correctly:

Builder → QA (verdict: CLEAN/ISSUES_FOUND/REVERT) → CEO reads QA → factory finalize → Archivist fires ✅

This chain exists in Improve, Research, Refine, and Build. The archivist fires after factory finalize in each.

Discover/Review — no QA, no verdicts:

  • Discover runs factory discover (generates eval dimensions) — no experiment, no QA, no verdict
  • Review runs factory eval + creates factory.md — no experiment, no QA, no verdict
  • No natural archivist trigger point in either

Conclusion: Archivist is already correctly wired after QA in all workflows that have QA. Discover/Review have no QA, so the archivist removal from this PR is correct.
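
The reasoning above can be restated as a small lookup. A minimal sketch that uses only the workflow names and steps mentioned in this thread; the `PIPELINES` table and the `archivist_fires` helper are hypothetical illustrations, not the factory's actual wiring:

```python
# Steps per workflow as described in this comment (illustrative, not real config).
VERDICT_CHAIN = ["builder", "qa", "ceo reads qa", "factory finalize", "archivist"]

PIPELINES = {
    "build": VERDICT_CHAIN,
    "improve": VERDICT_CHAIN,
    "research": VERDICT_CHAIN,
    "refine": VERDICT_CHAIN,
    "discover": ["factory discover"],
    "review": ["factory eval", "create factory.md"],
}


def has_verdict(workflow):
    """A workflow produces a verdict only if it contains a QA step."""
    return "qa" in PIPELINES[workflow]


def archivist_fires(workflow):
    """Archivist runs only in workflows with a verdict, after factory finalize."""
    steps = PIPELINES[workflow]
    if not has_verdict(workflow) or "archivist" not in steps:
        return False
    return steps.index("archivist") > steps.index("factory finalize")
```

With this table, the four QA-bearing workflows report an archivist step and Discover/Review do not, which is the split the conclusion describes.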

Current PR scope (after removing archivist additions):

  1. evaluator → factory eval in Build/Design/Improve/Research/Review SKILL.md
  2. reviewer → qa in Refine SKILL.md

Targeted Improve + Refine mode tests running now — will post results when complete.

@github-actions github-actions Bot left a comment

❌ Factory Review: REVERT

Verdict: REVERT
Reason: QA Agent found test suite regression: test_qa_delegation.py fails because PR introduces standalone factory eval calls that violate existing test invariant requiring eval to be wrapped in agent tasks. Test must be updated or eval calls must use valid agent roles.

Score Comparison

Metric Value
Before 0.7102
After 0.7102
Delta +0.0000
Threshold 0.8000

Guard Checks

Check Result
tests ❌ FAIL
lint ✅ PASS
type_check ✅ PASS

Precheck Gate

Score delta: +0.0000 (0.7102 → 0.7102), threshold: 0.6000, anti-pattern: none

Code Review Notes

  • Critical (tests/test_qa_delegation.py:94): test_workflow_skills_delegate_eval_to_agents fails because standalone factory eval calls violate the test invariant
  • Important (all SKILL.md files): 4/4 target issues (#794, #795, #797, #799) addressed correctly, but the test file was not updated

Posted by Factory CEO

The test_qa_delegation test requires eval calls in SKILL.md to be
wrapped in agent tasks, not standalone. Changed factory eval to
factory agent qa with eval-focused task descriptions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gx-ai-architect

Collaborator Author

@ceo-review

codecov Bot commented Jun 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.12%. Comparing base (0fb5f7d) to head (c724721).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #802   +/-   ##
=======================================
  Coverage   87.12%   87.12%           
=======================================
  Files          81       81           
  Lines       12209    12209           
=======================================
  Hits        10637    10637           
  Misses       1572     1572           


factory eval runs 13 deterministic Python checks — no LLM agent needed.
Reverted SKILL.md eval calls from factory agent qa back to standalone
factory eval. Updated test to check for invalid ghost roles instead of
requiring eval-in-agent wrapping.
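
A minimal sketch of what a ghost-role test of this kind could look like. The actual contents of tests/test_qa_delegation.py are not shown in this thread, so the regex, the `skills` directory default, and the function names below are assumptions:

```python
import re
from pathlib import Path
from typing import Literal, get_args

# Copied from the AgentRole definition quoted in the PR description.
AgentRole = Literal[
    "researcher", "strategist", "builder", "qa", "archivist",
    "ceo", "failure_analyst", "refiner", "profiler",
]

# Matches "factory agent <word>". Prose such as "factory agent roles" would
# also match, so a real test may need a stricter pattern.
AGENT_CALL = re.compile(r"factory agent ([A-Za-z_]+)")


def ghost_roles_in(text):
    """Return the sorted role names invoked in text that AgentRole lacks."""
    valid = set(get_args(AgentRole))
    return sorted({role for role in AGENT_CALL.findall(text) if role not in valid})


def test_skills_use_only_valid_agent_roles(skills_dir="skills"):
    """Fail if any SKILL.md under skills_dir invokes an unknown agent role."""
    for skill in Path(skills_dir).rglob("SKILL.md"):
        ghosts = ghost_roles_in(skill.read_text(encoding="utf-8"))
        assert not ghosts, f"{skill} invokes unknown roles: {ghosts}"
```

A check shaped like this accepts standalone `factory eval` calls, since it only inspects the role that follows `factory agent`.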

@github-actions github-actions Bot left a comment

✅ Factory Review: KEEP

Verdict: KEEP
Reason: Fixes 3/4 claimed issues by replacing ghost agent roles (evaluator→factory eval, reviewer→qa). Prevents CLI crashes. Issue #797 (archivist) not addressed but non-blocking.

Precheck Gate

Prompt-only change. No source code modified. Guard/threshold checks N/A.

Code Review Notes

  • Correct AgentRole replacements
  • Test suite passes (8/8)
  • Scope gap: Issue #797 not fixed (no archivist additions to Discover/Review)
  • Misleading PR description claims Fixes #797
  • Recommend updating PR description to remove #797

Posted by Factory CEO

Repository owner deleted a comment from github-actions Bot Jun 26, 2026
@gx-ai-architect gx-ai-architect merged commit 525e5b9 into main Jun 26, 2026
7 checks passed
@gx-ai-architect gx-ai-architect changed the title from "fix: replace ghost agent roles (evaluator/reviewer) and add missing archival" to "fix: replace ghost agent roles (evaluator/reviewer)" Jun 26, 2026
@github-actions

Benchmark Results

programbench

Field Value
Benchmark programbench
Instance abishekvashok__cmatrix.5c082c6
Result ❌ NOT RESOLVED
Score 0
Duration 493s
Full JSON
{
  "benchmark": "programbench",
  "instance_id": "abishekvashok__cmatrix.5c082c6",
  "solver": "claude-code",
  "passed": 0,
  "total": 769,
  "score": 0,
  "resolved": false,
  "duration_seconds": 493,
  "status": "success",
  "timestamp": "20260626T201122Z",
  "details": {
    "solver": "claude-code",
    "cost_usd": 1.8127465000000003,
    "input_tokens": 39,
    "output_tokens": 5509,
    "cache_read_tokens": 1011765,
    "cache_creation_tokens": 51700
  }
}

swebench

Field Value
Benchmark swebench
Instance sympy__sympy-20590
Result ✅ RESOLVED
Score 1
Duration 142s
Full JSON
{
  "benchmark": "swebench",
  "instance_id": "sympy__sympy-20590",
  "solver": "claude-code",
  "passed": 1,
  "total": 1,
  "score": 1,
  "resolved": true,
  "duration_seconds": 142,
  "status": "success",
  "timestamp": "20260626T201123Z",
  "details": {
    "solver": "claude-code",
    "cost_usd": 0.9764285000000003,
    "input_tokens": 25,
    "output_tokens": 1927,
    "cache_read_tokens": 418097,
    "cache_creation_tokens": 114452
  }
}

featurebench

Field Value
Benchmark featurebench
Instance pypa__packaging.013f3b03.test_metadata.e00b5801.lv1
Result ✅ RESOLVED
Score 1
Duration 254s
Full JSON
{
  "benchmark": "featurebench",
  "instance_id": "pypa__packaging.013f3b03.test_metadata.e00b5801.lv1",
  "solver": "claude-code",
  "passed": 1,
  "total": 1,
  "score": 1,
  "resolved": true,
  "duration_seconds": 254,
  "status": "success",
  "timestamp": "20260626T201124Z",
  "details": {
    "pass_rate": 1,
    "solver": "claude-code",
    "cost_usd": 1.1641859999999997,
    "input_tokens": 24,
    "output_tokens": 2481,
    "cache_read_tokens": 990677,
    "cache_creation_tokens": 87814
  }
}

terminalbench

Field Value
Benchmark terminalbench
Instance fix-git
Result ✅ RESOLVED
Score 1
Duration 563s
Full JSON
{
  "benchmark": "terminalbench",
  "instance_id": "fix-git",
  "solver": "factory",
  "passed": 1,
  "total": 1,
  "score": 1,
  "resolved": true,
  "duration_seconds": 563,
  "status": "success",
  "timestamp": "20260626T201124Z",
  "details": {
    "solver": "factory",
    "cost_usd": 0,
    "input_tokens": 0,
    "output_tokens": 0,
    "cache_read_tokens": 0,
    "cache_creation_tokens": 0
  }
}

terminalbench

Field Value
Benchmark terminalbench
Instance fix-git
Result ✅ RESOLVED
Score 1
Duration 80s
Full JSON
{
  "benchmark": "terminalbench",
  "instance_id": "fix-git",
  "solver": "claude-code",
  "passed": 1,
  "total": 1,
  "score": 1,
  "resolved": true,
  "duration_seconds": 80,
  "status": "success",
  "timestamp": "20260626T201125Z",
  "details": {
    "solver": "claude-code",
    "cost_usd": 0.5114402499999999,
    "input_tokens": 17,
    "output_tokens": 1239,
    "cache_read_tokens": 0,
    "cache_creation_tokens": 0
  }
}

swebench

Field Value
Benchmark swebench
Instance sympy__sympy-20590
Result ✅ RESOLVED
Score 1
Duration 1058s
Full JSON
{
  "benchmark": "swebench",
  "instance_id": "sympy__sympy-20590",
  "solver": "factory",
  "passed": 1,
  "total": 1,
  "score": 1,
  "resolved": true,
  "duration_seconds": 1058,
  "status": "success",
  "timestamp": "20260626T201126Z",
  "details": {
    "solver": "factory",
    "cost_usd": 7.2493247499999995,
    "input_tokens": 505,
    "output_tokens": 39672,
    "cache_read_tokens": 7054712,
    "cache_creation_tokens": 0
  }
}

featurebench

Field Value
Benchmark featurebench
Instance pypa__packaging.013f3b03.test_metadata.e00b5801.lv1
Result ❌ NOT RESOLVED
Score 0
Duration 1442s
Full JSON
{
  "benchmark": "featurebench",
  "instance_id": "pypa__packaging.013f3b03.test_metadata.e00b5801.lv1",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 1442,
  "status": "success",
  "timestamp": "20260626T201129Z",
  "details": {
    "pass_rate": 0,
    "solver": "factory",
    "cost_usd": 8.5757477,
    "input_tokens": 355,
    "output_tokens": 54114,
    "cache_read_tokens": 8163998,
    "cache_creation_tokens": 0
  }
}

programbench

Field Value
Benchmark programbench
Instance abishekvashok__cmatrix.5c082c6
Result ❌ NOT RESOLVED
Score 0
Duration 3560s
Full JSON
{
  "benchmark": "programbench",
  "instance_id": "abishekvashok__cmatrix.5c082c6",
  "solver": "factory",
  "passed": 0,
  "total": 769,
  "score": 0,
  "resolved": false,
  "duration_seconds": 3560,
  "status": "success",
  "timestamp": "20260626T201131Z",
  "details": {
    "solver": "factory",
    "cost_usd": 20.745066050000002,
    "input_tokens": 7160,
    "output_tokens": 116522,
    "cache_read_tokens": 17531606,
    "cache_creation_tokens": 0
  }
}

Overall: 62.5% accuracy (= +0.0% vs main) | $5.86 avg cost | 949s avg duration
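
The summary line can be reproduced from the eight JSON results above. A minimal sketch; the report does not state how the averages are computed, so one inference is made here: the cost average matches the figure only when the terminalbench/factory run (reported cost 0, shown as N/A in the comparison) is excluded, while accuracy and duration use all eight runs.

```python
# (benchmark, solver, resolved, cost_usd, duration_seconds) from the JSON above.
RESULTS = [
    ("programbench", "claude-code", False, 1.8127465000000003, 493),
    ("swebench", "claude-code", True, 0.9764285000000003, 142),
    ("featurebench", "claude-code", True, 1.1641859999999997, 254),
    ("terminalbench", "factory", True, 0.0, 563),
    ("terminalbench", "claude-code", True, 0.5114402499999999, 80),
    ("swebench", "factory", True, 7.2493247499999995, 1058),
    ("featurebench", "factory", False, 8.5757477, 1442),
    ("programbench", "factory", False, 20.745066050000002, 3560),
]


def summarize(results):
    """Rebuild the 'Overall' line: accuracy, average cost, average duration."""
    accuracy = sum(1 for r in results if r[2]) / len(results)
    # Assumption: runs with no recorded cost are left out of the cost average.
    costs = [r[3] for r in results if r[3] > 0]
    avg_cost = sum(costs) / len(costs)
    avg_duration = sum(r[4] for r in results) / len(results)
    return f"{accuracy:.1%} accuracy | ${avg_cost:.2f} avg cost | {avg_duration:.0f}s avg duration"


print(summarize(RESULTS))
# prints: 62.5% accuracy | $5.86 avg cost | 949s avg duration
```

Five of eight runs resolved (62.5%), the seven costed runs total about $41.03, and the durations total 7592s.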

Comparison vs Main

| Benchmark | Solver | Score | vs Main | Cost | vs Main | Duration | vs Main |
|---|---|---|---|---|---|---|---|
| programbench | claude-code | 0 | = 0% | $1.81 | = $0.00 | 493s | = 0s |
| swebench | claude-code | 1 | +0.0% = | $0.98 | = $0.00 | 142s | = 0s |
| featurebench | claude-code | 1 | +0.0% = | $1.16 | = $0.00 | 254s | = 0s |
| terminalbench | factory | 1 | +0.0% = | N/A | N/A | 563s | = 0s |
| terminalbench | claude-code | 1 | +0.0% = | $0.51 | = $0.00 | 80s | = 0s |
| swebench | factory | 1 | +0.0% = | $7.25 | = $0.00 | 1058s | = 0s |
| featurebench | factory | 0 | = 0% | $8.58 | = $0.00 | 1442s | = 0s |
| programbench | factory | 0 | = 0% | $20.75 | = $0.00 | 3560s | = 0s |

Baseline: latest main branch run per benchmark+solver. ▲ = improvement, ▼ = regression.

How these benchmarks run

Factory solver: Runs factory ceo . --headless --no-github --prompt <task> — full factory loop (research → strategize → build → review). See benchmarks/run-swebench.sh.

Claude Code solver: Runs claude -p <task> --model claude-opus-4-6[1m] --max-turns 200 — single-shot solve. Same script files as factory, switched via --solver flag.
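
The two solver invocations above differ only in the command line. A minimal sketch of that switch in Python; the real harness is the shell script referenced above and is not shown here, so `build_command` is a hypothetical helper and the solver names are taken from the `"solver"` field of the JSON results:

```python
import shlex


def build_command(solver, task):
    """Return the argv list for one benchmark task, per the descriptions above."""
    if solver == "factory":
        # Full factory loop: research → strategize → build → review.
        return ["factory", "ceo", ".", "--headless", "--no-github", "--prompt", task]
    if solver == "claude-code":
        # Single-shot solve with the model and turn limit quoted above.
        return ["claude", "-p", task, "--model", "claude-opus-4-6[1m]", "--max-turns", "200"]
    raise ValueError(f"unknown solver: {solver!r}")


if __name__ == "__main__":
    for name in ("factory", "claude-code"):
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(shlex.join(build_command(name, "fix the failing test")))
```

Returning a list rather than a string keeps the task text as a single argument, which avoids quoting problems if the list is later passed to `subprocess.run`.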

TerminalBench: Uses Harbor framework. Factory runs via custom factory_harbor_agent.py, Claude Code uses Harbor's built-in agent.

ProgramBench: Both solvers run inside a Docker cleanroom container. See benchmarks/run-programbench.sh.

Config: claude-opus-4-6[1m], effort=XHIGH, thinking=128K tokens. See benchmarks/lib.sh.
