Propagate --no-github flag to sub-agents via env var + prompt injection#789

Merged
osilkin98 merged 1 commit into
akashgit:mainfrom
RobotSail:factory/run-953e2477
Jun 26, 2026
Conversation

@RobotSail

Contributor

Closes #787

Changes

  • factory/cli.py: Set FACTORY_NO_GITHUB=1 env var in cmd_ceo() and cmd_run() when --no-github is passed, so the flag propagates to all subprocess environments
  • factory/agents/runner.py: In invoke_agent(), after resolving the prompt, check FACTORY_NO_GITHUB env var and append a "GitHub Disabled" directive instructing sub-agents to skip all gh CLI commands and GitHub operations
  • tests/test_agents.py: Added TestNoGithubPropagation test class with 4 tests verifying prompt injection when env var is set/unset and env var propagation behavior
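
The two mechanisms listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `apply_no_github_flag` and `inject_no_github` are hypothetical helper names standing in for the logic inside `cmd_ceo()`/`cmd_run()` and `invoke_agent()`, and the directive wording is a placeholder because the exact text is not quoted here.

```python
import os

# Placeholder wording: the PR does not quote the exact directive text.
GITHUB_DISABLED_DIRECTIVE = (
    "\n\n## GitHub Disabled\n"
    "Do not run `gh` CLI commands or perform any GitHub operations."
)


def apply_no_github_flag(no_github, env=os.environ):
    """Hypothetical stand-in for what cmd_ceo()/cmd_run() do with --no-github."""
    if no_github:
        # Child processes that inherit this environment will see the flag.
        env["FACTORY_NO_GITHUB"] = "1"


def inject_no_github(prompt, env=os.environ):
    """Hypothetical stand-in for the check in invoke_agent() after prompt resolution."""
    if env.get("FACTORY_NO_GITHUB") == "1":
        return prompt + GITHUB_DISABLED_DIRECTIVE
    return prompt
```

Passing a plain `dict` as `env` keeps the sketch testable without touching the real process environment.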

Set FACTORY_NO_GITHUB=1 env var in cmd_ceo() and cmd_run() when
--no-github is passed. In invoke_agent(), check this env var and append
a "GitHub Disabled" directive to the agent prompt, instructing sub-agents
to skip all gh CLI commands and GitHub operations.

Closes akashgit#787

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 25, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.79%. Comparing base (8463ba8) to head (733fb15).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #789   +/-   ##
=======================================
  Coverage   86.78%   86.79%           
=======================================
  Files          80       80           
  Lines       12134    12140    +6     
=======================================
+ Hits        10531    10537    +6     
  Misses       1603     1603           
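
The percentages in the diff above can be reproduced from the Hits and Lines rows. The sketch below assumes the figure is truncated, not rounded, to two decimals; that is inferred from the numbers in this report (rounding would give 86.79% and 86.80%) and is not a documented Codecov rule. `coverage_pct` is a hypothetical helper name.

```python
def coverage_pct(hits, lines):
    """Coverage as a percentage, truncated to two decimals (assumed behaviour)."""
    # Integer arithmetic avoids floating-point surprises before truncation.
    return (hits * 10000 // lines) / 100


base = coverage_pct(10531, 12134)  # base (8463ba8)
head = coverage_pct(10537, 12140)  # head (733fb15)
print(f"{base:.2f}% -> {head:.2f}%")
```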


@RobotSail

Contributor Author

✅ Factory Review: KEEP

Verdict: KEEP
Reason:


Posted by Factory CEO

@RobotSail

Contributor Author

CEO Review — Independent Assessment

PR Summary

This PR fixes #787 by propagating the --no-github flag to all sub-agents via two mechanisms:

  1. Env var propagation (factory/cli.py): Sets FACTORY_NO_GITHUB=1 in both cmd_ceo and cmd_run entry points so all child processes inherit it.
  2. Prompt injection (factory/agents/runner.py): After resolving the agent prompt in invoke_agent(), checks the env var and appends a ## GitHub Disabled directive — ensuring every agent (Builder, QA, Researcher, etc.) receives the instruction in its prompt without modifying individual .md prompt files.

What I Verified

  • Read the full diff: 3 files changed, all within expected scope
  • No modifications to agent prompt .md files (correct — injection at runner level is cleaner)
  • Env var check uses strict == "1" comparison, avoiding false positives from empty strings or "0"
  • The injection point is after resolve_prompt() and playbook injection, so it stacks correctly with existing prompt composition
  • Confirmed the Claude runner copies os.environ to subprocess env (claude.py:116), so the env var chain is unbroken from CLI → CEO → Builder
  • The existing CEO task string injection at cli.py:3501-3511 is kept as belt-and-suspenders — intentional redundancy
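
Two of the points above, the strict `== "1"` comparison and the environment being copied into the subprocess, can be illustrated with a small self-contained sketch. `github_disabled` and `child_sees_flag` are hypothetical names, and the child process here is just a Python one-liner, not the actual Claude runner.

```python
import os
import subprocess
import sys


def github_disabled(env):
    """Strict check: only the exact string "1" disables GitHub."""
    return env.get("FACTORY_NO_GITHUB") == "1"


def child_sees_flag(parent_env):
    """Spawn a child with a copy of parent_env and report the value it sees."""
    child_env = dict(parent_env)  # copy, as the review describes for the runner
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('FACTORY_NO_GITHUB', ''))"],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

With the strict comparison, an unset variable, an empty string, and `"0"` all leave GitHub enabled; a truthiness check such as `if env.get(...)` would treat `"0"` as set.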

QA Results

  • Tests: 2626 passed, 0 failed, 12 skipped
  • Lint: Clean
  • Type check: Clean
  • New tests: 4 added in TestNoGithubPropagation — positive/negative injection tests + env var behavior
  • Adversarial: 4/4 feature scenarios, 5/5 edge cases passed

Notes

  • Tests 3-4 (test_env_var_propagates_to_subprocess_env, test_env_var_absent_by_default) only exercise Python's os.environ semantics, not the actual cmd_ceo/cmd_run code paths. Not harmful, but a future improvement could test the CLI entry points directly.
  • The directive text in runner.py differs slightly from the CEO task text in cli.py:3503-3513 — could be unified for consistency but not required.
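
A test that goes through the CLI entry point, as the first note suggests, might look like the sketch below. The `cmd_run` defined here is a hypothetical stand-in so the example runs on its own; the real `factory.cli.cmd_run` signature and argument names may differ, and in the actual test suite it would be imported instead of redefined.

```python
import argparse
import os
from unittest import mock


def cmd_run(args):
    """Hypothetical stand-in for factory.cli.cmd_run (real signature may differ)."""
    if args.no_github:
        os.environ["FACTORY_NO_GITHUB"] = "1"


def test_cmd_run_sets_env_var():
    # patch.dict restores os.environ when the block exits.
    with mock.patch.dict(os.environ, {}, clear=True):
        cmd_run(argparse.Namespace(no_github=True))
        assert os.environ.get("FACTORY_NO_GITHUB") == "1"


def test_cmd_run_leaves_env_untouched_without_flag():
    with mock.patch.dict(os.environ, {}, clear=True):
        cmd_run(argparse.Namespace(no_github=False))
        assert "FACTORY_NO_GITHUB" not in os.environ
```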

Verdict: KEEP

@osilkin98 osilkin98 merged commit 9227f47 into akashgit:main Jun 26, 2026
6 of 7 checks passed
@github-actions


Benchmark Results

swebench

Field Value
Benchmark swebench
Instance sympy__sympy-20590
Result ✅ RESOLVED
Score 1
Duration 152s
Full JSON
{
  "benchmark": "swebench",
  "instance_id": "sympy__sympy-20590",
  "solver": "claude-code",
  "passed": 1,
  "total": 1,
  "score": 1,
  "resolved": true,
  "duration_seconds": 152,
  "status": "success",
  "timestamp": "20260626T005914Z",
  "details": {
    "solver": "claude-code",
    "cost_usd": 0.61376175,
    "input_tokens": 24,
    "output_tokens": 1977,
    "cache_read_tokens": 446961,
    "cache_creation_tokens": 53917
  }
}

featurebench

Field Value
Benchmark featurebench
Instance pypa__packaging.013f3b03.test_metadata.e00b5801.lv1
Result ❌ NOT RESOLVED
Score 0
Duration 1219s
Full JSON
{
  "benchmark": "featurebench",
  "instance_id": "pypa__packaging.013f3b03.test_metadata.e00b5801.lv1",
  "solver": "factory",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 1219,
  "status": "success",
  "timestamp": "20260626T005916Z",
  "details": {
    "pass_rate": 0,
    "solver": "factory",
    "cost_usd": 6.452506050000002,
    "input_tokens": 301,
    "output_tokens": 39428,
    "cache_read_tokens": 7080940,
    "cache_creation_tokens": 0
  }
}

programbench

Field Value
Benchmark programbench
Instance abishekvashok__cmatrix.5c082c6
Result ❌ NOT RESOLVED
Score 0
Duration 316s
Full JSON
{
  "benchmark": "programbench",
  "instance_id": "abishekvashok__cmatrix.5c082c6",
  "solver": "claude-code",
  "passed": 0,
  "total": 769,
  "score": 0,
  "resolved": false,
  "duration_seconds": 316,
  "status": "success",
  "timestamp": "20260626T005916Z",
  "details": {
    "solver": "claude-code",
    "cost_usd": 0.8677075000000001,
    "input_tokens": 32,
    "output_tokens": 5251,
    "cache_read_tokens": 938020,
    "cache_creation_tokens": 42762
  }
}

terminalbench

Field Value
Benchmark terminalbench
Instance fix-git
Result ✅ RESOLVED
Score 1
Duration 71s
Full JSON
{
  "benchmark": "terminalbench",
  "instance_id": "fix-git",
  "solver": "claude-code",
  "passed": 1,
  "total": 1,
  "score": 1,
  "resolved": true,
  "duration_seconds": 71,
  "status": "success",
  "timestamp": "20260626T005916Z",
  "details": {
    "solver": "claude-code",
    "cost_usd": 0.552441,
    "input_tokens": 15,
    "output_tokens": 1399,
    "cache_read_tokens": 0,
    "cache_creation_tokens": 0
  }
}

terminalbench

Field Value
Benchmark terminalbench
Instance fix-git
Result ✅ RESOLVED
Score 1
Duration 792s
Full JSON
{
  "benchmark": "terminalbench",
  "instance_id": "fix-git",
  "solver": "factory",
  "passed": 1,
  "total": 1,
  "score": 1,
  "resolved": true,
  "duration_seconds": 792,
  "status": "success",
  "timestamp": "20260626T005918Z",
  "details": {
    "solver": "factory",
    "cost_usd": 0,
    "input_tokens": 0,
    "output_tokens": 0,
    "cache_read_tokens": 0,
    "cache_creation_tokens": 0
  }
}

featurebench

Field Value
Benchmark featurebench
Instance pypa__packaging.013f3b03.test_metadata.e00b5801.lv1
Result ❌ NOT RESOLVED
Score 0
Duration 438s
Full JSON
{
  "benchmark": "featurebench",
  "instance_id": "pypa__packaging.013f3b03.test_metadata.e00b5801.lv1",
  "solver": "claude-code",
  "passed": 0,
  "total": 1,
  "score": 0,
  "resolved": false,
  "duration_seconds": 438,
  "status": "success",
  "timestamp": "20260626T005919Z",
  "details": {
    "pass_rate": 0,
    "solver": "claude-code",
    "cost_usd": 1.3952295,
    "input_tokens": 22,
    "output_tokens": 13221,
    "cache_read_tokens": 905434,
    "cache_creation_tokens": 88642
  }
}

swebench

Field Value
Benchmark swebench
Instance sympy__sympy-20590
Result ✅ RESOLVED
Score 1
Duration 2216s
Full JSON
{
  "benchmark": "swebench",
  "instance_id": "sympy__sympy-20590",
  "solver": "factory",
  "passed": 1,
  "total": 1,
  "score": 1,
  "resolved": true,
  "duration_seconds": 2216,
  "status": "success",
  "timestamp": "20260626T005919Z",
  "details": {
    "solver": "factory",
    "cost_usd": 14.172166800000003,
    "input_tokens": 652,
    "output_tokens": 92415,
    "cache_read_tokens": 15765593,
    "cache_creation_tokens": 0
  }
}

programbench

Field Value
Benchmark programbench
Instance abishekvashok__cmatrix.5c082c6
Result ❌ NOT RESOLVED
Score 0
Duration 1058s
Full JSON
{
  "benchmark": "programbench",
  "instance_id": "abishekvashok__cmatrix.5c082c6",
  "solver": "factory",
  "passed": 0,
  "total": 769,
  "score": 0,
  "resolved": false,
  "duration_seconds": 1058,
  "status": "success",
  "timestamp": "20260626T005920Z",
  "details": {
    "solver": "factory",
    "cost_usd": 6.039037199999999,
    "input_tokens": 221,
    "output_tokens": 31244,
    "cache_read_tokens": 6124069,
    "cache_creation_tokens": 0
  }
}

Overall: 50.0% accuracy (= +0.0% vs main) | $4.30 avg cost | 783s avg duration
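
The overall line can be recomputed from the eight JSON results above. One assumption is needed to match the reported average cost: the run with `cost_usd` of 0 (shown as N/A in the comparison table) is left out of the cost average. This is inferred from the numbers, not stated by the benchmark tooling; `summarize` is a hypothetical helper.

```python
# (benchmark, solver, resolved, cost_usd, duration_seconds) from the JSON above
RUNS = [
    ("swebench", "claude-code", True, 0.61376175, 152),
    ("featurebench", "factory", False, 6.452506050000002, 1219),
    ("programbench", "claude-code", False, 0.8677075000000001, 316),
    ("terminalbench", "claude-code", True, 0.552441, 71),
    ("terminalbench", "factory", True, 0.0, 792),
    ("featurebench", "claude-code", False, 1.3952295, 438),
    ("swebench", "factory", True, 14.172166800000003, 2216),
    ("programbench", "factory", False, 6.039037199999999, 1058),
]


def summarize(runs):
    """Return the overall summary string for a list of benchmark runs."""
    accuracy = 100 * sum(1 for r in runs if r[2]) / len(runs)
    # Assumption: a cost of 0 means "not recorded" and is excluded.
    costs = [r[3] for r in runs if r[3] > 0]
    avg_cost = sum(costs) / len(costs)
    avg_duration = sum(r[4] for r in runs) / len(runs)
    return (f"{accuracy:.1f}% accuracy | ${avg_cost:.2f} avg cost | "
            f"{avg_duration:.0f}s avg duration")


print(summarize(RUNS))
```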

Comparison vs Main

| Benchmark | Solver | Score | vs Main | Cost | vs Main | Duration | vs Main |
|---|---|---|---|---|---|---|---|
| swebench | claude-code | 1 | +0.0% = | $0.61 | = $0.00 | 152s | = 0s |
| featurebench | factory | 0 | = 0% | $6.45 | = $0.00 | 1219s | = 0s |
| programbench | claude-code | 0 | = 0% | $0.87 | = $0.00 | 316s | = 0s |
| terminalbench | claude-code | 1 | +0.0% = | $0.55 | = $0.00 | 71s | = 0s |
| terminalbench | factory | 1 | +0.0% = | N/A | N/A | 792s | = 0s |
| featurebench | claude-code | 0 | = 0% | $1.40 | = $0.00 | 438s | = 0s |
| swebench | factory | 1 | +0.0% = | $14.17 | = $0.00 | 2216s | = 0s |
| programbench | factory | 0 | = 0% | $6.04 | = $0.00 | 1058s | = 0s |

Baseline: latest main branch run per benchmark+solver. ▲ = improvement, ▼ = regression.

How these benchmarks run

Factory solver: Runs factory ceo . --headless --no-github --prompt <task> — full factory loop (research → strategize → build → review). See benchmarks/run-swebench.sh.

Claude Code solver: Runs claude -p <task> --model claude-opus-4-6[1m] --max-turns 200 — single-shot solve. Same script files as factory, switched via --solver flag.

TerminalBench: Uses Harbor framework. Factory runs via custom factory_harbor_agent.py, Claude Code uses Harbor's built-in agent.

ProgramBench: Both solvers run inside a Docker cleanroom container. See benchmarks/run-programbench.sh.

Config: claude-opus-4-6[1m], effort=XHIGH, thinking=128K tokens. See benchmarks/lib.sh.

Development

Successfully merging this pull request may close these issues.

fix --no-github

2 participants