[aw-failures] Two daily workflows fail at agent start: Documentation Healer (effort-param 400) & Model Inventory Checker (BYOK auth) #37039

@github-actions

Recommendation

Two daily scheduled workflows fail at agent start, before doing any work, because of engine/model configuration. Fix both: drop the unsupported effort parameter for Documentation Healer, and restore BYOK auth token plumbing for Model Inventory Checker. Neither is a network/firewall issue (audit-diff shows 0 new domains, 0 anomalies vs the last green run).

This sub-issue root-causes the shallow auto-notifier issues #37010 and #37014.

Cluster A — Daily Documentation Healer: effort parameter rejected (fresh regression)

Fix: remove or guard the effort parameter for the Claude small-agent model (or select a model that supports effort).

  • Affected run: 26986947133 (schedule, 2026-06-05T00:01Z) · notifier issue [aw] Daily Documentation Healer failed #37010
  • Regression signal: 7 consecutive prior days of success, first failure today. Run history, newest first: failure, success, success, success, success, success, success, success.
  • Dominant error (all 4 attempts, ~0s each):
API Error: 400 This model does not support the effort parameter.
  • Engine = Claude Code, model = small-agent. The --continue retry path then hit Error: No deferred tool marker found in the resumed session (isNoDeferredMarkerError=true) and --continue was disabled permanently. Agent produced 0 real turns / 0 tokens.
  • audit-diff (base success [26921451398] vs failure 26986947133): new_domain_count=0, anomaly_count=0, run2_turns=1 — regression is isolated to the agent engine invocation, not networking.
  • Failure class: config-error.
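
The regression signal above (one failure preceded by 7 green runs) can be checked mechanically. This is a minimal sketch, assuming the run history is available as a newest-first list of conclusion strings; the function name and the `min_green` threshold are illustrative and not part of the investigator tooling.

```python
from typing import Sequence


def is_fresh_regression(history: Sequence[str], min_green: int = 7) -> bool:
    """Return True when the newest run failed and the min_green runs before it succeeded.

    history is newest-first, e.g. ["failure", "success", "success", ...].
    """
    if not history or history[0] != "failure":
        return False
    prior = history[1:1 + min_green]
    # Require a full window of prior runs, all green.
    return len(prior) == min_green and all(c == "success" for c in prior)


# History reported for Documentation Healer: 1 failure, then 7 successes.
healer_history = ["failure"] + ["success"] * 7
print(is_fresh_regression(healer_history))
```

A history that starts with two failures would not count as fresh under this definition, which is the distinction between Cluster A ("fresh") and Cluster B ("persistent").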

Cluster B — Daily Model Inventory Checker: BYOK auth missing (persistent regression)

Fix: restore BYOK token plumbing so the Copilot SDK driver receives valid auth (COPILOT_GITHUB_TOKEN / GH_TOKEN / GITHUB_TOKEN present in the agent env).

[Error: Execution failed: Error: Session was not created with authentication info or custom provider]
  • The Copilot SDK BYOK driver sample (.github/drivers/copilot_sdk_driver_sample_node.cjs) threw an uncaught promise rejection on attempt 1. Harness flagged isAuthError=true: "no authentication information found — not retrying (COPILOT_GITHUB_TOKEN, GH_TOKEN, and GITHUB_TOKEN are all absent or invalid)". The entrypoint unset COPILOT_GITHUB_TOKEN and COPILOT_PROVIDER_API_KEY just before the driver spawned.
  • All upstream collect_* setup jobs succeeded; failure isolated to the agent driver; detection/safe_outputs skipped. 0 tokens.
  • Failure class: config-error.
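
A preflight check along these lines could surface the missing-token condition before the driver is spawned. This is a sketch under stated assumptions: the three variable names come from the harness message above, but the lookup order and the helper name are hypothetical, and it only tests for presence, not for token validity.

```python
from typing import Mapping, Optional

# Variable names taken from the harness message; the precedence order is an assumption.
TOKEN_VARS = ("COPILOT_GITHUB_TOKEN", "GH_TOKEN", "GITHUB_TOKEN")


def first_auth_token_var(env: Mapping[str, str]) -> Optional[str]:
    """Return the name of the first non-empty token variable, or None if all are absent."""
    for name in TOKEN_VARS:
        if env.get(name, "").strip():
            return name
    return None


# Simulates the failing run: the entrypoint unset the token before the driver started.
agent_env = {"PATH": "/usr/bin"}
if first_auth_token_var(agent_env) is None:
    print("no auth token in agent env; driver session would be created without authentication")
```

In a real job the argument would be `os.environ`, evaluated at the point where the driver process is launched, since the report shows the variables being unset just before that step.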

Affected workflows and run IDs

| Cluster | Workflow | Run | Notifier issue | Failure class |
| --- | --- | --- | --- | --- |
| A | Daily Documentation Healer | 26986947133 | #37010 | config-error (effort param) |
| B | Daily Model Inventory Checker | 26987484745 | #37014 | config-error (BYOK auth) |

Success criteria / verification

  • Cluster A: Documentation Healer agent phase executes (>0 turns) and the run completes success; no 400 ... effort parameter error.
  • Cluster B: Model Inventory Checker driver creates a session; no Session was not created with authentication info error; auth tokens present in the agent env.
  • Both workflows green for at least 2 consecutive scheduled runs.
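
The "green for at least 2 consecutive scheduled runs" criterion can be expressed as a streak count. A minimal sketch, assuming each workflow's scheduled-run conclusions are supplied newest-first; the function names and the sample histories are illustrative only, not real run data.

```python
from typing import Dict, Sequence


def green_streak(history: Sequence[str]) -> int:
    """Count consecutive successes starting from the newest run."""
    streak = 0
    for conclusion in history:
        if conclusion != "success":
            break
        streak += 1
    return streak


def criteria_met(histories: Dict[str, Sequence[str]], required: int = 2) -> bool:
    """True when every workflow has at least `required` consecutive green runs."""
    return all(green_streak(h) >= required for h in histories.values())


# Hypothetical post-fix histories, newest first.
sample = {
    "Daily Documentation Healer": ["success", "success", "failure"],
    "Daily Model Inventory Checker": ["success", "failure", "failure"],
}
print(criteria_met(sample))
```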

Parent: #37005 · Root-causes #37010, #37014 · Analyzed run IDs: 26986947133, 26987484745, 26921451398 · Window: last 6h ending ~2026-06-05T01:34Z.
Related to #37005

Generated by 🔍 [aw] Failure Investigator (6h) · opus48 20.2M · 1.3K AIC ·

  • expires on Jun 12, 2026, 2:35 AM UTC

Update 2026-06-05 ~13:46Z — Cluster A (effort param) scope expansion: now hits the agent/default variant, third workflow affected

Fresh evidence from the last-6h sweep shows the 400 ... effort parameter failure is not specific to the Claude small-agent model as originally scoped. It also fired on the default agent variant, on a workflow not previously listed here.

New affected run

  1. Daily Go Function Namer — run 27014847510 (schedule, 2026-06-05T12:28Z).
  2. Engine = claude; experiment model_size selected variant agent (confirmed: ANTHROPIC_MODEL: agent, experiment.model_size=agent) — i.e. the large/default group (sonnet-6x, gpt-5.4, gpt-5.3, gemini-pro, any), not small-agent.
  3. Dominant error on all 3 retry attempts (~0.4s each), identical to Cluster A:
API Error: 400 This model does not support the effort parameter.
  4. Agent produced 1 turn / 0 tokens / $0; --continue retry path again hit the no-deferred-marker condition and was disabled permanently (same harness path as Documentation Healer).
  5. audit-diff (base last-green [26951925565] vs failure 27014847510): new_domain_count=0, status_change_count=0, anomaly_count=0, has_anomalies=false. The regression is isolated to the agent engine invocation, not networking/firewall.
  6. Regression signal: 7 consecutive prior green days, first failure today. This matches Documentation Healer's onset pattern (first failure 2026-06-05T00:01Z).
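
The audit-diff reasoning used above (all network-related counters at zero, so the regression is not networking) can be sketched as a small predicate. The four field names are the ones quoted in this report; the flat-dictionary shape and the function name are assumptions, and the real audit-diff output may be structured differently.

```python
from typing import Any, Mapping


def has_network_signal(diff: Mapping[str, Any]) -> bool:
    """True if any audit-diff field points at a networking/firewall change."""
    return (
        int(diff.get("new_domain_count", 0)) > 0
        or int(diff.get("status_change_count", 0)) > 0
        or int(diff.get("anomaly_count", 0)) > 0
        or bool(diff.get("has_anomalies", False))
    )


# Values reported for base 26951925565 vs failure 27014847510.
go_namer_diff = {
    "new_domain_count": 0,
    "status_change_count": 0,
    "anomaly_count": 0,
    "has_anomalies": False,
}
print(has_network_signal(go_namer_diff))
```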

Implication for the fix

Because both the small-agent (Documentation Healer) and the agent (Daily Go Function Namer) variants fail with the same 400, the defect is not per-workflow / per-variant configuration. The effort parameter is being attached to the Claude engine request for models that reject it, regardless of model-size variant. The fix should guard effort at the engine/token-steering layer (omit it whenever the resolved concrete model does not advertise effort support) rather than editing individual workflows.
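
As an illustration of guarding effort at the engine layer, the sketch below attaches the parameter only when the resolved concrete model is known to accept it. Everything named here is hypothetical: the capability set, the model identifier in it, the request shape, and `build_request` are placeholders rather than the engine's actual API, and the real capability data would have to come from wherever the engine resolves model aliases.

```python
from typing import Any, Dict, Optional, Set

# Hypothetical capability table; real model identifiers must come from the engine.
EFFORT_CAPABLE_MODELS: Set[str] = {"example-model-with-effort"}


def build_request(
    resolved_model: str,
    base: Dict[str, Any],
    effort: Optional[str],
    effort_capable: Set[str] = EFFORT_CAPABLE_MODELS,
) -> Dict[str, Any]:
    """Copy the base request and add effort only for models that advertise support."""
    request = dict(base, model=resolved_model)
    if effort is not None and resolved_model in effort_capable:
        request["effort"] = effort
    # For any other model the parameter is omitted, avoiding the 400 seen in both runs.
    return request


# Both variants go through the same guard, so no per-workflow edit is needed.
print(build_request("model-behind-small-agent", {"max_turns": 10}, "high"))
print(build_request("example-model-with-effort", {"max_turns": 10}, "high"))
```

Keying the guard on the resolved model rather than on the variant name (agent, small-agent) matters because the same variant can map to different concrete models over time.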

Updated affected-workflow table (Cluster A)

| Workflow | Variant | Run | First failure |
| --- | --- | --- | --- |
| Daily Documentation Healer | small-agent | 26986947133 | 2026-06-05T00:01Z |
| Daily Go Function Namer | agent | 27014847510 | 2026-06-05T12:28Z |

Updated success criteria (Cluster A)

  • No 400 ... effort parameter error on any model-size variant (agent and small-agent).
  • Documentation Healer and Daily Go Function Namer each green for ≥2 consecutive scheduled runs.

Other failures in this 6h window (no action; for the record)

  • Test Quality Sentinel run 27013336594 (pull_request, 11:54Z): Copilot CLI 15-minute execution timeout after 50 turns / ~1.5M tokens on PR branch copilot/aw-compat-fix-codemod-issue. Assessed one-off (11 surrounding TQS runs succeeded; failure class = execution-timeout on a large PR), not a systemic cluster — no issue filed.
  • CGO (27013363820) and CJS (27013363789): non-agentic compile/build CI checks — out of scope for agentic-failure tracking.

Analyzed run IDs: 27014847510, 26951925565, 27013336594 · Window: last 6h ending ~2026-06-05T13:46Z.

Generated by 🔍 [aw] Failure Investigator (6h) · 272.5 AIC ·
