Upgrade guard22/opencode-multi-auth-codex so it is reliable and low-maintenance for daily use, with strict pre-promotion validation.
Primary goal: multi-account Codex linking + rotation that just works.
- Repository location for implementation and tests: `/Users/jorgitin/Documents/projects/open_multi_auth`
- Baseline upstream: https://github.com/guard22/opencode-multi-auth-codex
- This work is a hardening/improvement pass on upstream plus custom behavior changes.
- Implement in source TypeScript (`src/*`), tests, and docs. Do not hand-edit `dist/*` for release.
- Account-level re-auth from dashboard/API.
- Account-level iOS-like `Enabled` switch from dashboard/API.
- Remove legacy disable control; `Enabled` switch becomes the disable mechanism.
- Keep force toggle separate from account `Enabled` switch behavior and UI.
- Limits probe can fail from model/config incompatibility (for example `xhigh` with probe model).
- Failed probe runs may emit `token_count`; those must never overwrite authoritative limits.
- Weekly/five-hour values can look inconsistent if stale/error-derived snapshots are shown as fresh.
- Dashboard/API limits must come from the last successful authoritative probe only.
- Failed/incomplete probe runs must not mutate stored `weekly/fiveHour` values.
- If latest probe fails, expose error + freshness/confidence state.
- If no successful snapshot exists yet, show `unknown` (never synthetic `0%`).
- Each phase below is an independent packet with its own markdown TODOs, verification, and handoff outputs.
- You can execute phases in separate model sessions and with different models.
- Recommended order is `A -> I`, but each phase can be completed and validated independently.
- A phase is considered complete only when its local verification checklist is green.
- If a phase changes shared contracts, rerun affected verification in prior completed phases.
- Core connector reliability is first priority.
- Dashboard is localhost-only (`127.0.0.1` / loopback); no remote dashboard mode.
- Weighted scheduler is opt-in (`weighted-round-robin`).
- Force mode pins one account for 24h and auto-reverts to previous strategy.
- Antigravity remains in codebase but is off by default and hidden unless explicitly enabled.
- Promotion policy: canary -> weighted enablement -> 72h stability watch.
- Rollback trigger: any critical reliability/security incident.
- Store supports versioned schema (`v2`) with deterministic migration from `v1`.
- Per-account fields include: `enabled`, `disabledAt`, `disabledBy`, `disableReason` (optional).
- Limits fields include: `rateLimits`, `rateLimitHistory`, `limitStatus`, `limitError`, `lastLimitProbeAt`, `lastLimitErrorAt`.
- Force fields include: `forcedAlias`, `forcedUntil`, `previousRotationStrategy`, `forcedBy`.
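A minimal sketch of how the `v1` to `v2` migration could look, assuming a hypothetical `v1` layout that only stores an alias per account. The interface names, field types, and the `version` field are illustrative assumptions; only the per-account and force field names come from the contract above.

```typescript
// Hypothetical store shapes. Only the field names listed in the store
// contract are taken from the plan; types and the v1 layout are assumed.
interface AccountV1 {
  alias: string;
}

interface AccountV2 extends AccountV1 {
  enabled: boolean;
  disabledAt?: string;
  disabledBy?: string;
  disableReason?: string;
}

interface StoreV1 {
  version: 1;
  accounts: Record<string, AccountV1>;
}

interface StoreV2 {
  version: 2;
  accounts: Record<string, AccountV2>;
  forcedAlias?: string;
  forcedUntil?: number;
  previousRotationStrategy?: string;
  forcedBy?: string;
}

// Deterministic migration: the same v1 input always yields the same v2
// output. Aliases are processed in sorted order, existing account data is
// preserved, and every migrated account starts enabled (an assumption).
function migrateV1toV2(store: StoreV1): StoreV2 {
  const accounts: Record<string, AccountV2> = {};
  for (const alias of Object.keys(store.accounts).sort()) {
    const account = store.accounts[alias];
    if (!account) continue;
    accounts[alias] = { ...account, enabled: true };
  }
  return { version: 2, accounts };
}
```

The function takes no clock or random input, so running it twice on the same file gives byte-identical JSON. That makes the "migration executes deterministically" check a simple equality assertion.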
- Mutating endpoints return deterministic typed errors (`code`, `message`, optional `details`).
- Alias-scoped mutating endpoints must only mutate target alias.
- `GET /api/accounts` includes eligibility/health/cooldown + limits freshness/confidence metadata.
- Probe sets compatible reasoning effort explicitly (`low` default, env-overridable).
- Probe retries/fallback handles `unsupported_value/reasoning.effort` failures.
- Failed/incomplete probe sessions are non-authoritative and cannot overwrite limits.
- No-authoritative-data state is represented as `unknown`.
- Decision logs include `requestId`, `strategy`, `selectedAlias`, `fallbackReason`, `forceState`, `enabledState`.
- Sensitive token material is redacted in all log paths.
- Clone upstream repo into `/Users/jorgitin/Documents/projects/open_multi_auth`.
- Install dependencies and build baseline.
- Capture baseline behavior (`status`, `web`, limits refresh, logs).
- Record baseline in `docs/QA.md`.
L0: Repository Integrity
- Clone completes without corruption (`git status` clean)
- All tracked files present (`git ls-files | wc -l` matches upstream)
- No uncommitted changes in baseline
L1: Build Verification
- `npm ci` completes with zero exit code
- `npm run build` produces `dist/` directory
- `npx tsc --noEmit` passes with zero type errors
- Entry point (`dist/cli.js`) exists and is executable
L2: Baseline Behavior Capture
- `node dist/cli.js status` executes without crash
- `node dist/cli.js web --help` shows expected help text
- Dashboard starts on port (manual check: `node dist/cli.js web --port 9999`)
- Baseline limits refresh command executes
- Log files generated in expected location
L3: Documentation
- `docs/QA.md` created with baseline metrics
- Baseline command outputs captured
- Known issues from upstream documented
- Baseline artifact paths + command results documented.
- Baseline behavior snapshot stored for comparison.
- QA.md baseline section complete.
- Replace recursive retries with bounded iterative attempts.
- Enforce per-request max attempts (`<= eligible account count`).
- Add deterministic fail-fast when no eligible account (`NO_ELIGIBLE_ACCOUNTS`).
- Add top-level web route error guard.
- Enforce localhost-only host binding.
- Add store schema validation on load.
- Add store migration framework for version upgrades.
- Add in-process write lock around read-modify-write operations.
- Harden persistence write (`tmp -> fsync(file) -> rename -> fsync(dir)`).
- Add last-known-good snapshot + auto-restore on parse/validation failure.
- Add explicit `OPENCODE_MULTI_AUTH_CODEX_AUTH_FILE` override.
- Add OAuth callback redirect-port fallback behavior.
- Add health hysteresis/cooldown to prevent account flapping.
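The bounded-retry and fail-fast items above could be sketched as follows. This is an illustration under assumptions, not the upstream implementation: `RotationError`, `withBoundedAttempts`, and the `send` callback are hypothetical names, and only the `NO_ELIGIBLE_ACCOUNTS` code comes from the plan.

```typescript
// Hypothetical typed error; only the NO_ELIGIBLE_ACCOUNTS code is from the plan.
class RotationError extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}

// Iterative replacement for a recursive retry. Each eligible alias is tried
// at most once, so attempts per request never exceed the eligible count.
async function withBoundedAttempts<T>(
  eligibleAliases: readonly string[],
  send: (alias: string) => Promise<T>,
): Promise<T> {
  if (eligibleAliases.length === 0) {
    // Deterministic fail-fast: no network call is made at all.
    throw new RotationError("NO_ELIGIBLE_ACCOUNTS", "no eligible account for this request");
  }
  let lastError: unknown;
  for (const alias of eligibleAliases) {
    try {
      return await send(alias);
    } catch (err) {
      lastError = err; // remember the failure and move on to the next alias
    }
  }
  throw lastError;
}
```

Because the loop is driven by the alias list rather than by recursion, the attempt bound is visible in the code and cannot grow with the error path.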
L1: Retry & Fail-Fast Unit Tests
- Unit test: Recursive retry replaced with bounded iteration (max 3 attempts)
- Unit test: Per-request attempts never exceed eligible account count
- Unit test: `NO_ELIGIBLE_ACCOUNTS` error thrown deterministically when no accounts available
- Unit test: Retry counter increments correctly per attempt
- Unit test: Exponential backoff calculation (if implemented)
L2: Store Reliability Tests
- Unit test: Schema validation rejects malformed store files
- Unit test: v1 to v2 migration executes deterministically
- Unit test: Migration preserves existing account data
- Unit test: Write lock prevents concurrent read-modify-write corruption
- Unit test: Atomic write sequence (`tmp -> fsync -> rename -> fsync`)
- Crash-recovery test: Interrupt during write, verify auto-restore from last-known-good
- Crash-recovery test: Corrupted store file triggers auto-restore
- Crash-recovery test: Empty store file triggers initialization
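For reference, the `tmp -> fsync(file) -> rename -> fsync(dir)` sequence these tests exercise can be sketched with Node's synchronous `fs` API. The function name `writeFileAtomic` and the temp-file naming scheme are assumptions. The directory fsync is a POSIX technique and may not be supported on every platform (for example Windows).

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch of a durable write: the target path only ever holds either the old
// complete file or the new complete file, never a partial one.
function writeFileAtomic(target: string, data: string): void {
  const dir = path.dirname(target);
  // Hypothetical temp-file name in the same directory, so rename stays on
  // one filesystem.
  const tmp = path.join(dir, `.${path.basename(target)}.${process.pid}.tmp`);

  // 1. Write the new content to the temp file and flush it to disk.
  const fd = fs.openSync(tmp, "w", 0o600);
  try {
    fs.writeFileSync(fd, data);
    fs.fsyncSync(fd);
  } finally {
    fs.closeSync(fd);
  }

  // 2. Swap the temp file into place.
  fs.renameSync(tmp, target);

  // 3. Flush the directory entry so the rename itself survives a crash.
  //    Assumes a platform where directories can be opened and fsynced.
  const dirFd = fs.openSync(dir, "r");
  try {
    fs.fsyncSync(dirFd);
  } finally {
    fs.closeSync(dirFd);
  }
}
```

A crash before step 2 leaves the previous store untouched plus a stray temp file, which is the state the crash-recovery tests would need to tolerate.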
L3: Web & Security Tests
- Unit test: Localhost-only binding rejects non-loopback hosts
- Integration test: Dashboard binds only to `127.0.0.1`
- Unit test: Top-level error guard catches unhandled route errors
- Integration test: OAuth callback handles port fallback correctly
L4: Health & Cooldown Tests
- Unit test: Account health hysteresis prevents rapid flapping
- Unit test: Cooldown period respected after account marked unhealthy
- Integration test: Account transitions through health states correctly
L5: Environment Override Tests
- Unit test: `OPENCODE_MULTI_AUTH_CODEX_AUTH_FILE` overrides default path
- Integration test: Custom auth file path loads correctly
- Reliability change log + failing case reproductions now passing.
- Test coverage report showing >80% on modified files.
- Crash-recovery evidence documented in QA.md.
- Probe model compatibility hardening:
  - pass `-c model_reasoning_effort="low"` in probe exec
  - allow env override for probe effort
  - probe model order prefers `gpt-5.3-codex` first
  - fallback-retry on `unsupported_value/reasoning.effort`
- Probe authority hardening:
  - accept limits only from successful completed probe sessions
  - reject failed/incomplete/usage-limit sessions as data sources
  - preserve prior limits on probe failure
  - update only error metadata (`limitStatus`, `limitError`, `lastLimitErrorAt`) on failed probe
- Add freshness/confidence state (`fresh`, `stale`, `error`, `unknown`) to API/UI.
- Represent missing-authoritative-data as `unknown` (not `0%`).
- Add probe source diagnostics (model, source session/file, timestamp).
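The probe authority rules above can be sketched as a pure update function. The type shapes, the `applyProbeResult` name, and the `limitStatus` values are assumptions for illustration. The field names and the 100-entry FIFO history cap are taken from this plan.

```typescript
// Hypothetical shapes; percentages are assumed to be 0..100.
interface RateLimits {
  weekly: number;
  fiveHour: number;
}

interface LimitsState {
  rateLimits?: RateLimits; // undefined means "unknown", never a synthetic 0%
  rateLimitHistory: { at: string; limits: RateLimits }[];
  limitStatus: "ok" | "error" | "unknown";
  limitError?: string;
  lastLimitProbeAt?: string;
  lastLimitErrorAt?: string;
}

type ProbeResult =
  | { ok: true; limits: RateLimits; at: string }
  // A failed run may still carry token_count-derived numbers.
  | { ok: false; error: string; at: string; tokenCount?: RateLimits };

function applyProbeResult(prev: LimitsState, probe: ProbeResult): LimitsState {
  if (!probe.ok) {
    // Non-authoritative: touch error metadata only. probe.tokenCount is
    // deliberately ignored so stored limits and history stay as they were.
    return {
      ...prev,
      limitStatus: "error",
      limitError: probe.error,
      lastLimitErrorAt: probe.at,
    };
  }
  return {
    ...prev,
    rateLimits: probe.limits,
    // Append and keep the newest 100 entries (FIFO eviction).
    rateLimitHistory: [...prev.rateLimitHistory, { at: probe.at, limits: probe.limits }].slice(-100),
    limitStatus: "ok",
    limitError: undefined,
    lastLimitProbeAt: probe.at,
  };
}
```

Keeping the update pure means the write lock only has to wrap a single read, apply, write cycle, and the failed-probe invariant can be checked without any I/O.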
Pre-Test Setup:
- Sandbox environment isolated (`HOME=/tmp/oma-sandbox-limits`, separate auth file)
- Baseline snapshot captured of existing limits behavior
L1: Parser Authority Unit Tests
- Unit test: Parser accepts only fully successful probe sessions as authoritative
- Unit test: Parser rejects sessions with `token_count` on error/failure
- Unit test: Parser rejects incomplete/crashed probe sessions
- Unit test: Parser correctly identifies `unsupported_value/reasoning.effort` errors
- Unit test: Probe fallback model ordering (`gpt-5.3-codex` prioritized)
- Unit test: Env override `OPENCODE_MULTI_AUTH_PROBE_EFFORT` takes precedence
- Unit test: Default probe effort is `low` when not specified
L2: Limits Mutation Integration Tests
- Integration test: Failed probe cannot overwrite prior valid limits
- Integration test: Probe returning HTTP 200 but with error body does not mutate limits
- Integration test: Probe with partial `token_count` data on failure preserves prior limits
- Integration test: Interrupted probe (SIGTERM during exec) leaves limits unchanged
- Integration test: No successful snapshot exists -> display shows `unknown`
- Integration test: Successful probe updates limits AND appends to `rateLimitHistory`
- Integration test: Concurrent probes cannot cause race condition in limits update
- Integration test: Limits history maintains max 100 entries, FIFO eviction
L3: Freshness/Confidence State Tests
- Unit test: `fresh` state when probe succeeded within last 5 minutes
- Unit test: `stale` state when probe succeeded 5-60 minutes ago
- Unit test: `error` state when last probe failed and < 60 min since success
- Unit test: `unknown` state when no successful probe ever OR last success > 60 min
- API test: `GET /api/accounts` includes `limitsConfidence` field with correct enum
- API test: `GET /api/accounts` includes `lastLimitProbeAt` ISO timestamp
- API test: `GET /api/accounts` includes `limitStatus` field
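The four freshness states above map onto a small pure function. This is a sketch: the `limitsConfidence` function name and millisecond-timestamp parameters are assumptions. The behavior exactly at the 5 and 60 minute boundaries is also an assumption, since the checklist does not pin it down.

```typescript
type LimitsConfidence = "fresh" | "stale" | "error" | "unknown";

const FRESH_WINDOW_MS = 5 * 60 * 1000; // 5 minutes
const MAX_AGE_MS = 60 * 60 * 1000; // 60 minutes

// All arguments are epoch milliseconds. lastSuccessAt / lastErrorAt are the
// parsed lastLimitProbeAt / lastLimitErrorAt values, if present.
function limitsConfidence(
  now: number,
  lastSuccessAt?: number,
  lastErrorAt?: number,
): LimitsConfidence {
  // No authoritative snapshot at all.
  if (lastSuccessAt === undefined) return "unknown";

  const age = now - lastSuccessAt;
  // A snapshot older than 60 minutes is no longer trusted.
  if (age > MAX_AGE_MS) return "unknown";

  // The most recent probe failed after the last success.
  if (lastErrorAt !== undefined && lastErrorAt > lastSuccessAt) return "error";

  // Assumed boundary: exactly 5 minutes old counts as stale.
  return age < FRESH_WINDOW_MS ? "fresh" : "stale";
}
```

Passing `now` in explicitly keeps the unit tests free of fake timers.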
L4: Model Compatibility & Error Handling
- Integration test: `xhigh` reasoning config does not break probe execution
- Integration test: `unsupported_value` error triggers retry with `low` effort
- Integration test: All retry attempts exhausted -> probe marked failed, limits preserved
- Integration test: Probe diagnostics include model name, timestamp, source log file
L5: Display Accuracy Tests
- Headless UI test: Account with no data shows "unknown" (not "0%")
- Headless UI test: Stale limits show visual indicator (badge/color change)
- Headless UI test: Error state shows error icon + tooltip with last error
- Headless UI test: Weekly/five-hour percentages reflect actual authoritative data
L6: Regression Prevention
- Test: Simulate old bug - probe with `token_count` on failure -> verify limits unchanged
- Test: Weekly and five-hour values never appear inconsistent after probe failure
- Test: Rapid successive probe failures don't corrupt stored limit history
L7: Contract Verification
- Verify `Limits Data Quality Contract` (Section 4.3) all items pass
- Verify `Permanent Limits Accuracy Requirements` (Section 1.5) all items pass
- Before/after evidence for weekly/five-hour consistency behavior.
- Test execution log with all 25+ test cases passing.
- QA entry in `docs/QA.md` with specific test commands and results.
- Add persisted account availability fields (`enabled`, `disabledAt`, `disabledBy`, `disableReason`).
- Exclude disabled accounts from all eligibility checks.
- Add APIs:
  - `GET /api/accounts`
  - `PUT /api/accounts/:alias/enabled`
  - `POST /api/accounts/:alias/reauth`
- Dashboard controls:
  - iOS-like `Enabled` switch per account with states `off|on|updating|error`
  - `Re-auth` action per account with states `idle|in-progress|success|error`
  - disable toggle while request is in flight (double-submit safe)
- Remove legacy disable button/control from dashboard.
- Ensure `Enabled` switch is the only disable mechanism.
- Re-auth contract: only targeted alias credentials mutate.
Pre-Test Setup:
- Sandbox environment isolated (`HOME=/tmp/oma-sandbox-accounts`, separate auth file)
- Minimum 2 test accounts configured
- Legacy disable control removed and verified absent
L1: Store & Persistence Tests
- Unit test: `enabled` field persists and survives restart
- Unit test: `disabledAt` timestamp set on disable
- Unit test: `disabledBy` captures actor identity
- Unit test: `disableReason` optional field stores correctly
- Unit test: Disabled accounts excluded from rotation eligibility
- Unit test: Schema validation accepts new account fields
- Integration test: Disable one account, verify other accounts still rotate
- Persistence test: Restart process, verify disabled state preserved
L2: Enable/Disable API Tests
- API test: `PUT /api/accounts/:alias/enabled` with `true` enables account
- API test: `PUT /api/accounts/:alias/enabled` with `false` disables account
- API test: Disable returns deterministic error for unknown alias (404)
- API test: Disable returns deterministic error for already disabled alias (409)
- API test: Toggle response includes updated account state
- API test: `GET /api/accounts` includes `enabled` field for all accounts
- API test: `GET /api/accounts` includes eligibility metadata
- API test: Double-submit protection (concurrent toggles rejected)
L3: Re-auth API Tests
- API test: `POST /api/accounts/:alias/reauth` targets only specified alias
- API test: Re-auth does not mutate other account credentials
- API test: Re-auth returns deterministic error for unknown alias
- API test: Re-auth returns deterministic error for disabled alias
- API test: Active alias re-auth maintains active auth pointer consistency
- API test: Re-auth progress state transitions correctly
- API test: Re-auth success updates credentials and metadata
- API test: Re-auth failure preserves previous credentials (no partial update)
L4: Dashboard UI Tests (Headless)
- UI test: Accounts table renders all aliases with `Enabled` switch
- UI test: Switch state `off` displays correctly for disabled accounts
- UI test: Switch state `on` displays correctly for enabled accounts
- UI test: Switch state `updating` displays during toggle in-flight
- UI test: Switch state `error` displays on toggle failure
- UI test: Double-click prevention (switch disabled during request)
- UI test: Toggle persists across page reload
- UI test: Legacy disable button/control is completely absent
- UI test: Re-auth button shows `idle` state initially
- UI test: Re-auth button shows `in-progress` during re-auth
- UI test: Re-auth button shows `success` on completion
- UI test: Re-auth button shows `error` on failure
- UI test: `Enabled` switch is the ONLY disable mechanism in UI
L5: Eligibility & Rotation Tests
- Integration test: Disabled account never selected by rotation
- Integration test: Rotation skips disabled accounts in round-robin
- Integration test: All accounts disabled -> `NO_ELIGIBLE_ACCOUNTS` error
- Integration test: Re-enabled account becomes eligible immediately
- Integration test: Health checks respect disabled state (no health checks for disabled)
L6: Contract Verification
- Verify `Store Contract` Section 4.1 per-account fields pass
- Verify `API Contract` Section 4.2 alias-scoped mutation passes
- Verify Section 1.3 custom changes (Enabled switch only disable mechanism) pass
L7: Security & Edge Cases
- Security test: Cannot disable last enabled account (prevent lockout)
- Security test: API rejects enable/disable for non-existent accounts
- Edge case: Rapid enable/disable toggles maintain consistency
- Edge case: Re-auth during active request handled gracefully
- Screenshots/test artifacts for account controls.
- Test execution log with all 35+ test cases passing.
- QA entry in `docs/QA.md` with specific test commands and results.
- Evidence that legacy disable control is removed.
- Persist force state (`forcedAlias`, `forcedUntil`, `previousRotationStrategy`, `forcedBy`).
- Force behavior:
  - active + eligible forced alias always selected
  - forced alias ineligible -> immediate clear + restore previous strategy
  - forced alias manually disabled -> immediate clear + restore previous strategy
  - forced alias removed -> auto-clear + restore
  - expiry -> clear + restore
  - TTL anchored to first activation (no extension on re-toggle)
- Force API:
  - `GET /api/force`
  - `POST /api/force`
  - `POST /api/force/clear`
- Force dashboard toggle (iOS-like) with states `off|on|updating|error`.
- Keep force toggle visually/behaviorally separate from account `Enabled` switches.
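One way to express the force behavior rules is a pair of pure functions over the persisted force fields. The function names, the `ForceState` shape, and the choice to model disabled and removed aliases as "not in the eligible set" are assumptions. The 24h TTL and the clear-and-restore rules come from this plan.

```typescript
// Persisted force fields from the plan; timestamps assumed to be epoch ms.
interface ForceState {
  forcedAlias?: string;
  forcedUntil?: number;
  previousRotationStrategy?: string;
  forcedBy?: string;
}

const FORCE_TTL_MS = 24 * 60 * 60 * 1000; // force mode pins for 24h

function activateForce(
  state: ForceState,
  alias: string,
  currentStrategy: string,
  by: string,
  now: number,
): ForceState {
  const stillActive = state.forcedUntil !== undefined && state.forcedUntil > now;
  // Re-toggling the same alias keeps the original expiry: the TTL is
  // anchored to the first activation and is not extended.
  if (state.forcedAlias === alias && stillActive) return state;
  return {
    forcedAlias: alias,
    forcedUntil: now + FORCE_TTL_MS,
    // Keep the strategy remembered by an earlier force, if there is one.
    previousRotationStrategy: state.previousRotationStrategy ?? currentStrategy,
    forcedBy: by,
  };
}

function resolveForce(
  state: ForceState,
  currentStrategy: string,
  eligible: ReadonlySet<string>,
  now: number,
): { selectedAlias?: string; strategy: string; state: ForceState } {
  const { forcedAlias, forcedUntil } = state;
  if (
    forcedAlias !== undefined &&
    forcedUntil !== undefined &&
    now < forcedUntil &&
    eligible.has(forcedAlias)
  ) {
    // Active and eligible: the forced alias always wins.
    return { selectedAlias: forcedAlias, strategy: currentStrategy, state };
  }
  // Expired, removed, disabled, or otherwise ineligible: clear force state
  // and restore the strategy that was active before forcing.
  return { strategy: state.previousRotationStrategy ?? currentStrategy, state: {} };
}
```

Calling `resolveForce` on every selection means expiry and ineligibility are both handled lazily at request time, with no background timer needed for correctness.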
Pre-Test Setup:
- Sandbox environment isolated
- Minimum 3 test accounts configured
- Force toggle visually distinct from account `Enabled` switches
L1: Force State Persistence Tests
- Unit test: `forcedAlias` persists and survives restart
- Unit test: `forcedUntil` TTL calculated correctly (24h from activation)
- Unit test: `previousRotationStrategy` stored correctly
- Unit test: `forcedBy` captures actor identity
- Integration test: Restart process, verify force state preserved
L2: Force Selection Logic Tests
- Unit test: Active + eligible forced alias always selected
- Unit test: Forced alias ineligible -> immediate clear + restore strategy
- Unit test: Forced alias disabled -> immediate clear + restore strategy
- Unit test: Forced alias removed -> auto-clear + restore
- Unit test: Expired force (`forcedUntil` passed) -> clear + restore
- Unit test: TTL does NOT extend on re-toggle within 24h
- Unit test: Force cleared on new force activation (different alias)
L3: Force API Tests
- API test: `GET /api/force` returns current force state
- API test: `POST /api/force` with alias activates force mode
- API test: Force activation requires existing, enabled alias
- API test: Force activation stores `forcedUntil` timestamp
- API test: Force activation stores `previousRotationStrategy`
- API test: `POST /api/force/clear` deactivates force and restores strategy
- API test: Clear returns current strategy after restoration
- API test: API returns deterministic errors for invalid operations
L4: Force Dashboard UI Tests (Headless)
- UI test: Force toggle visually/behaviorally separate from account switches
- UI test: Force toggle state `off` when no force active
- UI test: Force toggle state `on` when force active
- UI test: Force toggle state `updating` during activation/clear
- UI test: Force toggle state `error` on operation failure
- UI test: Force toggle shows forced alias name when active
- UI test: Force toggle shows remaining TTL when active
- UI test: Expired force automatically shows `off` state
L5: Integration & Rotation Tests
- Integration test: Rotation selects forced alias when active and eligible
- Integration test: Rotation returns to previous strategy after force cleared
- Integration test: Force survives across multiple requests
- Integration test: Decision logs include `forceState` field
L6: Contract Verification
- Verify `Store Contract` Section 4.1 force fields pass
- Verify `Observability Contract` Section 4.4 decision logs include force state
- Verify Section 1.3 (force toggle separate from Enabled switch) passes
L7: Edge Cases & Lifecycle
- Edge case: Force activated when no accounts exist
- Edge case: Force activated with expired TTL
- Edge case: Force cleared when no force active
- Lifecycle test: Complete force lifecycle observed (activate -> use -> expire)
- Lifecycle test: Force survives 48h soak (Phase H requirement)
- Force lifecycle evidence including expiry/clear paths.
- Test execution log with all 30+ test cases passing.
- QA entry in `docs/QA.md` with specific test commands and results.
- Add persisted settings model.
- Implement precedence: defaults -> persisted -> runtime config -> env.
- Implement `weighted-round-robin` strategy.
- Add weighted presets: Balanced, Conservative, Aggressive.
- Add advanced validation:
  - thresholds in `0..100`
  - `critical < low`
  - weights in `(0,1]`
  - minimum weight constraints
- Add settings API (`GET/PUT/reset`) and settings panel UI.
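The precedence chain and validation rules above could look roughly like this. The `Settings` field names (`lowThreshold`, `criticalThreshold`, `weights`) and both function names are hypothetical. The minimum-weight constraint is left out because the plan does not state its value.

```typescript
// Hypothetical settings shape for illustration.
interface Settings {
  rotationStrategy: string;
  lowThreshold: number;
  criticalThreshold: number;
  weights: Record<string, number>;
}

// Later layers win: defaults -> persisted -> runtime config -> env.
function resolveSettings(defaults: Settings, ...layers: Partial<Settings>[]): Settings {
  return layers.reduce<Settings>((acc, layer) => ({ ...acc, ...layer }), defaults);
}

// Returns one message per violated constraint; an empty array means valid.
function validateSettings(s: Settings): string[] {
  const errors: string[] = [];
  const thresholds: [string, number][] = [
    ["lowThreshold", s.lowThreshold],
    ["criticalThreshold", s.criticalThreshold],
  ];
  for (const [name, value] of thresholds) {
    // Written as a negated range check so NaN is rejected too.
    if (!(value >= 0 && value <= 100)) errors.push(`${name} must be in 0..100`);
  }
  if (!(s.criticalThreshold < s.lowThreshold)) {
    errors.push("criticalThreshold must be < lowThreshold");
  }
  for (const [alias, weight] of Object.entries(s.weights)) {
    if (!(weight > 0 && weight <= 1)) errors.push(`weight for ${alias} must be in (0,1]`);
  }
  return errors;
}
```

A caller would pass layers in precedence order, for example `resolveSettings(defaults, persisted, runtimeConfig, envOverrides)`, and reject the result if `validateSettings` returns any messages.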
Pre-Test Setup:
- Sandbox environment isolated
- Minimum 3 test accounts with varying weights
L1: Settings Persistence Tests
- Unit test: Settings model persists and survives restart
- Unit test: Settings precedence: defaults -> persisted -> runtime -> env
- Unit test: Environment variables override persisted settings
- Unit test: Runtime config overrides persisted but not env
- Integration test: Restart process, verify settings preserved
L2: Settings Validation Tests
- Unit test: Threshold must be in range `0..100`
- Unit test: `critical` threshold must be `< low` threshold
- Unit test: Weight must be in range `(0,1]`
- Unit test: Minimum weight constraints enforced
- Unit test: Invalid settings rejected with deterministic error
- API test: `PUT /api/settings` validates input and rejects invalid
- API test: Validation errors include specific field and constraint
L3: Weighted Rotation Tests
- Unit test: Weighted distribution respects account weights
- Unit test: Higher weight accounts selected more frequently
- Unit test: All eligible accounts receive some traffic (no starvation)
- Unit test: Weighted calculation with 2+ accounts
- Unit test: Weighted calculation with 5+ accounts
- Integration test: Run 1000 rotations, verify distribution matches weights
- Integration test: Round-robin remains default unless explicitly changed
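For the distribution and no-starvation checks, one candidate algorithm is smooth weighted round-robin, the deterministic variant popularized by nginx. The plan does not say which weighted algorithm the plugin uses, so this is a swapped-in illustration. The class name and the use of a `Map` of alias to weight are assumptions.

```typescript
// Smooth weighted round-robin: every pick, each alias gains its weight,
// the highest running score is selected, and the winner pays back the
// total weight. Picks interleave instead of clustering.
class SmoothWeightedRoundRobin {
  private readonly weights: ReadonlyMap<string, number>;
  private readonly current = new Map<string, number>();

  constructor(weights: ReadonlyMap<string, number>) {
    this.weights = weights;
    for (const alias of weights.keys()) this.current.set(alias, 0);
  }

  next(): string {
    let total = 0;
    let best: string | undefined;
    let bestScore = -Infinity;
    for (const [alias, weight] of this.weights) {
      const score = (this.current.get(alias) ?? 0) + weight;
      this.current.set(alias, score);
      total += weight;
      if (score > bestScore) {
        best = alias;
        bestScore = score;
      }
    }
    if (best === undefined) throw new Error("NO_ELIGIBLE_ACCOUNTS");
    this.current.set(best, bestScore - total);
    return best;
  }
}

// Usage: count selections over 1000 rotations and compare with the weights.
const exampleScheduler = new SmoothWeightedRoundRobin(
  new Map([["a", 0.5], ["b", 0.25], ["c", 0.25]]),
);
const exampleCounts = new Map<string, number>();
for (let i = 0; i < 1000; i++) {
  const alias = exampleScheduler.next();
  exampleCounts.set(alias, (exampleCounts.get(alias) ?? 0) + 1);
}
```

Because the algorithm is deterministic, a 1000-rotation test can assert on counts with a tight tolerance instead of a statistical bound, and any alias with a positive weight is guaranteed to be selected eventually.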
L4: Presets Tests
- Unit test: Balanced preset applies equal weights
- Unit test: Conservative preset applies lower thresholds
- Unit test: Aggressive preset applies higher thresholds
- API test: Preset application validates and applies settings
- UI test: Preset selection updates settings correctly
L5: Settings API Tests
- API test: `GET /api/settings` returns current settings
- API test: `PUT /api/settings` updates settings atomically
- API test: Settings update validates all constraints
- API test: `POST /api/settings/reset` restores defaults
- API test: Reset clears persisted settings
- API test: Settings changes logged to audit log
L6: Dashboard Settings Panel Tests (Headless)
- UI test: Settings panel renders with current values
- UI test: Threshold inputs validate range `0..100`
- UI test: Weight inputs validate range `(0,1]`
- UI test: `critical < low` constraint enforced in UI
- UI test: Save button applies settings and shows success
- UI test: Reset button restores defaults with confirmation
- UI test: Validation errors displayed inline
- UI test: Settings persist across page reload
L7: Contract Verification
- Verify round-robin remains default strategy
- Verify weighted strategy is opt-in only
- Verify settings precedence chain works correctly
L8: Edge Cases
- Edge case: All weights equal -> uniform distribution
- Edge case: One account weight = 1.0, others minimal -> heavy skew
- Edge case: Settings update during active rotation handled gracefully
- Weighted strategy behavior report with sample distributions.
- Test execution log with all 35+ test cases passing.
- QA entry in `docs/QA.md` with specific test commands and results.
- Gate antigravity functionality behind feature flag (default off).
- Hide antigravity UI/panel unless enabled.
- Ensure non-core paths cannot affect core rotation by default.
Pre-Test Setup:
- Sandbox environment isolated
- Feature flag system implemented
L1: Feature Flag Unit Tests
- Unit test: Antigravity feature flag defaults to `false`
- Unit test: Flag can be enabled via environment variable
- Unit test: Flag can be enabled via settings API
- Unit test: Flag change persists across restart
L2: Flag-Off Behavior Tests
- Integration test: Core rotation works normally with flag off
- Integration test: No antigravity side effects on rotation decisions
- Integration test: No antigravity side effects on account selection
- Integration test: Store operations unaffected by antigravity code paths
- Integration test: Logs contain no antigravity-related entries when flag off
L3: Flag-On Behavior Tests
- Integration test: Antigravity feature reachable when flag enabled
- Integration test: Antigravity UI/panel visible when flag enabled
- Integration test: Core rotation still works correctly with flag on
- Integration test: Account selection respects antigravity rules when enabled
- Integration test: No core regression when antigravity active
L4: UI Isolation Tests (Headless)
- UI test: Antigravity panel absent when flag off
- UI test: Antigravity controls absent when flag off
- UI test: Antigravity panel present when flag on
- UI test: Flag toggle in settings works correctly
- UI test: No visual leakage of antigravity when flag off
L5: Security & Isolation Tests
- Security test: Antigravity cannot modify core rotation when flag off
- Security test: Antigravity cannot access sensitive data when flag off
- Security test: Flag change requires appropriate permissions
L6: Contract Verification
- Verify Section 3 Locked Decisions (antigravity off by default)
- Verify non-core paths do not affect core rotation by default
L7: Edge Cases
- Edge case: Flag toggled during active request
- Edge case: Flag enabled then disabled, verify clean state
- Feature-flag behavior matrix.
- Test execution log with all 20+ test cases passing.
- QA entry in `docs/QA.md` with specific test commands and results.
- L0 Build/Type gates:
  - `npm ci`
  - `npm run lint` (if script exists; otherwise document N/A)
  - `npm run build`
  - `npx tsc --noEmit`
- L1 Unit coverage:
  - rotation math/fairness
  - retry bounds
  - store migration/validation
  - account enable/disable
  - re-auth alias scope
  - force lifecycle rules
  - limits compatibility + integrity + `unknown`
- L2 Sandbox integration:
  - isolated HOME/store/auth paths only
  - account switch/re-auth/force workflows
  - freshness/confidence behavior
- L3 Failure injection:
  - 401/403 auth failures
  - 429 rate limits
  - 402 deactivated workspace
  - 400 model unsupported
  - probe session with `token_count` + final failure does not overwrite limits
- L4 Concurrency/stress:
  - parallel bursts with no lost updates
  - store remains consistent
- L5 Security/exposure:
  - reject non-loopback bind
  - deterministic errors for unknown/disabled alias mutations
  - redaction in logs verified
- L6 Soak gate:
  - 48h sandbox soak with periodic traffic
  - no crashes, no stuck rotation, no corruption
  - at least one full force lifecycle observed
- L7 Observability + SLO:
  - required decision log schema present
  - success rate >= 99.9%
  - unhandled exceptions = 0
  - p95 latency regression <= 10%
  - failed probes never mutate limits
- Run `TEST_EXECUTION_PLAN.md` end-to-end with artifacts.
- Repeat integration + headless runs 5 times to detect flakes.
- Complete QA evidence in `docs/QA.md`.
- Rewrite `README.md` (simple install/use).
- Update `docs/TESTING.md` with exact commands.
- Update `docs/OFFLINE_RECOVERY.md`.
- Update `docs/QA.md` troubleshooting matrix.
- Keep `TEST_EXECUTION_PLAN.md` current with scripts and gate order.
- Promotion steps:
  - capture baseline metrics
  - 24h canary on round-robin
  - weighted enablement
  - 72h stability watch
- Rollback drill:
  - revert to last known-good
  - restore settings snapshot
  - verify `list/status/switch/refresh`
L1: Documentation Completeness
- Doc review: `README.md` includes simple install/use instructions
- Doc review: `README.md` includes quickstart example
- Doc review: `docs/TESTING.md` lists all exact test commands
- Doc review: `docs/TESTING.md` explains test execution order
- Doc review: `docs/OFFLINE_RECOVERY.md` covers store corruption recovery
- Doc review: `docs/OFFLINE_RECOVERY.md` covers auth file recovery
- Doc review: `docs/QA.md` includes troubleshooting matrix
- Doc review: `docs/QA.md` includes all test execution results
- Doc review: `TEST_EXECUTION_PLAN.md` matches actual test scripts
L2: Script Availability Verification
- Verify: `npm run lint` script exists and runs
- Verify: `npm run build` script exists and runs
- Verify: `npm run test:unit` script exists and runs
- Verify: `npm run test:integration` script exists and runs
- Verify: `npm run test:web:headless` script exists and runs
- Verify: `npm run test:failure` script exists and runs
- Verify: `npm run test:stress` script exists and runs
- Verify: `npm run test:sandbox` script exists and runs
- Verify: `npm run test:soak:48h` script exists and runs
L3: Canary Testing
- Canary: 24h canary test started on round-robin strategy
- Canary: Metrics captured (success rate, latency, rotation distribution)
- Canary: Weighted strategy enablement test
- Canary: 72h stability watch initiated
- Canary: No critical errors during canary period
- Canary: Performance regression < 10%
L4: Rollback Drill
- Rollback: Revert to last known-good version documented
- Rollback: Settings snapshot restore procedure tested
- Rollback: `list` command works after rollback
- Rollback: `status` command works after rollback
- Rollback: `switch` command works after rollback
- Rollback: `refresh` command works after rollback
- Rollback: Rollback completes within 5 minutes
- Rollback: Zero data loss during rollback
L5: Acceptance Criteria Verification
- Verify: Every required phase packet is complete (A-I)
- Verify: Shared contracts (Section 4) remain valid
- Verify: Full validation matrix (Phase H) is green
- Verify: Permanent limits accuracy requirements met
- Verify: Account lifecycle controls complete
- Verify: Promotion and rollback readiness complete
L6: Final QA Evidence
- Complete QA evidence collected in `docs/QA.md`
- All test execution logs archived
- All handoff outputs from phases A-I present
- Phase H: 5 consecutive integration runs all pass
- Phase H: 5 consecutive headless runs all pass
- Phase H: Stress test shows no lost updates
- Phase H: Soak test 48h completed successfully
L7: Release Readiness
- Security review complete (no exposed credentials)
- Performance SLOs met (Section 3.7)
- Rollback procedure documented and tested
- Monitoring and alerting configured
- Team trained on new features
- Release-readiness checklist signed off.
- Final QA evidence package in `docs/QA.md`.
- Test execution summary with all phases passing.
- Rollback procedure documented.
- Promotion metrics documented.
- Phase ID + date/time
- Files changed
- Commands run + pass/fail
- Known limitations/deferred items
- Risks introduced (if any)
- Next-phase recommendations
- Every required phase packet is complete.
- Shared contracts (Section 4) remain valid.
- Full validation matrix (Phase H) is green.
- Permanent limits accuracy requirements are met (`unknown` + failed-probe-safe behavior).
- Account lifecycle controls are complete (`Enabled` switch replaces legacy disable; force remains separate).
- Promotion and rollback readiness complete.
- `src/index.ts`
- `src/rotation.ts`
- `src/store.ts`
- `src/types.ts`
- `src/web.ts`
- `src/auth.ts`
- `src/codex-auth.ts`
- `src/logger.ts`
- `test/` (unit + integration + stress + recovery + headless)
- `README.md`
- `docs/TESTING.md`
- `docs/OFFLINE_RECOVERY.md`
- `docs/QA.md`
- `TEST_EXECUTION_PLAN.md`