Skip to content

fix(ci): make approve-test-queue tolerate races#15684

Open
ko3n1g wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
ko3n1g:ko3n1g/fix/approve-queue-race
Open

fix(ci): make approve-test-queue tolerate races#15684
ko3n1g wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
ko3n1g:ko3n1g/fix/approve-queue-race

Conversation

@ko3n1g
Copy link
Copy Markdown
Contributor

@ko3n1g ko3n1g commented May 9, 2026

Claude summary

Why

The Approve Test Queue cron has been failing every 5 minutes today with KeyError: 'id' followed by exit 1. Example: https://github.com/NVIDIA-NeMo/NeMo/actions/runs/25600162235/job/75152606750

The actual failure mode wasn't a race — it was a wedged run. After a force-push on PR #15668, a CICD run on the discarded head was left in waiting status. GET /actions/runs/<id>/pending_deployments returned a single entry, but POST to approve it 422'd (No pending deployment requests to approve or reject). With MAX_CONCURRENCY=1 the cron picked the same orphan first every tick, crashed on a deployment['id'] lookup (no such key on /pending_deployments items), and never reached anything else.

What

Distinguish race from orphan, then remediate the orphan:

result = make_request(deployment_url, method="POST", data=status_data)
if result:
    total_workflows += 1
    continue

# POST failed. Re-GET to tell race vs orphan apart.
deployments_after = make_request(deployment_url) or []
if not deployments_after:
    # benign race — someone else approved between our GET and POST
    continue

# wedged: pending_deployments still lists it, but POST refuses
if cancel_run(workflow_id):
    print(f"Cancelled wedged run {workflow_id}")
else:
    had_unrecoverable = True
  • New cancel_run() helper: POST /actions/runs/{id}/cancel + raise_for_status.
  • If a wedged run can't be cancelled (e.g. token lacks scope), set had_unrecoverable and exit(1) after the loop, so other queue items still get processed.
  • Initial GET that returns an empty list is also handled (skip silently — same race window, just earlier).
  • KeyError: 'id' is gone; the only place we printed deployment['id'] referenced a key that does not exist on /pending_deployments items.

Verification

  • The wedged run that triggered today's outage (25597959104) has been cancelled manually; the next 5-min cron tick will be the first reproduction-free one.
  • After this PR lands, the same orphan scenario from a future force-push will be auto-cleared on the next tick.

@ko3n1g ko3n1g requested a review from a team as a code owner May 9, 2026 11:53
@github-actions github-actions Bot added the CI label May 9, 2026
Real failure mode:
A CICD run on a force-pushed PR head was left in 'waiting' status with a
non-empty /pending_deployments list, but POST to approve it 422'd ('No
pending deployment requests'). Every cron tick fetched the same orphan
first, crashed (KeyError on a non-existent 'id' key), and exit(1)'d.
With MAX_CONCURRENCY=1 the queue stayed wedged for hours.

Fix:
1. After a 422 POST, GET pending_deployments again to distinguish the
   two cases:
     - empty  -> race: another approver won, skip silently.
     - non-empty -> orphan/wedged run: cancel it via
       POST /actions/runs/{id}/cancel so the queue can move on.
2. cancel_run() is a new helper: POST + raise_for_status, returns bool.
3. If a wedged run can't even be cancelled, collect a flag and exit(1)
   at end-of-loop so the cron is loud about it (and other queue items
   still get processed first).
4. KeyError on the error path is gone — the only place a deployment
   id was printed referenced a key that doesn't exist on
   /pending_deployments items.

Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g force-pushed the ko3n1g/fix/approve-queue-race branch from b6dcff8 to a744f95 Compare May 9, 2026 11:57
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

[🤖]: Hi @ko3n1g 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant