fix(ci): make approve-test-queue tolerate races#15684
Open
ko3n1g wants to merge 1 commit into
Open
Conversation
Real failure mode:
A CICD run on a force-pushed PR head was left in 'waiting' status with a
non-empty /pending_deployments list, but POST to approve it 422'd ('No
pending deployment requests'). Every cron tick fetched the same orphan
first, crashed (KeyError on a non-existent 'id' key), and exit(1)'d.
With MAX_CONCURRENCY=1 the queue stayed wedged for hours.
Fix:
1. After a 422 POST, GET pending_deployments again to distinguish the
two cases:
- empty -> race: another approver won, skip silently.
- non-empty -> orphan/wedged run: cancel it via
POST /actions/runs/{id}/cancel so the queue can move on.
2. cancel_run() is a new helper: POST + raise_for_status, returns bool.
3. If a wedged run can't even be cancelled, collect a flag and exit(1)
at end-of-loop so the cron is loud about it (and other queue items
still get processed first).
4. KeyError on the error path is gone — the only place a deployment
id was printed referenced a key that doesn't exist on
/pending_deployments items.
Signed-off-by: oliver könig <okoenig@nvidia.com>
b6dcff8 to
a744f95
Compare
Contributor
|
[🤖]: Hi @ko3n1g 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Claude summary
Why
The
Approve Test Queuecron has been failing every 5 minutes today withKeyError: 'id'followed byexit 1. Example: https://github.com/NVIDIA-NeMo/NeMo/actions/runs/25600162235/job/75152606750The actual failure mode wasn't a race — it was a wedged run. After a force-push on PR #15668, a CICD run on the discarded head was left in
waitingstatus.GET /actions/runs/<id>/pending_deploymentsreturned a single entry, butPOSTto approve it 422'd (No pending deployment requests to approve or reject). WithMAX_CONCURRENCY=1the cron picked the same orphan first every tick, crashed on adeployment['id']lookup (no such key on/pending_deploymentsitems), and never reached anything else.What
Distinguish race from orphan, then remediate the orphan:
cancel_run()helper:POST /actions/runs/{id}/cancel+raise_for_status.had_unrecoverableandexit(1)after the loop, so other queue items still get processed.KeyError: 'id'is gone; the only place we printeddeployment['id']referenced a key that does not exist on/pending_deploymentsitems.Verification