ci: fail-fast hive-eest matrix on merge_group so broken PRs evict quickly #21483
Merged
…ckly

In a merge-queue run, when one hive-eest shard fails, the failing job calls `gh run cancel` (added in #21445). That requests cancellation, but GitHub waits for in-flight matrix siblings to acknowledge SIGTERM, and the Docker-bound hive simulators take ~20 min to wind down. ci-gate is `if: always()` and waits for every `needs` job to reach terminal state, so the broken PR stays at AWAITING_CHECKS for the full drain time and blocks the head of the queue.

Setting `fail-fast: true` on merge_group lets GitHub cancel sibling shards at the API layer the instant one shard fails (much faster than the `gh run cancel` → SIGTERM → runner-drain path). ci-gate's `needs` reach terminal state in seconds, ci-gate fails, the PR is evicted.

PR runs keep `fail-fast: false` so authors still see the full failure breakdown across every shard.

The existing per-job `gh run cancel` step is left in place: it still covers cross-workflow cancellation (matrix fail-fast only cancels siblings within the same matrix), and the "leaf that ran gh run cancel successfully" is what ci-gate.yml's root-cause annotator keys off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Enables matrix fail-fast for test-hive-eest.yml only in merge_group events to evict broken PRs from the merge queue faster, while preserving the full per-shard breakdown on PR runs.
Changes:
- Replaces `fail-fast: false` with `fail-fast: ${{ github.event_name == 'merge_group' }}` on the `test-hive-eest` matrix.
- Adds a comment explaining the merge-group vs. PR tradeoff.
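The change summarized above can be pictured with a minimal workflow sketch. Only the `fail-fast` expression and the `test-hive-eest` job name come from this PR; the trigger, the matrix entries, the runner label, and the placeholder step are hypothetical and stand in for the real contents of `test-hive-eest.yml`.

```yaml
# Sketch only: everything except the fail-fast expression is an assumption.
on:
  workflow_call: {}

jobs:
  test-hive-eest:
    strategy:
      # merge_group: the expression evaluates to true, so the first failed
      # shard makes GitHub cancel the remaining shards of this matrix.
      # Any other event: the expression is false and every shard runs to
      # completion, which keeps the full per-shard breakdown on PR runs.
      fail-fast: ${{ github.event_name == 'merge_group' }}
      matrix:
        include:
          - sim: paris+shanghai   # hypothetical shard
            mode: serial
          - sim: rlp              # hypothetical shard
            mode: serial
    runs-on: ubuntu-latest        # hypothetical runner label
    steps:
      - name: Run simulator shard (placeholder)
        run: echo "running ${{ matrix.sim }} (${{ matrix.mode }})"
```

Because the value is computed per event, a single workflow file serves both the merge queue and ordinary PR runs without duplicating the job.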
Giulio2002 approved these changes on May 28, 2026
LGTM — simple CI-only fail-fast tweak for merge_group, with PR behavior unchanged.
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request on May 29, 2026
…ech#21498)

## Background

erigontech#21483 added `fail-fast: ${{ github.event_name == 'merge_group' }}` to `test-hive-eest.yml` only, with the explicit note that other matrix-bearing reusable workflows could get the same treatment "if another workflow becomes the bottleneck." It has, and there is also a second problem erigontech#21483 didn't address: **GitHub does not auto-remove the failed PR from the merge queue**.

CI Gate run [26573584442](https://github.com/erigontech/erigon/actions/runs/26573584442) for PR erigontech#21374 demonstrated both gaps:

- `hive-eest / rlp, serial` failed at 14:29:49 from a transient Docker Hub blip (`alpine:latest` manifest HEAD returned `unknown:` while building `hive/hiveproxy`).
- hive-eest's `fail-fast` cancelled siblings in **1 second**; `hive / test-hive` (still on `fail-fast: false`) kept dispatching matrix legs: `engine, api, serial` and `engine, cancun, parallel` started at 14:36:01 / 14:36:17 (~7 min *after* `gh run cancel`) and ran to `success`. ci-gate couldn't reach a terminal state until those finished at 14:43:54, delaying eviction by **~14 minutes**.
- Even once ci-gate reported `conclusion: failure` at 14:44:05, GitHub did **not** remove PR erigontech#21374 from the queue: the entry stayed at position 2 with state `UNMERGEABLE`. The queue only advanced because PR erigontech#21483 was manually `jump`ed over it.

## Changes

### 1. Roll out merge_group fail-fast to the remaining matrix workflows

Same gating as erigontech#21483 (`${{ github.event_name == 'merge_group' }}`), applied to:

- `test-hive.yml`
- `test-all-erigon.yml`
- `test-all-erigon-race.yml`
- `test-eest-spec.yml`
- `test-bench.yml`
- `test-kurtosis-assertoor.yml`

Behaviour matches erigontech#21483: in `merge_group`, first failed shard cancels its siblings at the GitHub API layer (no waiting for runner drain); in `pull_request` / `schedule` / `workflow_dispatch`, all shards continue so authors keep the full per-shard breakdown.

### 2. Auto-dequeue UNMERGEABLE PRs whose required check failed

New step at the end of `ci-gate.yml`'s ci-gate job:

```yaml
- name: Dequeue failed merge-queue PR
  if: failure() && github.event_name == 'merge_group'
  ...
```

The step:

1. **Inspects `needs.*.result` and skips when all are `cancelled` with no `failure`.** That pattern is a queue reshuffle (a PR ahead of us merged, our merge-group SHA is stale), where GitHub re-creates a new merge_group event for us; dequeuing here would be wrong. Confirmed by run [26573568764](https://github.com/erigontech/erigon/actions/runs/26573568764), where ci-gate's job conclusion was `failure` (needs cancelled → `Check all required jobs` exits 1) but the *run* was cancelled by GitHub during a reshuffle.
2. Parses the PR number from `gh-readonly-queue/<base>/pr-<N>-<sha>` (handles multi-segment bases like `release/3.4`).
3. Resolves the PR number to a GraphQL node ID and calls the `dequeuePullRequest` mutation. Soft-fails on errors (warning, not non-zero exit) so a dequeue glitch never masks ci-gate's own failure signal.

Permissions bumped from `pull-requests: read` to `pull-requests: write` for the mutation.

## Why both in one PR

Both target the same incident class (broken PR sits at the head of the queue blocking everything else). The fail-fast change shrinks time-to-fail for ci-gate from ~14 min to seconds; the dequeue actually evicts the failed PR. Either alone is a partial fix; having both means a broken PR's run goes red fast *and* the queue advances without anyone needing to manually jump over it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
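The three behaviours listed for the dequeue step could be sketched roughly as follows. This is not the step from the referenced commit, whose body is elided above. The step name, `if:` condition, ref format, mutation name, and the `pull-requests: write` permission come from the commit message; the `needs` list, runner label, variable names, regular expression, and the exact `gh` invocations are illustrative assumptions. The skip rule is also simplified: it skips whenever no needed job reports `failure`, which is broader than "all cancelled".

```yaml
# Hypothetical sketch of a dequeue step; not the merged implementation.
jobs:
  ci-gate:
    needs: [hive-eest, hive]          # assumed subset of the real needs list
    if: always()
    runs-on: ubuntu-latest            # hypothetical runner label
    permissions:
      pull-requests: write            # required by the dequeue mutation
    steps:
      - name: Dequeue failed merge-queue PR
        if: failure() && github.event_name == 'merge_group'
        env:
          GH_TOKEN: ${{ github.token }}
          NEEDS_RESULTS: ${{ join(needs.*.result, ' ') }}
        run: |
          # 1. No needed job actually failed: treat it as a queue reshuffle
          #    and leave the PR in the queue.
          case " $NEEDS_RESULTS " in
            *" failure "*) ;;
            *) echo "No failed job in needs; not dequeuing."; exit 0 ;;
          esac

          # 2. The ref looks like
          #    refs/heads/gh-readonly-queue/<base>/pr-<N>-<sha>.
          #    Anchoring at the end keeps multi-segment bases working.
          if [[ "$GITHUB_REF" =~ /pr-([0-9]+)-[0-9a-f]+$ ]]; then
            pr="${BASH_REMATCH[1]}"
          else
            echo "::warning::Could not parse PR number from $GITHUB_REF"
            exit 0
          fi

          # 3. Resolve the node ID, then call the mutation. Errors only
          #    produce warnings so ci-gate's own failure stays visible.
          node_id="$(gh pr view "$pr" --repo "$GITHUB_REPOSITORY" --json id --jq .id)" || {
            echo "::warning::Could not resolve node ID for PR #$pr"
            exit 0
          }
          gh api graphql \
            -f query='mutation($id: ID!) { dequeuePullRequest(input: {id: $id}) { clientMutationId } }' \
            -f id="$node_id" \
            || echo "::warning::dequeuePullRequest failed for PR #$pr"
```

Every early exit returns 0, so the step itself never turns red; the job's failure still comes from whichever earlier step failed.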
Problem
When a hive-eest shard fails in a merge-queue run, the failing job calls `gh run cancel ${{ github.run_id }}` (added in #21445). That sends SIGTERM to all in-flight matrix siblings, but the Docker-bound hive simulators take ~20 minutes to actually drain. `ci-gate` is `if: always()` and waits for every `needs` job to reach a terminal state, so the broken PR sits at `AWAITING_CHECKS` for the full drain time, blocking the head of the merge queue.

Concrete example from today (PR #21470 at position #1):

- `hive-eest / test-hive-eest (paris+shanghai, serial)` fails, calls `gh run cancel 26562610423`, emits the "Merge-queue root-cause failure" annotation from #21445 (ci: surface merge-queue root cause when fail-fast cancels the run).
- … `in_progress`. Every other ci-gate child (tests, race-tests, eest-spec-tests, kurtosis, hive, lint, bench, repro, sonar, caplin) had already completed.

The bottleneck was specifically the hive-eest matrix siblings.
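The waiting behaviour described above follows from the shape of the gate job. Here is a minimal sketch of that shape, assuming a gate that depends on two reusable workflows. The `if: always()` condition, the `needs` dependency, and the `Check all required jobs` step name are mentioned in this PR and its follow-up; the job list, file paths, runner label, and the check logic (including treating `skipped` as acceptable) are assumptions.

```yaml
# Sketch of a gate job; the real ci-gate.yml has more needs and more steps.
jobs:
  hive-eest:
    uses: ./.github/workflows/test-hive-eest.yml   # assumed path
  hive:
    uses: ./.github/workflows/test-hive.yml        # assumed path

  ci-gate:
    # The gate is only scheduled once every needed job is terminal, so a
    # single slow-draining matrix shard delays the verdict for the whole run.
    needs: [hive-eest, hive]
    if: always()
    runs-on: ubuntu-latest                         # hypothetical runner label
    steps:
      - name: Check all required jobs
        env:
          RESULTS: ${{ join(needs.*.result, ' ') }}
        run: |
          # Fail the gate if any needed job ended as failure or cancelled.
          for r in $RESULTS; do
            if [ "$r" != "success" ] && [ "$r" != "skipped" ]; then
              echo "::error::required job finished with result: $r"
              exit 1
            fi
          done
```

With this shape, shortening the time for `needs` to become terminal is the only way to make the gate report sooner.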
Fix
- `merge_group`: first failed shard immediately cancels all siblings at the GitHub API layer, much faster than the `gh run cancel` → SIGTERM → runner-drain path. ci-gate's `needs` reach terminal state in seconds, ci-gate fails, the broken PR is evicted.
- PR runs: `fail-fast` stays `false`, so authors still see the full failure breakdown across every shard. No regression in PR feedback.

What's left in place and why
The per-job `gh run cancel` step (test-hive-eest.yml lines 311-317) stays. Two reasons:

1. `fail-fast` only cancels siblings within the same matrix; it doesn't cancel sibling reusable workflows. If a future failure pattern leaks across workflows, `gh run cancel` still covers it.
2. ci-gate.yml's root-cause annotator keys off the "leaf that ran `gh run cancel` successfully" to single out the true root cause among collateral cancellations. Removing the step would silently regress #21445's attribution.

Scope choice
Only `test-hive-eest.yml` is changed. Other matrix-bearing reusable workflows (`test-all-erigon.yml`, `test-all-erigon-race.yml`, `test-eest-spec.yml`, `test-kurtosis-assertoor.yml`, `test-hive.yml`, `test-bench.yml`) all use `fail-fast: false` too, but none of them was the queue-blocking long pole in this incident. Keeping the patch minimal; we can generalize if another workflow becomes the bottleneck.

Tradeoff to be aware of
Queue runs will now show siblings as `cancelled` instead of `failed` whenever any one shard fails. That's the correct tradeoff in `merge_group`: the goal is fast eviction, not detailed diagnostics; full per-shard breakdown remains available on the PR run.

🤖 Generated with Claude Code