ci: fail-fast hive-eest matrix on merge_group so broken PRs evict quickly #21483

Merged
yperbasis merged 1 commit into main from yperbasis/hive-eest-failfast-mergequeue on May 28, 2026

@yperbasis
Member

Problem

When a hive-eest shard fails in a merge-queue run, the failing job calls gh run cancel ${{ github.run_id }} (added in #21445). That sends SIGTERM to all in-flight matrix siblings, but the Docker-bound hive simulators take ~20 minutes to actually drain. ci-gate is if: always() and waits for every needs job to reach a terminal state, so the broken PR sits at AWAITING_CHECKS for the full drain time, blocking the head of the merge queue.

Concrete example from today (PR #21470 at position #1):

  • 08:29:57 — hive-eest / test-hive-eest (paris+shanghai, serial) fails, calls gh run cancel 26562610423, and emits the "Merge-queue root-cause failure" annotation introduced in #21445 (ci: surface merge-queue root cause when fail-fast cancels the run).
  • 08:48 (~19 min later) — paris+shanghai-parallel, prague-serial/parallel, cancun-serial/parallel, osaka-parallel, rlp-serial/parallel, and glamsterdam-devnet-parallel were all still in_progress. Every other ci-gate child (tests, race-tests, eest-spec-tests, kurtosis, hive, lint, bench, repro, sonar, caplin) had already completed.

The bottleneck was specifically the hive-eest matrix siblings.

Fix

strategy:
  fail-fast: ${{ github.event_name == 'merge_group' }}
  • In merge_group: first failed shard immediately cancels all siblings at the GitHub API layer — much faster than the gh run cancel → SIGTERM → runner-drain path. ci-gate's needs reach terminal state in seconds, ci-gate fails, the broken PR is evicted.
  • In PR runs: stays false, so authors still see the full failure breakdown across every shard. No regression in PR feedback.

What's left in place and why

The per-job gh run cancel step (test-hive-eest.yml lines 311-317) stays. Two reasons:

  • Matrix fail-fast only cancels siblings within the same matrix — it doesn't cancel sibling reusable workflows. If a future failure pattern leaks across workflows, gh run cancel still covers it.
  • ci-gate.yml's root-cause annotator (line 188) keys off "the leaf that ran gh run cancel successfully" to single out the true root cause among collateral cancellations. Removing the step would silently regress the attribution added in #21445 (ci: surface merge-queue root cause when fail-fast cancels the run).

Scope choice

Only test-hive-eest.yml is changed. Other matrix-bearing reusable workflows (test-all-erigon.yml, test-all-erigon-race.yml, test-eest-spec.yml, test-kurtosis-assertoor.yml, test-hive.yml, test-bench.yml) all use fail-fast: false too, but none of them were the queue-blocking long pole in this incident. Keeping the patch minimal; we can generalize if another workflow becomes the bottleneck.

Tradeoff to be aware of

Queue runs will now show siblings as cancelled instead of failed whenever any one shard fails. That's the correct tradeoff in merge_group — the goal is fast eviction, not detailed diagnostics; full per-shard breakdown remains available on the PR run.

🤖 Generated with Claude Code

…ckly

In a merge-queue run, when one hive-eest shard fails, the failing job
calls `gh run cancel` (added in #21445). That requests cancellation, but
GitHub waits for in-flight matrix siblings to acknowledge SIGTERM — and
the Docker-bound hive simulators take ~20 min to wind down. ci-gate is
`if: always()` and waits for every `needs` job to reach terminal state,
so the broken PR stays at AWAITING_CHECKS for the full drain time and
blocks the head of the queue.

Setting `fail-fast: true` on merge_group lets GitHub cancel sibling
shards at the API layer the instant one shard fails (much faster than
the `gh run cancel` → SIGTERM → runner-drain path). ci-gate's `needs`
reach terminal state in seconds, ci-gate fails, the PR is evicted.

PR runs keep `fail-fast: false` so authors still see the full failure
breakdown across every shard.

The existing per-job `gh run cancel` step is left in place: it still
covers cross-workflow cancellation (matrix fail-fast only cancels
siblings within the same matrix), and the "leaf that ran gh run cancel
successfully" is what ci-gate.yml's root-cause annotator keys off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Enables matrix fail-fast for test-hive-eest.yml only in merge_group events to evict broken PRs from the merge queue faster, while preserving the full per-shard breakdown on PR runs.

Changes:

  • Replaces fail-fast: false with fail-fast: ${{ github.event_name == 'merge_group' }} on the test-hive-eest matrix.
  • Adds a comment explaining the merge-group vs. PR tradeoff.

Contributor

@Giulio2002 Giulio2002 left a comment

LGTM — simple CI-only fail-fast tweak for merge_group, with PR behavior unchanged.

@yperbasis yperbasis enabled auto-merge May 28, 2026 10:45
@yperbasis yperbasis added this pull request to the merge queue May 28, 2026
Merged via the queue into main with commit 53a1cbc May 28, 2026
92 checks passed
@yperbasis yperbasis deleted the yperbasis/hive-eest-failfast-mergequeue branch May 28, 2026 15:44
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 29, 2026
…ech#21498)

## Background

erigontech#21483 added `fail-fast: ${{ github.event_name == 'merge_group' }}` to
`test-hive-eest.yml` only, with the explicit note that other
matrix-bearing reusable workflows could get the same treatment "if
another workflow becomes the bottleneck." It has — and there is also a
second problem erigontech#21483 didn't address: **GitHub does not auto-remove the
failed PR from the merge queue**.

CI Gate run
[26573584442](https://github.com/erigontech/erigon/actions/runs/26573584442)
for PR erigontech#21374 demonstrated both gaps:

- `hive-eest / rlp, serial` failed at 14:29:49 from a transient Docker
Hub blip (`alpine:latest` manifest HEAD returned `unknown:` while
building `hive/hiveproxy`).
- hive-eest's `fail-fast` cancelled siblings in **1 second**; `hive /
test-hive` (still on `fail-fast: false`) kept dispatching matrix legs —
`engine, api, serial` and `engine, cancun, parallel` started at 14:36:01
/ 14:36:17 (~7 min *after* `gh run cancel`) and ran to `success`.
ci-gate couldn't reach a terminal state until those finished at
14:43:54, delaying eviction by **~14 minutes**.
- Even once ci-gate reported `conclusion: failure` at 14:44:05, GitHub
did **not** remove PR erigontech#21374 from the queue: the entry stayed at
position 2 with state `UNMERGEABLE`. The queue only advanced because PR
erigontech#21483 was manually `jump`ed over it.

## Changes

### 1. Roll out merge_group fail-fast to the remaining matrix workflows

Same gating as erigontech#21483 (`${{ github.event_name == 'merge_group' }}`),
applied to:

- `test-hive.yml`
- `test-all-erigon.yml`
- `test-all-erigon-race.yml`
- `test-eest-spec.yml`
- `test-bench.yml`
- `test-kurtosis-assertoor.yml`

Behaviour matches erigontech#21483: in `merge_group`, first failed shard cancels
its siblings at the GitHub API layer (no waiting for runner drain); in
`pull_request` / `schedule` / `workflow_dispatch`, all shards continue
so authors keep the full per-shard breakdown.

### 2. Auto-dequeue UNMERGEABLE PRs whose required check failed

New step at the end of `ci-gate.yml`'s ci-gate job:

```yaml
- name: Dequeue failed merge-queue PR
  if: failure() && github.event_name == 'merge_group'
  ...
```

The step:

1. **Inspects `needs.*.result` and skips when all are `cancelled` with
no `failure`.** That pattern is a queue reshuffle (a PR ahead of us
merged, our merge-group SHA is stale), where GitHub re-creates a new
merge_group event for us; dequeuing here would be wrong. Confirmed by
run
[26573568764](https://github.com/erigontech/erigon/actions/runs/26573568764),
where ci-gate's job conclusion was `failure` (needs cancelled → `Check
all required jobs` exits 1) but the *run* was cancelled by GitHub during
a reshuffle.
2. Parses the PR number from `gh-readonly-queue/<base>/pr-<N>-<sha>`
(handles multi-segment bases like `release/3.4`).
3. Resolves the PR number to a GraphQL node ID and calls the
`dequeuePullRequest` mutation. Soft-fails on errors (warning, not
non-zero exit) so a dequeue glitch never masks ci-gate's own failure
signal.

Permissions bumped from `pull-requests: read` to `pull-requests: write`
for the mutation.
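
The three steps above can be sketched in POSIX shell roughly as follows. This is an illustration under stated assumptions, not the step that was merged: the function names (`parse_pr_number`, `should_dequeue`, `dequeue_pr`) are hypothetical, the hex SHA in the usage example is made up, and `should_dequeue` takes the conservative reading of step 1 (dequeue only when at least one `needs` result is `failure`). The `gh api` calls use the REST `node_id` field and the `dequeuePullRequest` mutation named above, but the exact query in `ci-gate.yml` may differ.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical.

# Step 2: extract <N> from a ref like
#   refs/heads/gh-readonly-queue/release/3.4/pr-<N>-<sha>
# Working on the last path segment keeps multi-segment bases intact.
parse_pr_number() {
  ppn_leaf="${1##*/}"              # last segment: pr-<N>-<sha>
  case "$ppn_leaf" in
    pr-[0-9]*-*) ;;
    *) return 1 ;;                 # not a merge-queue ref
  esac
  ppn_rest="${ppn_leaf#pr-}"       # <N>-<sha>
  ppn_num="${ppn_rest%%-*}"        # <N>
  case "$ppn_num" in
    ''|*[!0-9]*) return 1 ;;       # reject non-numeric PR numbers
  esac
  printf '%s\n' "$ppn_num"
}

# Step 1: given the results of all `needs` jobs, succeed (exit 0) only
# if at least one of them is `failure`. A set with cancellations but no
# failure is treated as a queue reshuffle and is skipped.
should_dequeue() {
  for sd_result in "$@"; do
    if [ "$sd_result" = "failure" ]; then
      return 0
    fi
  done
  return 1
}

# Step 3: resolve the PR number to a node ID and call the mutation.
# Soft-fails: any error becomes a workflow warning, never a non-zero exit.
# $1 = owner/repo, $2 = PR number
dequeue_pr() {
  dq_id="$(gh api "repos/$1/pulls/$2" --jq '.node_id')" || {
    echo "::warning::could not resolve node ID for PR #$2"
    return 0
  }
  gh api graphql \
    -f query='mutation($id: ID!) { dequeuePullRequest(input: {id: $id}) { clientMutationId } }' \
    -f id="$dq_id" >/dev/null \
    || echo "::warning::dequeue failed for PR #$2"
  return 0
}
```

A caller might wire these together as `should_dequeue "$@" && pr="$(parse_pr_number "$GITHUB_REF")" && dequeue_pr "$GITHUB_REPOSITORY" "$pr"`, passing the `needs` results as arguments. Keeping the decision and parsing logic in pure functions means they can be checked locally without a token; only `dequeue_pr` touches the API.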

## Why both in one PR

Both target the same incident class (broken PR sits at the head of the
queue blocking everything else). The fail-fast change shrinks
time-to-fail for ci-gate from ~14 min to seconds; the dequeue actually
evicts the failed PR. Either alone is a partial fix — having both means
a broken PR's run goes red fast *and* the queue advances without anyone
needing to manually jump over it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>