Skip to content

genesis: allow tests pass GenesisConfig to eth.New() - avoid double read of large genesis#21470

Open
AskAlexSharov wants to merge 14 commits into
mainfrom
alex/external_genesis_35
Open

genesis: allow tests pass GenesisConfig to eth.New() - avoid double read of large genesis#21470
AskAlexSharov wants to merge 14 commits into
mainfrom
alex/external_genesis_35

Conversation

@AskAlexSharov
Copy link
Copy Markdown
Collaborator

@AskAlexSharov AskAlexSharov commented May 28, 2026

  • NewEngineXTestRunner to allow pass config.Genesis object (because it already has it) - to avoid 2nd unmarshal of 200 MB genesis
go test -count=1 -run TestInvalidReceiptHashHighMgas -timeout=30s -v -cpuprofile=cpu_test.pprof -memprofile=mem_test.pprof ./execution/tests/

8.86s -> 6.34s

@AskAlexSharov AskAlexSharov requested a review from yperbasis as a code owner May 28, 2026 05:58
Copy link
Copy Markdown
Contributor

@Giulio2002 Giulio2002 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM \u2014 obviously small/trivial change (22 lines).

@AskAlexSharov AskAlexSharov requested a review from mh0lt as a code owner May 28, 2026 06:15
@AskAlexSharov AskAlexSharov changed the title node/eth: skip ReadGenesis when genesis already provided to avoid large JSON deserialisation [wip] node/eth: skip ReadGenesis when genesis already provided to avoid large JSON deserialisation May 28, 2026
@AskAlexSharov AskAlexSharov changed the title [wip] node/eth: skip ReadGenesis when genesis already provided to avoid large JSON deserialisation [wip] genesis: allow tests to pass extern GenesisConfig to eth.New() - avoid double read of large genesis May 28, 2026
@AskAlexSharov AskAlexSharov requested a review from taratorio May 28, 2026 07:36
@AskAlexSharov AskAlexSharov enabled auto-merge May 28, 2026 07:36
@AskAlexSharov AskAlexSharov changed the title [wip] genesis: allow tests to pass extern GenesisConfig to eth.New() - avoid double read of large genesis [wip] genesis: allow tests pass GenesisConfig to eth.New() - avoid double read of large genesis May 28, 2026
@AskAlexSharov AskAlexSharov added this pull request to the merge queue May 28, 2026
@AskAlexSharov AskAlexSharov changed the title [wip] genesis: allow tests pass GenesisConfig to eth.New() - avoid double read of large genesis genesis: allow tests pass GenesisConfig to eth.New() - avoid double read of large genesis May 28, 2026
@yperbasis yperbasis removed this pull request from the merge queue due to a manual request May 28, 2026
@Giulio2002 Giulio2002 added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 28, 2026
…ckly (erigontech#21483)

## Problem

When a merge-queue run has a hive-eest shard fail, the failing job calls
`gh run cancel ${{ github.run_id }}` (added in erigontech#21445). That sends
SIGTERM to all in-flight matrix siblings, but the Docker-bound hive
simulators take ~20 minutes to actually drain. `ci-gate` is `if:
always()` and waits for every `needs` job to reach a terminal state, so
the broken PR sits at `AWAITING_CHECKS` for the full drain time —
blocking the head of the merge queue.

Concrete example from today (PR erigontech#21470 at position #1):

- 08:29:57 — `hive-eest / test-hive-eest (paris+shanghai, serial)`
fails, calls `gh run cancel 26562610423`, emits the "Merge-queue
root-cause failure" annotation from erigontech#21445.
- 08:48 (~19 min later) — paris+shanghai-parallel,
prague-serial/parallel, cancun-serial/parallel, osaka-parallel,
rlp-serial/parallel, and glamsterdam-devnet-parallel were all still
`in_progress`. Every other ci-gate child (tests, race-tests,
eest-spec-tests, kurtosis, hive, lint, bench, repro, sonar, caplin) had
already completed.

The bottleneck was specifically the hive-eest matrix siblings.

## Fix

```yaml
strategy:
  fail-fast: ${{ github.event_name == 'merge_group' }}
```

- **In `merge_group`**: first failed shard immediately cancels all
siblings at the GitHub API layer — much faster than the `gh run cancel`
→ SIGTERM → runner-drain path. ci-gate's `needs` reach terminal state in
seconds, ci-gate fails, the broken PR is evicted.
- **In PR runs**: stays `false`, so authors still see the full failure
breakdown across every shard. No regression in PR feedback.

## What's left in place and why

The per-job `gh run cancel` step (test-hive-eest.yml lines 311-317)
stays. Two reasons:

- Matrix `fail-fast` only cancels siblings **within the same matrix** —
it doesn't cancel sibling reusable workflows. If a future failure
pattern leaks across workflows, `gh run cancel` still covers it.
- ci-gate.yml's root-cause annotator (line 188) keys off "the leaf that
ran `gh run cancel` successfully" to single out the true root cause
among collateral cancellations. Removing the step would silently regress
erigontech#21445's attribution.

## Scope choice

Only `test-hive-eest.yml` is changed. Other matrix-bearing reusable
workflows (`test-all-erigon.yml`, `test-all-erigon-race.yml`,
`test-eest-spec.yml`, `test-kurtosis-assertoor.yml`, `test-hive.yml`,
`test-bench.yml`) all use `fail-fast: false` too, but none of them were
the queue-blocking long pole in this incident. Keeping the patch
minimal; we can generalize if another workflow becomes the bottleneck.

## Tradeoff to be aware of

Queue runs will now show siblings as `cancelled` instead of `failed`
whenever any one shard fails. That's the correct tradeoff in
`merge_group` — the goal is fast eviction, not detailed diagnostics;
full per-shard breakdown remains available on the PR run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Giulio2002 Giulio2002 added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@AskAlexSharov AskAlexSharov added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@Giulio2002 Giulio2002 added this pull request to the merge queue May 29, 2026
@taratorio taratorio removed this pull request from the merge queue due to a manual request May 29, 2026
Copy link
Copy Markdown
Member

@taratorio taratorio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR fails in the merge queue because it introduces a regression

https://github.com/erigontech/erigon/actions/runs/26588728733/job/78341615734

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants