Exec Layer node getLogs requests stop after EL restart #6048

Open
jakubgs opened this issue Mar 8, 2024 · 8 comments

@jakubgs
Member

jakubgs commented Mar 8, 2024

Describe the bug
While working on a Geth upgrade for a production beacon node running v24.2.0-742f15, weird behavior was discovered when the Geth node was restarted.

The getLogs requests from the beacon node stopped:

image

And the eth1_chain_len metric stopped rising.

image

When the BN was restarted the getLogs requests resumed. This behavior was reproduced when Geth was restarted again:

image

Additional context
Logs can be provided upon request from the fleet.

@etan-status
Contributor

This node is outdated and must be updated to Nimbus >= v24.2.1 before Deneb (the deadline is in 5 days).

@jakubgs
Member Author

jakubgs commented Mar 8, 2024

Good point. We just finished upgrading EL nodes, will do BN next.

@etan-status
Contributor

Searching the logs, "Failed to obtain the latest block from the EL" is only logged one time. After that, there are still periodic "syncEth1Chain tick" entries that start new eth_getBlockByNumber('latest', false) requests, but these subsequent requests never seem to finish, time out, or fail.

    debug "syncEth1Chain tick"

    if bnStatus == BeaconNodeStatus.Stopping:
      await m.stop()
      return

    if m.eth1Chain.hasConsensusViolation:
      raise newException(CorruptDataProvider, "Eth1 chain contradicts Eth2 consensus")

    let latestBlock = try:
      raiseIfNil connection.trackedRequestWithTimeout(
        "getBlockByNumber",
        rpcClient.eth_getBlockByNumber(blockId("latest"), false),
        web3RequestsTimeout)
    except CatchableError as err:
      warn "Failed to obtain the latest block from the EL", err = err.msg
      raise err
  • bnStatus == BeaconNodeStatus.Stopping cannot be the cause, because that status is only set on exit
  • m.eth1Chain.hasConsensusViolation cannot be the cause, because the locations that set it to true produce a unique log line
  • connection.trackedRequestWithTimeout is still hit repeatedly, as shown by the "Sending message to RPC server" log line immediately after each "syncEth1Chain tick" log line
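
For reference, the behavior one would expect from a timeout wrapper can be sketched as below. This is a language-agnostic illustration in Python asyncio, not the Nim/chronos code Nimbus uses; tracked_request_with_timeout, hung_request and sync_tick are hypothetical names chosen to mirror the excerpt above. The point is that every tick should end in either a result or an error, even when the reply never arrives:

```python
import asyncio


async def tracked_request_with_timeout(name, request, timeout_s):
    """Await `request`; the caller always gets a result or an exception."""
    try:
        # wait_for cancels the inner awaitable once the deadline passes.
        return await asyncio.wait_for(request, timeout=timeout_s)
    except asyncio.TimeoutError:
        raise TimeoutError(f"{name} timed out after {timeout_s}s") from None


async def hung_request():
    # Hypothetical stand-in for an RPC call whose reply never arrives,
    # e.g. because the peer restarted and the old connection is dead.
    await asyncio.Event().wait()


async def sync_tick():
    # One polling iteration: it must end as "ok" or "failed", never hang.
    try:
        await tracked_request_with_timeout(
            "getBlockByNumber", hung_request(), 0.05)
        return "ok"
    except TimeoutError as err:
        return f"failed: {err}"


if __name__ == "__main__":
    print(asyncio.run(sync_tick()))
```

If each tick behaved like this sketch, a warning would be logged on every tick while the EL is unreachable, rather than only once.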

Maybe something weird in, or related to, trackedRequestWithTimeout makes it get stuck:

@etan-status
Contributor

etan-status commented Mar 8, 2024

Do you have eth1_latest_head metric from that time as well? And also eth1_synced_head

@etan-status
Contributor

Extended the logs here as well to get a better understanding of what's going on:

@jakubgs
Member Author

jakubgs commented Mar 8, 2024

Do you have eth1_latest_head metric from that time as well?

Not on the dashboard but it can be generated:

image

And also eth1_synced_head

image

@yakimant
Member

Multi-EL is officially documented without saying it's unstable/beta:
https://nimbus.guide/eth1.html#running-multiple-execution-clients

@etan-status, @tersec, @jakubgs, maybe we can try running multi-EL again?

BTW, we are running multi-EL on nimbus.mainnet:
https://github.com/status-im/infra-nimbus/blob/42cfc2958a39cc6ec316f1f46d45189cf9ff995f/ansible/group_vars/nimbus.mainnet.yml#L81

@jakubgs
Member Author

jakubgs commented Aug 21, 2024

As far as I know multi-EL setups work fine on Holesky, but I didn't research it that closely. The question is more for Nimbus devs.
