[SUPPORT] Should we support fail-fast when heartbeat has been expired? #12522

Open
TheR1sing3un opened this issue Dec 19, 2024 · 2 comments
Labels
writer-core Issues relating to core transactions/write actions

Comments

@TheR1sing3un
Member

Consider the following case: heartbeatIntervalInMs = 60 * 1000 and numTolerableHeartbeatMisses = 10, so maxAllowableHeartbeatIntervalInMs = 600 * 1000.

  • 00:00, the write application starts
  • 00:01, the 1st heartbeat is sent successfully
  • 00:02, sending the heartbeat fails because of an HDFS network problem or some other network issue
  • 00:03-00:10, every heartbeat attempt fails
  • 00:11, the heartbeat has expired because currentTime[00:11] - lastHeartbeatTime[00:01] >= maxAllowableHeartbeatIntervalInMs; according to the code logic, lastHeartbeatTime will never be updated again
  • 10:00, the write application has been running for 10h to execute all of its logic
  • 10:00, the write application starts to commit via BaseHoodieWriteClient::commitStats, finds that the heartbeat has expired, and fails the application by throwing an exception
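
The expiry arithmetic in this timeline can be checked with a minimal sketch. The class and the `isExpired` helper below are hypothetical names for illustration only, not Hudi's actual heartbeat client; only the three configuration values and the comparison come from the scenario above.

```java
import java.time.Duration;

public class HeartbeatExpiryExample {
    // Values taken from the scenario in this issue.
    static final long HEARTBEAT_INTERVAL_MS = 60 * 1000L;
    static final int NUM_TOLERABLE_HEARTBEAT_MISSES = 10;
    static final long MAX_ALLOWABLE_INTERVAL_MS =
        HEARTBEAT_INTERVAL_MS * NUM_TOLERABLE_HEARTBEAT_MISSES; // 600 * 1000

    // Hypothetical helper that mirrors the condition quoted in the timeline:
    // currentTime - lastHeartbeatTime >= maxAllowableHeartbeatIntervalInMs
    static boolean isExpired(long lastHeartbeatTimeMs, long currentTimeMs) {
        return currentTimeMs - lastHeartbeatTimeMs >= MAX_ALLOWABLE_INTERVAL_MS;
    }

    public static void main(String[] args) {
        long lastSuccess = Duration.ofMinutes(1).toMillis(); // 00:01, last successful heartbeat

        // 00:10, only 9 minutes since the last success
        System.out.println(isExpired(lastSuccess, Duration.ofMinutes(10).toMillis())); // false
        // 00:11, 10 minutes since the last success
        System.out.println(isExpired(lastSuccess, Duration.ofMinutes(11).toMillis())); // true
        // 10:00, commit time; the outcome was already decided at 00:11
        System.out.println(isExpired(lastSuccess, Duration.ofHours(10).toMillis()));   // true
    }
}
```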

So we spent 10 hours running an app that we knew at 00:11 was not going to be successful.
Should we support fail-fast to save some unnecessary resource consumption?
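
One way the fail-fast idea could look is sketched below. Everything here is an assumption for illustration: `FailFastHeartbeatSketch`, the injected clock, and the `onExpired` callback are hypothetical and are not part of Hudi. The sketch only shows that the expiry can be detected on the heartbeat tick itself (00:11 in the scenario) instead of at commit time.

```java
import java.util.function.LongSupplier;

public class FailFastHeartbeatSketch {
    private final long maxAllowableIntervalMs;
    private final LongSupplier clock;  // injected so the sketch can run without sleeping
    private final Runnable onExpired;  // hypothetical hook, e.g. abort the write job
    private long lastHeartbeatTimeMs;
    private boolean expired;

    FailFastHeartbeatSketch(long maxAllowableIntervalMs, LongSupplier clock, Runnable onExpired) {
        this.maxAllowableIntervalMs = maxAllowableIntervalMs;
        this.clock = clock;
        this.onExpired = onExpired;
        this.lastHeartbeatTimeMs = clock.getAsLong();
    }

    // Called on every heartbeat tick; sendOk says whether writing the heartbeat succeeded.
    void onHeartbeatAttempt(boolean sendOk) {
        if (expired) {
            return;
        }
        long now = clock.getAsLong();
        if (now - lastHeartbeatTimeMs >= maxAllowableIntervalMs) {
            expired = true;
            onExpired.run(); // fail fast here instead of waiting for the commit
            return;
        }
        if (sendOk) {
            lastHeartbeatTimeMs = now;
        }
    }

    // Replays the timeline from this issue, one tick per minute, and returns
    // the minute on which the callback fired (-1 if it never fired).
    static long simulateExpiryMinute() {
        long[] nowMs = {0L};
        long[] expiredAtMinute = {-1L};
        FailFastHeartbeatSketch heartbeat = new FailFastHeartbeatSketch(
            600_000L, () -> nowMs[0], () -> expiredAtMinute[0] = nowMs[0] / 60_000L);
        for (long minute = 1; minute <= 600 && expiredAtMinute[0] < 0; minute++) {
            nowMs[0] = minute * 60_000L;
            heartbeat.onHeartbeatAttempt(minute == 1); // only the 00:01 heartbeat succeeds
        }
        return expiredAtMinute[0];
    }

    public static void main(String[] args) {
        System.out.println("heartbeat expired at minute " + simulateExpiryMinute());
    }
}
```

In this replay the callback fires at minute 11, so the remaining hours of work would not be started. What the callback should actually do (cancel the job, throw from the writer, and so on) is left open.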

@danny0405
Contributor

The heartbeat is mainly designed for rollback when there are multiple writers.

> Should we support fail-fast to save some unnecessary resource consumption?

Another choice is we always try to update the heartbeat no matter whether it's been expired or not.
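
A rough sketch of this alternative, under stated assumptions: the class and method names are hypothetical, and the only point illustrated is that the last-heartbeat timestamp is refreshed on every successful send, with no guard that skips the update once the heartbeat has expired. It does not model what other writers might have done while the heartbeat was considered expired.

```java
public class AlwaysUpdateHeartbeatSketch {
    private final long maxAllowableIntervalMs;
    private long lastHeartbeatTimeMs;

    AlwaysUpdateHeartbeatSketch(long maxAllowableIntervalMs, long startTimeMs) {
        this.maxAllowableIntervalMs = maxAllowableIntervalMs;
        this.lastHeartbeatTimeMs = startTimeMs;
    }

    // No expiry guard: any successful send refreshes the timestamp.
    void onHeartbeatAttempt(long nowMs, boolean sendOk) {
        if (sendOk) {
            lastHeartbeatTimeMs = nowMs;
        }
    }

    boolean isExpired(long nowMs) {
        return nowMs - lastHeartbeatTimeMs >= maxAllowableIntervalMs;
    }

    public static void main(String[] args) {
        AlwaysUpdateHeartbeatSketch heartbeat = new AlwaysUpdateHeartbeatSketch(600_000L, 0L);
        for (long minute = 1; minute <= 12; minute++) {
            long nowMs = minute * 60_000L;
            // Heartbeat succeeds at 00:01, fails from 00:02 to 00:11,
            // and succeeds again at 00:12 (assumed network recovery).
            boolean sendOk = (minute == 1 || minute == 12);
            heartbeat.onHeartbeatAttempt(nowMs, sendOk);
            if (minute >= 11) {
                System.out.println("minute " + minute + " expired=" + heartbeat.isExpired(nowMs));
            }
        }
        // prints:
        // minute 11 expired=true
        // minute 12 expired=false
    }
}
```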

@TheR1sing3un
Member Author

TheR1sing3un commented Dec 20, 2024

> Another choice is we always try to update the heartbeat no matter whether it's been expired or not.

Agree

@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Dec 20, 2024
@ad1happy2go ad1happy2go added the writer-core Issues relating to core transactions/write actions label Dec 20, 2024