Data Loss During Load Testing with METADATA Enabled and Autoscale Flink #12738

Open
maheshguptags opened this issue Jan 30, 2025 · 6 comments
Labels
data-loss (loss of data only; use data-consistency label for inconsistent view) · flink (issues related to Flink) · priority:critical (production down; pipelines stalled; need help asap)

Comments

@maheshguptags

maheshguptags commented Jan 30, 2025

Issue

While performing load testing with METADATA enabled, I encountered a data loss issue. The issue occurs when deploying the job with autoscaling enabled. Specifically, if checkpointing fails for reasons such as TaskManager (TM) additions during autoscaling or memory/heap issues, all data is discarded, and no further data is processed after that failure.

- Checkpointing failures lead to data loss.
- After a checkpoint fails due to lack of resources, a new checkpoint is triggered but no data is processed.
- I tried to replicate this behavior on Hudi 1.0, and the same issue persists.

Hudi Properties

```properties
#Updated at 2025-01-20T07:41:05.654545Z
#Mon Jan 20 07:41:05 UTC 2025
hoodie.table.keygenerator.type=COMPLEX_AVRO
hoodie.table.type=COPY_ON_WRITE
hoodie.table.precombine.field=updated_date
hoodie.table.create.schema={}
hoodie.timeline.layout.version=2
hoodie.timeline.history.path=history
hoodie.table.checksum=1292384652
hoodie.datasource.write.drop.partition.columns=false
hoodie.record.merge.strategy.id=00000000-0000-0000-0000-000000000000
hoodie.datasource.write.hive_style_partitioning=false
hoodie.table.metadata.partitions.inflight=
hoodie.database.name=default_database
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.record.merge.mode=CUSTOM
hoodie.table.version=8
hoodie.compaction.payload.class=com.gupshup.cdp.PartialUpdate
hoodie.table.initial.version=8
hoodie.table.metadata.partitions=files
hoodie.table.partition.fields=xyz
hoodie.table.cdc.enabled=false
hoodie.archivelog.folder=history
hoodie.table.name=customer_temp
hoodie.table.recordkey.fields=xyz.abc
hoodie.timeline.path=timeline
```

Steps to reproduce the behavior:

  1. Create a table with Flink and Hudi, with the metadata table (MDT) enabled
  2. Ingest some load
  3. Delete one of the TMs, or ingest a heavy load so that it causes a memory issue
  4. Once a checkpoint fails, all data after that checkpoint is discarded
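For context, a minimal Flink SQL sketch of step 1 (the schema, key/partition columns, and S3 path below are illustrative placeholders, not taken from the issue; `metadata.enabled` is the Hudi Flink option that turns on the MDT):

```sql
-- Hypothetical schema; adjust columns, keys, and path to your table
CREATE TABLE customer_temp (
  record_key   STRING,
  partition_col STRING,
  payload      STRING,
  updated_date TIMESTAMP(3),
  PRIMARY KEY (record_key) NOT ENFORCED
) PARTITIONED BY (partition_col) WITH (
  'connector' = 'hudi',
  'path' = 's3a://your-bucket/customer_temp',   -- placeholder path
  'table.type' = 'COPY_ON_WRITE',
  'precombine.field' = 'updated_date',
  'metadata.enabled' = 'true'                    -- enables the MDT
);
```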

Expected behavior

After checkpoint failure due to resource issues, the system should continue processing data once resources are available, without losing previously processed data.
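As background on the expected Flink behavior, a job can be configured to tolerate a number of consecutive checkpoint failures and to restart from the last successful checkpoint rather than stall; the values below are an illustrative sketch, not the reporter's actual configuration:

```yaml
# flink-conf.yaml (illustrative values)
execution.checkpointing.interval: 60s
execution.checkpointing.tolerable-failed-checkpoints: 3
# Restart from the last successful checkpoint on failure
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 5
restart-strategy.fixed-delay.delay: 30s
```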

Environment Description

  • Hudi version: 1.0.0
  • Flink version: 1.18
  • Hive version: N/A
  • Hadoop version:
  • Storage (HDFS/S3/GCS..): S3
  • Running on Docker? (yes/no): Yes
  • Table Type: COPY_ON_WRITE

Additional context

Can the Hudi team assist with troubleshooting this issue? Is this expected behavior with METADATA enabled, or is there a bug in the Flink writer under resource-constrained scenarios?

(screenshot attached)

cc: @codope @bhasudha @danny0405 @xushiyan @ad1happy2go @yihua

@danny0405
Contributor

Does the job work well without auto-scale? What is the state of the pipeline after the checkpoint fails; does the writer still handle inputs?

@maheshguptags
Author

Yes, the job works both with and without auto-scale if we don't enable the MDT.

@danny0405
Contributor

Are there any special logs in the JM (JobManager) logging?

@maheshguptags
Author

I haven't seen any special logs for this. Usually the checkpoint fails either when autoscaling spins up a new TM or when I kill a TM manually, and the data is discarded after that.

Thanks
Mahesh

@danny0405
Contributor

> it discards the data post that.

Are you saying the pipeline just hangs up there and does nothing?

@maheshguptags
Author

maheshguptags commented Feb 5, 2025

No, it simply moves on to the next checkpoint and processes nothing. If you look at the 4th checkpoint, it is processing millions of records (in progress, not yet completed) and taking approximately 4 minutes. However, once it fails, the job moves to the next checkpoint, which processes nothing and completes in milliseconds (the same applies to checkpoints 2 and 3).

Thanks
Mahesh

(screenshot attached)

@ad1happy2go ad1happy2go added priority:critical production down; pipelines stalled; Need help asap. flink Issues related to flink data-loss loss of data only, use data-consistency label for inconsistent view labels Feb 7, 2025
@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Feb 7, 2025