Data Loss During Load Testing with METADATA Enabled and Autoscale Flink #12738

Open
maheshguptags opened this issue Jan 30, 2025 · 6 comments
Labels
data-loss (loss of data only; use data-consistency label for inconsistent view) · flink (issues related to Flink) · priority:critical (production down; pipelines stalled; need help asap)

Comments

@maheshguptags

maheshguptags commented Jan 30, 2025

Issue

While performing load testing with METADATA enabled, I encountered a data loss issue. The issue occurs when deploying the job with autoscaling enabled. Specifically, if checkpointing fails for reasons such as TaskManager (TM) additions during autoscaling or memory/heap issues, all data is discarded, and no further data is processed after that failure.

- Checkpointing failures lead to data loss.
- After a checkpoint fails due to lack of resources, a new checkpoint is triggered but no data is processed.
- I tried to replicate this behavior on Hudi 1.0, and the same issue persists.

Hudi Properties

```properties
#Updated at 2025-01-20T07:41:05.654545Z
#Mon Jan 20 07:41:05 UTC 2025
hoodie.table.keygenerator.type=COMPLEX_AVRO
hoodie.table.type=COPY_ON_WRITE
hoodie.table.precombine.field=updated_date
hoodie.table.create.schema={}
hoodie.timeline.layout.version=2
hoodie.timeline.history.path=history
hoodie.table.checksum=1292384652
hoodie.datasource.write.drop.partition.columns=false
hoodie.record.merge.strategy.id=00000000-0000-0000-0000-000000000000
hoodie.datasource.write.hive_style_partitioning=false
hoodie.table.metadata.partitions.inflight=
hoodie.database.name=default_database
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.record.merge.mode=CUSTOM
hoodie.table.version=8
hoodie.compaction.payload.class=com.gupshup.cdp.PartialUpdate
hoodie.table.initial.version=8
hoodie.table.metadata.partitions=files
hoodie.table.partition.fields=xyz
hoodie.table.cdc.enabled=false
hoodie.archivelog.folder=history
hoodie.table.name=customer_temp
hoodie.table.recordkey.fields=xyz.abc
hoodie.timeline.path=timeline
```

Steps to reproduce the behavior:

  1. Create a table with Flink and Hudi, with the metadata table (MDT) enabled
  2. Ingest some load
  3. Delete one of the TMs, or ingest a heavy load so that it causes a memory issue
  4. Once a checkpoint fails, all data after that checkpoint is discarded
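For context, a minimal Flink SQL sketch of step 1 (the schema, key/partition columns, and S3 path below are illustrative placeholders, not taken from the issue; `metadata.enabled` is the Hudi Flink option that turns on the MDT):

```sql
-- Hypothetical schema; adjust columns, keys, and path to your table
CREATE TABLE customer_temp (
  record_key   STRING,
  partition_col STRING,
  payload      STRING,
  updated_date TIMESTAMP(3),
  PRIMARY KEY (record_key) NOT ENFORCED
) PARTITIONED BY (partition_col) WITH (
  'connector' = 'hudi',
  'path' = 's3a://your-bucket/customer_temp',   -- placeholder path
  'table.type' = 'COPY_ON_WRITE',
  'precombine.field' = 'updated_date',
  'metadata.enabled' = 'true'                    -- enables the MDT
);
```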

Expected behavior

After checkpoint failure due to resource issues, the system should continue processing data once resources are available, without losing previously processed data.
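As background on the expected Flink behavior, a job can be configured to tolerate a number of consecutive checkpoint failures and to restart from the last successful checkpoint rather than stall; the values below are an illustrative sketch, not the reporter's actual configuration:

```yaml
# flink-conf.yaml (illustrative values)
execution.checkpointing.interval: 60s
execution.checkpointing.tolerable-failed-checkpoints: 3
# Restart from the last successful checkpoint on failure
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 5
restart-strategy.fixed-delay.delay: 30s
```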

Environment Description

  • Hudi version: 1.0.0
  • Flink version: 1.18
  • Hive version: N/A
  • Hadoop version:
  • Storage (HDFS/S3/GCS..): S3
  • Running on Docker? (yes/no): Yes
  • Table Type: COPY_ON_WRITE

Additional context

Can the Hudi team assist with troubleshooting this issue? Is this expected behavior with METADATA enabled, or is there a bug in the Flink writer under resource-constrained scenarios?

(screenshot attached)

cc: @codope @bhasudha @danny0405 @xushiyan @ad1happy2go @yihua

@danny0405
Contributor

Does the job work well without auto-scale? What is the state of the pipeline after the checkpoint fails; does the writer still handle inputs?

@maheshguptags
Author

Yes, the job works both with and without auto-scale if we don't enable the MDT.

@danny0405
Contributor

Are there any special logs in the JM (JobManager) logging?

@maheshguptags
Author

I haven't seen any special logs for this. Usually the checkpoint fails either when autoscaling spins up a new TM or when I kill a TM manually, and the data is discarded after that.

Thanks
Mahesh

@danny0405
Contributor

> it discards the data post that.

Are you saying the pipeline just hangs up there and does nothing?

@maheshguptags
Author

maheshguptags commented Feb 5, 2025

No, it simply moves on to the next checkpoint and processes nothing. If you look at the 4th checkpoint, it is processing millions of records (in progress, not yet completed) and taking approximately 4 minutes. However, once it fails, the job moves to the next checkpoint, which processes nothing and completes in milliseconds (the same applies to checkpoints 2 and 3).

Thanks
Mahesh

(screenshot attached)

@ad1happy2go ad1happy2go added priority:critical production down; pipelines stalled; Need help asap. flink Issues related to flink data-loss loss of data only, use data-consistency label for inconsistent view labels Feb 7, 2025
@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Feb 7, 2025