Data Loss During Load Testing with METADATA Enabled and Autoscale Flink #12738
Labels
data-loss, flink, priority:critical
Issue
While performing load testing with METADATA enabled, I encountered a data loss issue. The issue occurs when the job is deployed with Autoscale enabled. Specifically, if checkpointing fails for reasons such as TaskManager (TM) additions during scaling or heap memory issues, all data is discarded and no further data is processed after that failure.
Checkpointing failures lead to data loss: after a checkpoint fails due to lack of resources, new checkpoints are triggered, but no data is processed.
I tried to replicate this behavior on Hudi 1.0, and the same issue persists.
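For context on why stalled checkpoints surface as missing data: the Hudi Flink writer only commits an instant when a checkpoint completes, so repeated checkpoint failures mean buffered records are never committed to the table. A minimal sketch of the checkpoint settings involved, assuming a Flink SQL client session (the interval and tolerance values are illustrative, not from the original report):

```sql
-- Illustrative checkpoint settings; values are assumptions, not from the report.
-- Hudi's Flink writer commits data only on successful checkpoints, so these
-- control how often commits can happen and how many consecutive checkpoint
-- failures the job survives before it fails over.
SET 'execution.checkpointing.interval' = '60 s';
-- Tolerate a few failed checkpoints (e.g. during autoscale rescaling)
-- before the job is restarted:
SET 'execution.checkpointing.tolerable-failed-checkpoints' = '3';
```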
Hudi Properties
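(The properties themselves were not included in the report. A representative Flink SQL table definition matching the environment described, COPY_ON_WRITE on S3 with the metadata table enabled, might look like the following; the table name, schema, and path are hypothetical.)

```sql
-- Hypothetical table definition; only the WITH options are relevant to the issue.
CREATE TABLE hudi_load_test (
  id STRING PRIMARY KEY NOT ENFORCED,
  ts TIMESTAMP(3),
  payload STRING
) WITH (
  'connector' = 'hudi',
  'path' = 's3a://<bucket>/hudi_load_test',  -- hypothetical path
  'table.type' = 'COPY_ON_WRITE',            -- as in the environment description
  'metadata.enabled' = 'true'                -- METADATA table on, as in the report
);
```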
Steps to reproduce the behavior:
1. Deploy the Flink job with METADATA enabled and Autoscale turned on (see the sketches above for representative settings).
2. Run load testing against the job.
3. Induce a checkpoint failure through resource pressure (e.g., a TaskManager addition during scaling, or a heap memory issue).
4. Observe that subsequent checkpoints are triggered but no data is processed, and previously ingested data is lost.
Expected behavior
After checkpoint failure due to resource issues, the system should continue processing data once resources are available, without losing previously processed data.
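One setting worth checking against that expectation is the job's restart strategy, which determines whether Flink retries from the last successful checkpoint after a failure. A minimal sketch for Flink 1.18, again with illustrative values that are assumptions rather than the reporter's configuration:

```sql
-- Illustrative restart strategy for Flink 1.18; values are assumptions.
-- With a fixed-delay strategy, the job restarts from the last successful
-- checkpoint once resources are available, rather than staying stalled.
SET 'restart-strategy' = 'fixed-delay';
SET 'restart-strategy.fixed-delay.attempts' = '10';
SET 'restart-strategy.fixed-delay.delay' = '30 s';
```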
Environment Description
Hudi version : 1.0.0
Flink version: 1.18
Hive version : N/A
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : Yes
Table Type: COPY_ON_WRITE
Additional context
Can the Hudi team assist with troubleshooting this issue? Is this expected behavior with METADATA enabled, or is there a bug in the Flink writer under resource-constrained scenarios?
cc: @codope @bhasudha @danny0405 @xushiyan @ad1happy2go @yihua