[SUPPORT] FileNotFoundException while querying HUDI table via native Spark SQL with HMS as catalog #12477
Comments
@ahujaanmol1288 Are you using concurrency control, given that you are using multiple writers?
I confirmed after checking the code that you are not. Please refer to https://hudi.apache.org/docs/concurrency_control/
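For reference, a sketch of the multi-writer OCC settings described in the linked docs; the ZooKeeper endpoint, lock key, and base path below are placeholders, not values taken from this issue:

```python
# Illustrative only: optimistic concurrency control options from the Hudi
# concurrency control docs, to be merged with the writer's existing Hudi options.
# The ZooKeeper host/port/paths are placeholders for a real lock service.
occ_options = {
    'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
    'hoodie.cleaner.policy.failed.writes': 'LAZY',
    'hoodie.write.lock.provider': 'org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider',
    'hoodie.write.lock.zookeeper.url': '<zk_host>',           # placeholder
    'hoodie.write.lock.zookeeper.port': '2181',
    'hoodie.write.lock.zookeeper.lock_key': 'hudi_trips_cow1',
    'hoodie.write.lock.zookeeper.base_path': '/hudi/locks',   # placeholder
}
```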
@ad1happy2go Hey Aditya, the issue is not related to multiple writers; it is actually the read that fails. My assumption is that in Spark SQL we are unable to set the relevant Hudi config on the read.
You could set it as a Spark session config.
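A hedged example of the suggested pattern; the exact option is not named in this thread, so hoodie.datasource.query.type is used purely as an illustration:

```python
# Setting a Hudi read option at the session level instead of per-DataFrame.
# "hoodie.datasource.query.type" is only an example; substitute whichever
# option the read actually needs.
spark.conf.set("hoodie.datasource.query.type", "snapshot")

# or, from plain Spark SQL:
spark.sql("SET hoodie.datasource.query.type=snapshot")
```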
Generally speaking, Hudi guarantees snapshot isolation between writers and readers through its timeline and multi-version concurrency control. Hudi does not delete the last version of any data file unless the cleaner is configured that way (your configs suggest no change to the default cleaner configs). I would like to understand more about your use case, and also how the file is getting deleted. Are you using OSS Hudi or EMR Hudi? If it's the latter, did you also try with the 0.15.0 version of OSS Hudi? Could you zip the hoodie timeline and share it?

We have many production use cases with concurrent read and write scenarios and data freshness latency of just a few minutes. For example: https://aws.amazon.com/blogs/big-data/how-nerdwallet-uses-aws-and-apache-hudi-to-build-a-serverless-real-time-analytics-platform/

If it's just a single writer and multiple readers, Hudi employs MVCC by default. I will need to review the script shared above to understand further what's going on.
@ahujaanmol1288 @Wrekkers When I checked the code, I saw that we are writing from multiple threads in parallel without any concurrency control, so it is going to corrupt the table. So reads may have issues.
@ahujaanmol1288 As @codope mentioned, can you please share the hoodie timeline so we can confirm the issue?
@ad1happy2go Please find the hoodie timeline attached. Also, we are using OSS Hudi.
Describe the problem you faced
While reading a Hudi table via Spark SQL, the job fails with a java.io.FileNotFoundException. The error occurs when the underlying Hudi table is updated while the read is underway (Spark SQL read starts -> write operation finishes -> Spark SQL read fails), indicating that the write operation rewrote the underlying data files and deleted the earlier S3 file referenced by the Hoodie file index.
Spark config used:
"spark.serializer": "org.apache.spark.serializer.KryoSerializer"
"spark.jars.packages": "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1"
"spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
"hive.metastore.uris": "<metastore URI>"
Spark commands:
spark.sql("SELECT * FROM <hive_schema>.<hudi_table_name>")
To Reproduce
Steps to reproduce the behavior:
Configure the Hudi environment with the following settings:
Hudi version: 0.13.1
Spark version: 3.3
Hive version: 2.4
Storage: S3
Use the following Hudi write options:
'hoodie.table.name': 'hudi_trips_cow1'
'hoodie.datasource.write.recordkey.field': 'uuid'
'hoodie.datasource.write.partitionpath.field': 'partitionpath'
'hoodie.datasource.write.table.name': 'hudi_trips_cow1'
'hoodie.datasource.write.operation': 'upsert'
'hoodie.datasource.write.precombine.field': 'ts'
'hoodie.upsert.shuffle.parallelism': 2
'hoodie.insert.shuffle.parallelism': 2
'hoodie.datasource.hive_sync.enable': 'true'
'hoodie.datasource.hive_sync.table': 'hudi_trips_cow1'
Implement concurrent read and write operations using concurrent.futures.ThreadPoolExecutor (a simplified sketch of this structure is included after these steps).
Read operation: Perform a SQL query on the Hudi table and checkpoint the result.
Write operation: Generate and insert new records using the Hudi DataGenerator and write them to the same Hudi table.
Execute the code and observe the failure during the checkpointing step of the read operation.
Please refer to this script to reproduce the issue:
read_while_update.txt
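For readers who cannot open the attachment, a minimal sketch of the structure the script follows, assuming a SparkSession `spark` configured as in the Spark config section above; the bucket, schema, and sample record values are placeholders, not taken from the attachment:

```python
# Simplified sketch of read_while_update.txt: one thread queries the table via
# Spark SQL and checkpoints the result while another thread upserts new records
# with the write options listed above.
from concurrent.futures import ThreadPoolExecutor

hudi_options = {
    'hoodie.table.name': 'hudi_trips_cow1',
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': 'hudi_trips_cow1',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'hudi_trips_cow1',
}

spark.sparkContext.setCheckpointDir("s3://<bucket>/checkpoints")  # placeholder; required for checkpoint()

def read_and_checkpoint():
    # Fails with java.io.FileNotFoundException when a write completes mid-read.
    df = spark.sql("SELECT * FROM <hive_schema>.hudi_trips_cow1")  # placeholder schema
    df.checkpoint()  # materialization step where the error surfaces

def upsert_new_records():
    # The real script generates records with Hudi's DataGenerator; a trivial
    # one-row frame stands in here.
    df_new = spark.createDataFrame(
        [("id-1", "part-1", 1718000000, 1.0)],
        ["uuid", "partitionpath", "ts", "fare"],
    )
    (df_new.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://<bucket>/hudi_trips_cow1"))  # placeholder base path

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(read_and_checkpoint), pool.submit(upsert_new_records)]
    for f in futures:
        f.result()
```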
Expected behavior
The concurrent read and write operations should execute without any errors. The read operation should successfully checkpoint the results, and the write operation should upsert data to the Hudi table.
Environment Description
Hudi version : 0.13.1
Spark version : 3.3
Hive version : 2.4
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : Tested on AWS EMR (6.11.1)
Additional context
The issue might be related to file consistency in S3 during concurrent operations or checkpointing with a Hudi table on S3.
Stacktrace