Describe the problem you faced

We have jobs that read from a MOR table using pyspark pseudo-code along the lines of the sketch below (event_table_rt is the MOR table). We're running into a bottleneck on HoodieMergeOnReadRDD (https://github.com/apache/hudi/blob/release-0.14.2/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala#L37), where the number of tasks in the stage reading event_df appears to be non-configurable and (I think) equal to the number of files being read. This is causing massive disk/memory spill and bottlenecking performance.

Is it possible to configure the read parallelism to be higher, or is this a fundamental limitation of Hudi with MOR tables? What is the recommended way to tune resources for readers of MOR tables?

Environment Description

Hudi version : 0.14.1-amzn-1 (EMR 7.2.0)
Spark version : 3.5.1
Hive version : 3.1.3
Hadoop version : 3.3.6
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
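The original pseudo-code did not survive the copy; below is a minimal hypothetical sketch of the kind of snapshot read described above, with the S3 path, column name, and downstream aggregation as placeholders:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch -- path, column names, and the aggregation are placeholders.
spark = SparkSession.builder.appName("mor-reader").getOrCreate()

# Snapshot read of the MOR table: base parquet files merged with their log files.
event_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load("s3://my-bucket/warehouse/event_table_rt")
)

# The stage that materializes event_df runs on HoodieMergeOnReadRDD; its task
# count tracks the number of file groups being read.
event_df.groupBy("event_name").count().show()
```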
@mzheng-plaid The parallelism should be equal to the number of file groups, since one task reads one parquet file. I'd like to understand more: if the parquet files / log files are properly sized, why are you hitting a bottleneck at the task level?
So with plain parquet there is spark.sql.files.maxPartitionBytes, and if I'm understanding apache/iceberg#8922 (comment) correctly, Iceberg also supports a configurable split size. Does Hudi have any similar support for splitting reads of large files? As far as I can tell, spark.sql.files.maxPartitionBytes is ignored.
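For reference, a hedged sketch of the two knobs mentioned above; the values and the table identifier are illustrative, and the Iceberg property discussed in that thread is the read.split.target-size table property:

```python
# Illustrative only -- values and table identifier are placeholders.

# Plain parquet / Spark file sources: cap the bytes packed into a single task.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

# Iceberg: split size can be set as a table property (128 MB here).
spark.sql(
    "ALTER TABLE my_catalog.db.events "
    "SET TBLPROPERTIES ('read.split.target-size' = '134217728')"
)
```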
@mzheng-plaid I don't think HoodieMergeOnReadRDD has a way to split file groups further during a read. In any case it would be difficult for a snapshot read, since the log files have to be applied on top of the parquet records.
For comparison, reading the parquet files directly was much less memory-intensive and faster (i.e. no spilling to disk) once I tuned spark.sql.files.maxPartitionBytes. I understand this reads multiple versions of the file groups, but it's surprising how much worse read performance is through Hudi.
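A hypothetical sketch of the direct-parquet comparison described above; the path glob is a placeholder and, as noted, this does not give a consistent Hudi snapshot:

```python
# Placeholders throughout. This is NOT a consistent snapshot read -- it scans
# base parquet files directly, so it can see multiple versions of a file group.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

raw_df = spark.read.parquet("s3://my-bucket/warehouse/event_table_rt/*/*.parquet")
raw_df.count()
```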