
[GLUTEN-5471][VL]feat: Support read Hudi COW table #6049

Merged · 2 commits into apache:main · Aug 28, 2024
Conversation

@yma11 (Contributor) commented Jun 11, 2024

What changes were proposed in this pull request?

Support reading Hudi COW tables. This PR is updated based on a previous PR. Thanks @xushiyan for the contribution.

(Fixes: #5471)

How was this patch tested?

New unit tests added.

Run Gluten Clickhouse CI

@yma11 (Contributor, Author) commented Jun 12, 2024

@xushiyan please help review. cc @leesf.

@@ -30,13 +30,13 @@ trait DataSourceScanTransformerRegister {
 /**
  * The class name that is used to identify what kind of datasource this is.
  *
- * For DataSource V1, it should be the child class name of
- * [[org.apache.spark.sql.execution.datasources.FileIndex]].
+ * For DataSource V1, it should be relation.fileFormat like
yma11 (Contributor, Author):
@YannByron For org.apache.spark.sql.execution.datasources.FileIndex, it can be used to distinguish different datasources, but it is too general: all kinds of file reads will pass, such as metadata/log files used for query plan analysis. That is unnecessary and may trigger failures in some corner cases. So here we limit it to the Parquet format. Are you okay with this change?

Member:

Not sure about this change: why is it necessary to update the docs in this PR, which is only meant for Hudi support?

yma11 (Contributor, Author):
scanClassName is a flag that decides which data source transformer (delta, iceberg, hudi, etc.) should be triggered during registration. It used to be determined by the value of org.apache.spark.sql.execution.datasources.FileIndex; in this PR we change it to relation.fileFormat.
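As a minimal illustration of the dispatch described above (the trait name matches the one in the diff, but the registers and lookup helper here are hypothetical simplifications, not Gluten's actual API):

```scala
// Hypothetical sketch: each register advertises a scanClassName; the planner
// picks the transformer whose scanClassName appears in the scan's identifying
// string (formerly the FileIndex class name, now relation.fileFormat).
trait DataSourceScanTransformerRegister {
  val scanClassName: String
}

object HudiScanRegister extends DataSourceScanTransformerRegister {
  // Matched against the fileFormat class name of a Hudi relation
  override val scanClassName: String = "HoodieParquetFileFormat"
}

object DeltaScanRegister extends DataSourceScanTransformerRegister {
  override val scanClassName: String = "DeltaParquetFileFormat"
}

def lookupTransformer(
    fileFormatClassName: String,
    registers: Seq[DataSourceScanTransformerRegister])
    : Option[DataSourceScanTransformerRegister] =
  registers.find(r => fileFormatClassName.contains(r.scanClassName))
```

With relation.fileFormat as the key, only actual data-file scans of the expected format match, rather than every read that happens to go through a given FileIndex subclass.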

override lazy val fileFormat: ReadFileFormat = ReadFileFormat.ParquetReadFormat

override protected def doValidateInternal(): ValidationResult = {
if (requiredSchema.fields.exists(_.name.startsWith("_hoodie"))) {
Contributor:
Sorry, I don't fully understand the logic here. Why is a schema containing _hoodie fields not supported?

Member:

Same question. Can you add a doc comment to the method to clarify under what condition the validation should fail?

Member:
From the discussion on the previous PR, we can support:

  • COW tables, regardless of populateMetaField being true or false
  • MOR tables for read-optimized queries

We can start by doing just COW support in this PR, and add other support, like MOR read-optimized queries and ORC, in later iterations.
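The support matrix above can be sketched as a small decision function (a hypothetical model for illustration, not Gluten code; per the comment, only the COW path lands in this PR):

```scala
// Hudi table types and query types from the discussion above
sealed trait TableType
case object CopyOnWrite extends TableType
case object MergeOnRead extends TableType

sealed trait QueryType
case object Snapshot extends QueryType
case object ReadOptimized extends QueryType

// COW can be offloaded regardless of query type (and regardless of
// populateMetaField); MOR only for read-optimized queries, planned
// for a later iteration.
def canOffload(table: TableType, query: QueryType): Boolean =
  (table, query) match {
    case (CopyOnWrite, _)             => true
    case (MergeOnRead, ReadOptimized) => true
    case _                            => false
  }
```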

yma11 (Contributor, Author) commented Jul 23, 2024:
We exclude these fields for safety. As I tested, fields like _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name work fine, but I am not sure whether other such fields are okay; we used to have problems with this in other formats. I'm okay with removing the check if you can confirm there is no problem with the other fields.
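A pure-Scala approximation of the guard under discussion (ValidationResult here is a simplified stand-in for Gluten's class, and the field list merely records what the comment above says was tested):

```scala
// Simplified stand-in for Gluten's ValidationResult (illustrative only)
case class ValidationResult(validated: Boolean, reason: Option[String])

// Hudi meta fields verified to work in testing, per the comment above
val testedMetaFields: Set[String] = Set(
  "_hoodie_commit_time",
  "_hoodie_commit_seqno",
  "_hoodie_record_key",
  "_hoodie_partition_path",
  "_hoodie_file_name")

// Conservative rule: if any requested field carries the "_hoodie" prefix,
// fail validation so the scan falls back to vanilla Spark.
def validateRequiredSchema(fieldNames: Seq[String]): ValidationResult =
  if (fieldNames.exists(_.startsWith("_hoodie")))
    ValidationResult(validated = false, Some("Hudi meta fields are excluded for safety"))
  else
    ValidationResult(validated = true, None)
```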

yma11 (Contributor, Author):
@xushiyan any more questions?

@leesf (Contributor) commented Jun 17, 2024

@yma11 Thanks for the PR. Have we also tested MOR tables with read-optimized queries, or do we only support COW? Please refer to https://hudi.apache.org/docs/next/table_types/

@xushiyan (Member) left a review comment:

@yma11 can you please address the comments? TY


pom.xml (outdated)
@@ -155,7 +156,8 @@
 <iceberg.version>1.3.1</iceberg.version>
 <delta.package.name>delta-core</delta.package.name>
 <delta.version>2.3.0</delta.version>
 <delta.binary.version>23</delta.binary.version>
+<hudi.version>0.14.1</hudi.version>
Member:

We can update this to 0.15.0.

disableBucketedScan
) {

override lazy val fileFormat: ReadFileFormat = ReadFileFormat.ParquetReadFormat
Member:

Will it be easy to add ORC support too?

yma11 (Contributor, Author):
Not quite. As far as I know, Velox doesn't support ORC yet.



github-actions bot commented Aug 1, 2024

Run Gluten Clickhouse CI

@Amar1404 commented:

Hi @yma11 @xushiyan - any plan for pushing this to the master branch for production use cases?

@xushiyan (Member) replied:

@Amar1404 yes, we should land this soon. cc @leesf @YannByron please also give input on the PR when you have a chance.

@github-actions github-actions bot added the DOCS label Aug 21, 2024

@leesf (Contributor) left a review comment:

LGTM


@yma11 yma11 merged commit f545929 into apache:main Aug 28, 2024
43 checks passed
Labels: CORE (works for Gluten Core), DOCS, INFRA
Projects: none yet
Development

Successfully merging this pull request may close these issues.

[VL] Support read Hudi COW
5 participants