
[GLUTEN-5471][VL]feat: Support read Hudi COW table #6049

Merged · 2 commits into apache:main · Aug 28, 2024
Conversation

@yma11 (Contributor) commented Jun 11, 2024

What changes were proposed in this pull request?

Support reading Hudi COW tables. This PR is updated based on a previous PR. Thanks @xushiyan for the contribution.

(Fixes: #5471)

How was this patch tested?

New unit tests added.

Run Gluten Clickhouse CI

@yma11 (Contributor, Author) commented Jun 12, 2024

@xushiyan please help review. cc @leesf.

@@ -30,13 +30,13 @@ trait DataSourceScanTransformerRegister {
 /**
  * The class name that is used to identify what kind of datasource this is.
  *
- * For DataSource V1, it should be the child class name of
- * [[org.apache.spark.sql.execution.datasources.FileIndex]].
+ * For DataSource V1, it should be relation.fileFormat like
yma11 (Contributor, Author):
@YannByron For org.apache.spark.sql.execution.datasources.FileIndex, it can be used to distinguish different datasources, but it is too general: all kinds of file reads will pass, such as metadata/log files used for query plan analysis. That is unnecessary and may trigger failures in some corner cases. So here we limit it to the Parquet format. Are you okay with this change?

Member:

Not sure about this change: why is it necessary to update the docs in this PR, which is only meant for Hudi support?

yma11 (Contributor, Author):
scanClassName is a flag that decides which data source transformer (delta, iceberg, hudi, etc.) should be triggered during registration. It used to be determined by the value of org.apache.spark.sql.execution.datasources.FileIndex; in this PR we change it to relation.fileFormat.
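As a minimal illustration of the dispatch described above (the trait name matches the one in the diff, but the registers and lookup helper here are hypothetical simplifications, not Gluten's actual API):

```scala
// Hypothetical sketch: each register advertises a scanClassName; the planner
// picks the transformer whose scanClassName appears in the scan's identifying
// string (formerly the FileIndex class name, now relation.fileFormat).
trait DataSourceScanTransformerRegister {
  val scanClassName: String
}

object HudiScanRegister extends DataSourceScanTransformerRegister {
  // Matched against the fileFormat class name of a Hudi relation
  override val scanClassName: String = "HoodieParquetFileFormat"
}

object DeltaScanRegister extends DataSourceScanTransformerRegister {
  override val scanClassName: String = "DeltaParquetFileFormat"
}

def lookupTransformer(
    fileFormatClassName: String,
    registers: Seq[DataSourceScanTransformerRegister])
    : Option[DataSourceScanTransformerRegister] =
  registers.find(r => fileFormatClassName.contains(r.scanClassName))
```

With relation.fileFormat as the key, only actual data-file scans of the expected format match, rather than every read that happens to go through a given FileIndex subclass.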

override lazy val fileFormat: ReadFileFormat = ReadFileFormat.ParquetReadFormat

override protected def doValidateInternal(): ValidationResult = {
if (requiredSchema.fields.exists(_.name.startsWith("_hoodie"))) {
Contributor:
Sorry, I don't fully understand the logic here. Why is a schema containing _hoodie fields not supported?

Member:

Same question. Can you add a doc comment to the method to clarify under what condition the validation should fail?

Member:
From the discussion on the previous PR, we can support:

  • COW tables, regardless of populateMetaField being true or false
  • MOR tables for read-optimized queries

We can start by doing just COW support in this PR, and add other support, like MOR read-optimized queries and ORC, in later iterations.
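The support matrix above can be sketched as a small decision function (a hypothetical model for illustration, not Gluten code; per the comment, only the COW path lands in this PR):

```scala
// Hudi table types and query types from the discussion above
sealed trait TableType
case object CopyOnWrite extends TableType
case object MergeOnRead extends TableType

sealed trait QueryType
case object Snapshot extends QueryType
case object ReadOptimized extends QueryType

// COW can be offloaded regardless of query type (and regardless of
// populateMetaField); MOR only for read-optimized queries, planned
// for a later iteration.
def canOffload(table: TableType, query: QueryType): Boolean =
  (table, query) match {
    case (CopyOnWrite, _)             => true
    case (MergeOnRead, ReadOptimized) => true
    case _                            => false
  }
```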

yma11 (Contributor, Author) commented Jul 23, 2024:
We exclude these fields for safety. As I tested, fields like _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name work fine, but I am not sure whether other such fields are okay; we used to have problems with this in other formats. I'm okay with removing the check if you can confirm there is no problem with the other fields.
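A pure-Scala approximation of the guard under discussion (ValidationResult here is a simplified stand-in for Gluten's class, and the field list merely records what the comment above says was tested):

```scala
// Simplified stand-in for Gluten's ValidationResult (illustrative only)
case class ValidationResult(validated: Boolean, reason: Option[String])

// Hudi meta fields verified to work in testing, per the comment above
val testedMetaFields: Set[String] = Set(
  "_hoodie_commit_time",
  "_hoodie_commit_seqno",
  "_hoodie_record_key",
  "_hoodie_partition_path",
  "_hoodie_file_name")

// Conservative rule: if any requested field carries the "_hoodie" prefix,
// fail validation so the scan falls back to vanilla Spark.
def validateRequiredSchema(fieldNames: Seq[String]): ValidationResult =
  if (fieldNames.exists(_.startsWith("_hoodie")))
    ValidationResult(validated = false, Some("Hudi meta fields are excluded for safety"))
  else
    ValidationResult(validated = true, None)
```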

yma11 (Contributor, Author):
@xushiyan any more questions?

@leesf (Contributor) commented Jun 17, 2024

@yma11 Thanks for the PR. Have we also tested MOR tables with read-optimized queries, or do we only support COW? Please refer to https://hudi.apache.org/docs/next/table_types/

@xushiyan (Member) left a review comment:

@yma11 can you please address the comments? TY


pom.xml (outdated)
@@ -155,7 +156,8 @@
 <iceberg.version>1.3.1</iceberg.version>
 <delta.package.name>delta-core</delta.package.name>
 <delta.version>2.3.0</delta.version>
 <delta.binary.version>23</delta.binary.version>
+<hudi.version>0.14.1</hudi.version>
Member:

We can update this to 0.15.0.

disableBucketedScan
) {

override lazy val fileFormat: ReadFileFormat = ReadFileFormat.ParquetReadFormat
Member:

Will it be easy to add ORC support too?

yma11 (Contributor, Author):
Not quite. As far as I know, Velox doesn't support ORC yet.



github-actions bot commented Aug 1, 2024

Run Gluten Clickhouse CI

@Amar1404 commented:

Hi @yma11 @xushiyan - any plan for pushing this to the master branch for production use cases?

@xushiyan (Member) replied:

@Amar1404 yes, we should land this soon. cc @leesf @YannByron please also give input on the PR when you have a chance.

@github-actions github-actions bot added the DOCS label Aug 21, 2024

@leesf (Contributor) left a review comment:

LGTM


@yma11 yma11 merged commit f545929 into apache:main Aug 28, 2024
43 checks passed
Labels: CORE (works for Gluten Core), DOCS, INFRA
Projects: none yet
Development

Successfully merging this pull request may close these issues.

[VL] Support read Hudi COW
5 participants