Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][Proposal] PARQUET-2430: Add parquet joiner v2 #1335

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

MaxNevermind
Copy link

@MaxNevermind MaxNevermind commented Apr 28, 2024

This PR is a proposal and Work In Progress.

This is a simplified version of original PR: [WIP][Proposal] PARQUET-2430: Add parquet joiner

The simplified design:

  • has only one list of inputFilesToJoin instead of List<List<>> as in original PR
  • inputFilesToJoin is expected to have the same rowGroups ordering as in inputFiles, number of files in inputFiles and inputFilesToJoin is not necessarily has be the same, but ordering of rowGroups and the rowCount of paired rowGroups must be the same
  • joinColumnsOverwrite is used if the inputFilesToJoin is expected to overwrite column in inputFiles
  • all the capabilities that available for inputFiles like pruning, nullification, binary copy, now should be available for inputFilesToJoin too

maxim_konstantinov added 29 commits January 28, 2024 14:22
@MaxNevermind
Copy link
Author

MaxNevermind commented Apr 28, 2024

@wgtmac @ConeyLiu

This PR is the outcome of simplification I mention in a comment here a couple of weeks ago: #1273 (comment)
I’ve limited the set of capabilities, see this PR description.
I’ve tired different ideas and it all come out as having too complex of implementation, so I decided to finalize at least something with as simple implementation as possible.
PR is not yet polished. Just wanted to do a quick overview of the new approach. If it looks good, I will polish it.

@wgtmac
Copy link
Member

wgtmac commented Apr 29, 2024

Thanks for your effort! I just took a quick glimpse and it does look simpler than the previous patch.

My general question is that now the prerequisite for users to use the joiner is to run a pre-processing like https://gist.github.com/MaxNevermind/0feaaf380520ca34c2637027ef349a7d you've mentioned. The pre-processing also takes time and resource. Does it mean that we have to deal with unaligned blocks anyway if users do not want to pay for the pre-processing task?

@MaxNevermind
Copy link
Author

Does it mean that we have to deal with unaligned blocks anyway if users do not want to pay for the pre-processing task?

This new implementation requires blocks to be aligned yes. The gist snippet preparing it need to be updated btw, that one is for the previous version.

I think this version strikes a good balance in terms of features vs implementation complexity, putting all the features as in previous version leads to a very complex implementation imo which I'm not sure is worth pursuing as it is mainly optimization for a pretty niche use-case, for regular use-cases you can just read write the whole thing, considering a niche use-case it is reasonable to assume for users to go extra mile and run that snippet and prepare the data in required way.

@wgtmac
Copy link
Member

wgtmac commented May 3, 2024

Yes, I agree that we can start from the implementation with the assumption that row groups of files are aligned. One thing that I'm not sure is that some users may not be easy to generate files to join with same row group alignment and any way require the rewrite tool to handle this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants