-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP][Proposal] PARQUET-2430: Add parquet joiner v2 #1335
base: master
Are you sure you want to change the base?
Conversation
This PR is the outcome of simplification I mention in a comment here a couple of weeks ago: #1273 (comment) |
Thanks for your effort! I just took a quick glimpse and it does look simpler than the previous patch. My general question is that now the prerequisite for users to use the joiner is to run a pre-processing like https://gist.github.com/MaxNevermind/0feaaf380520ca34c2637027ef349a7d you've mentioned. The pre-processing also takes time and resource. Does it mean that we have to deal with unaligned blocks anyway if users do not want to pay for the pre-processing task? |
This new implementation requires blocks to be aligned yes. The gist snippet preparing it need to be updated btw, that one is for the previous version. I think this version strikes a good balance in terms of features vs implementation complexity, putting all the features as in previous version leads to a very complex implementation imo which I'm not sure is worth pursuing as it is mainly optimization for a pretty niche use-case, for regular use-cases you can just read write the whole thing, considering a niche use-case it is reasonable to assume for users to go extra mile and run that snippet and prepare the data in required way. |
Yes, I agree that we can start from the implementation with the assumption that row groups of files are aligned. One thing that I'm not sure is that some users may not be easy to generate files to join with same row group alignment and any way require the rewrite tool to handle this. |
This PR is a proposal and Work In Progress.
This is a simplified version of original PR: [WIP][Proposal] PARQUET-2430: Add parquet joiner
The simplified design:
inputFilesToJoin
instead ofList<List<>>
as in original PRinputFilesToJoin
is expected to have the same rowGroups ordering as ininputFiles
, number of files ininputFiles
andinputFilesToJoin
is not necessarily has be the same, but ordering of rowGroups and the rowCount of paired rowGroups must be the samejoinColumnsOverwrite
is used if theinputFilesToJoin
is expected to overwrite column ininputFiles
inputFiles
like pruning, nullification, binary copy, now should be available forinputFilesToJoin
too