[WIP][Proposal] PARQUET-2430: Add parquet joiner v2 #1335

MaxNevermind · 2024-04-28T22:36:42Z

This PR is a proposal and Work In Progress.

This is a simplified version of the original PR: [WIP][Proposal] PARQUET-2430: Add parquet joiner

The simplified design:

has only one list, inputFilesToJoin, instead of the List<List<>> used in the original PR
inputFilesToJoin is expected to have the same rowGroups ordering as inputFiles; the number of files in inputFiles and inputFilesToJoin does not have to be the same, but the ordering of rowGroups and the rowCount of paired rowGroups must be the same
joinColumnsOverwrite is used if inputFilesToJoin is expected to overwrite columns in inputFiles
all the capabilities that are available for inputFiles, such as pruning, nullification, and binary copy, should now be available for inputFilesToJoin too
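
To make the alignment requirement concrete, here is a minimal sketch in plain Java. It is not code from this PR and does not use the ParquetRewriter API: the class RowGroupAlignmentCheck and its methods are hypothetical, and each file is modelled simply as a list of row-group row counts. It assumes that "aligned" means the two sides yield the same sequence of row-group row counts once the files on each side are read in order, regardless of how many files each side has.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of parquet-java.
final class RowGroupAlignmentCheck {

    // Flattens per-file row counts into one sequence, keeping file order and row-group order.
    static List<Long> flatten(List<List<Long>> rowCountsPerFile) {
        List<Long> all = new ArrayList<>();
        for (List<Long> file : rowCountsPerFile) {
            all.addAll(file);
        }
        return all;
    }

    // Two sides are aligned when the i-th row group on each side has the same row count.
    // File boundaries are ignored, so the number of files may differ between the sides.
    static boolean isAligned(List<List<Long>> inputFiles, List<List<Long>> inputFilesToJoin) {
        return flatten(inputFiles).equals(flatten(inputFilesToJoin));
    }

    public static void main(String[] args) {
        // Two files on the left, two on the right, split at different points.
        List<List<Long>> left = List.of(List.of(100L, 100L), List.of(50L));
        List<List<Long>> right = List.of(List.of(100L), List.of(100L, 50L));
        System.out.println(isAligned(left, right)); // prints "true"

        // Same total row count (250) but different row-group boundaries.
        List<List<Long>> misaligned = List.of(List.of(150L, 100L));
        System.out.println(isAligned(left, misaligned)); // prints "false"
    }
}
```

The second case shows why matching total row counts is not enough under this assumption: the row groups themselves have to pair up one to one.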

MaxNevermind · 2024-04-28T22:37:13Z

This PR is the outcome of the simplification I mentioned in a comment here a couple of weeks ago: #1273 (comment)
I’ve limited the set of capabilities; see this PR's description.
I’ve tried different ideas and they all turned out to have too complex an implementation, so I decided to finalize at least something with as simple an implementation as possible.
The PR is not yet polished. I just wanted to give a quick overview of the new approach. If it looks good, I will polish it.

wgtmac · 2024-04-29T05:17:48Z

Thanks for your effort! I just took a quick glimpse and it does look simpler than the previous patch.

My general question is that the prerequisite for users to use the joiner is now to run a pre-processing step like https://gist.github.com/MaxNevermind/0feaaf380520ca34c2637027ef349a7d, which you've mentioned. The pre-processing also takes time and resources. Does it mean that we have to deal with unaligned blocks anyway if users do not want to pay for the pre-processing task?

MaxNevermind · 2024-04-29T16:31:57Z

Does it mean that we have to deal with unaligned blocks anyway if users do not want to pay for the pre-processing task?

Yes, this new implementation requires blocks to be aligned. The gist snippet that prepares the data needs to be updated, by the way; that one is for the previous version.

I think this version strikes a good balance between features and implementation complexity. Including all the features of the previous version leads to a very complex implementation, which in my opinion may not be worth pursuing, since it is mainly an optimization for a pretty niche use case. For regular use cases you can just read and write the whole thing. Given that the use case is niche, it is reasonable to expect users to go the extra mile, run that snippet, and prepare the data in the required way.

wgtmac · 2024-05-03T11:04:06Z

Yes, I agree that we can start from an implementation that assumes the row groups of the files are aligned. One thing I'm not sure about is that some users may not find it easy to generate files to join with the same row group alignment, and will require the rewrite tool to handle this anyway.

maxim_konstantinov added 29 commits January 28, 2024 14:22

add initial ParquetJoiner implementation

f5144b2

add initial ParquetJoiner implementation

01a08dd

Merge remote-tracking branch 'origin/master' into add-parquet-joiner

28c987c

refactor ParquetJoiner implementation

7ae3505

extend the main test for multiple files on the right

05eb22a

extend the main test for multiple files on the right

6bb950d

Merge branch 'master' into add-parquet-joiner

87b923c

converge join logic, crate a draft of options and rewriter

f9536c3

move ParquetJoinTest logic to ParquetRewriterTest

d7f11d9

improve Parquet stitching test

e8e7ffe

remove custom ParquetRewriter constructor

3ee946c

remove custom ParquetRewriter constructor

fd409c4

refactor ParquetRewriter

5a98219

apply spotless and address PR comments

7b2fd1a

move extra column writing into processBlocksFromReader

8da8291

add getInputFiles back

68e41ba

Merge remote-tracking branch 'fork/master' into add-parquet-joiner

98b9b23

fix extra ParquetRewriter constructor so tests can pass

6d2c222

remove not needed TODOs

883e935

address PR comments

8ef36b5

Merge remote-tracking branch 'origin/master' into add-parquet-joiner

79cc2b8

rename inputFilesR to inputFilesToJoin

0bbf72f

rename inputFilesR to inputFilesToJoinColumns

ca53bff

add getParquetInputFiles listing to the rewrite start logging

1e7998a

redesign file joiner in ParquetRewriter

2ee9b40

Merge remote-tracking branch 'origin/master' into add-parquet-joiner-v2

fc32dfd

redesign file joiner in ParquetRewriter

db52c85

redesign file joiner in ParquetRewriter

9057e91

redesign file joiner in ParquetRewriter

5b055c0

uncomment some code

b70f88f

fix ParquetRewriter joiner test

270126b
