Implement nested join optimization #3843

Dandandan · 2022-10-15T18:57:14Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
For complex queries, like those in TCP-H and TCP-DS it is essential to find a good Join order.
HashBuildProbeOrder implements a rule to optimize the probe / build side of joins, but this only optimizes the joins locally (e.g. swapping the join left / right).

We should implement an algorithm that (tries to) find a (close to) global optimum based on the total estimated cost of the joins.

Describe the solution you'd like
Implement an efficient algorithm for optimizing
I'm not sure what the SOTA is on this. Some material I found with some Googling:

https://db.in.tum.de/teaching/ws1415/queryopt/chapter3.pdf
https://db.in.tum.de/~radke/papers/hugejoins.pdf
https://www.cockroachlabs.com/blog/join-ordering-pt1/
https://www.cockroachlabs.com/blog/join-ordering-ii-the-ikkbz-algorithm/
http://mlwiki.org/index.php/Join_Ordering

Describe alternatives you've considered

Additional context

The text was updated successfully, but these errors were encountered:

andygrove · 2022-10-27T17:16:41Z

We could also look at DuckDB join reordering: https://www.youtube.com/watch?v=aNRoR0Z3SzU

I filed a duplicate issue before I saw this one, although mine is specifically for the logical plan. #3984

cristian-ilies-vasile · 2023-04-21T07:16:05Z

Other resources:
Simplicity Done Right for Join Ordering - https://www.cidrdb.org/cidr2021/papers/cidr2021_paper01.pdf
Using the Join-Order-Benchmark (JOB), our simple approach provides better join orderings with significantly less optimization overhead, resulting in a substantially faster response time for all 113 JOB queries compared to state-of-the-art and recent approaches.

The MonetDB Architecture Martin Kersten CWI - https://homepages.cwi.nl/~manegold/teaching/adt/lectures/lecture2.pdf

maruschin · 2024-11-04T04:20:42Z

Hi, is there any progress?
I can take the issue for initial development.

clflushopt · 2025-02-07T04:36:37Z

Hi, I've been doing some reading on the side and I am interested into taking a stab at this if the issue is still open and no one is working on it.

clflushopt · 2025-02-07T04:43:24Z

take

clflushopt · 2025-02-09T03:24:30Z

Small update; I started looking at the initial join selection rule implemented by backtracking from the reference to HashBuildProbeOrder and went through the PR that introduced the JoinSelection rule in the physical optimizer #4219. I've also been looking at work done for the cost calculations and cost-based optimizations EPIC here #3929.

I've also started revisiting the no-statistics approach of DuckDB to try and get a better intuition for how their cardinality estimator approach works, implementation wise I think I want to get a small set of queries with nested joins running to get a better view of the current way they are handled before I draft an early implementation.

clflushopt · 2025-02-09T22:59:13Z

@alamb quick question what's considered higher priority here between better join ordering approach (potentially like DuckDB's) vs picking up the couple tickets left in the EPIC for #3929 (implementing interval analysis for AND and OR) ?

alamb · 2025-02-10T14:45:09Z

@alamb quick question what's considered higher priority here between better join ordering approach (potentially like DuckDB's) vs picking up the couple tickets left in the EPIC for #3929 (implementing interval analysis for AND and OR) ?

Hi @clflushopt

In my opinion getting Cardinality Estimation / interval analysis is higher priority as that feature has many use cases and is needed for almost any more sophisticated join ordering approach.

In terms of a more sophisticated join ordering approach, I have some ideas about this which I will try and write up over the nect week or two

clflushopt · 2025-02-11T03:48:01Z

Hey @alamb thanks for the clear answer, yes that sounds good ! It's seems that both ticket for interval boundary and selectivity analysis for AND & OR conjunctions seem open, I have an initial idea about how to implement both as I am going through #3845 and #3912 to get a better understanding of the analysis framework that builds AnalysisContext. I can take both tickets from the EPIC (for AND & OR) and start working on them.

alamb · 2025-02-12T22:35:22Z

Thanks @clflushopt

I don't have a great handle in my head on the current state of Boundary and Selectivity anaylsis. Maybe your first PRs could focus on adding some docs and examples to get warmed up and then we can start adding features (like for AND/OR)

clflushopt · 2025-02-16T02:55:08Z

Hey @alamb I have a small change in #14688 to demo boundary analysis (as I understand from the existing code), If this looks like a suitable initial example I can add one that demonstrates how AND conjunctions are analyzed and also OR conjunctions which are currently unsupported due to missing interval arithmetic for the OR operator. I also want to extensively document this part with some diagrams and example calculations but I am not sure whether this should go in docs/library or docs/contributor-guide ?

alamb · 2025-02-16T18:19:51Z

Hey @alamb I have a small change in #14688 to demo boundary analysis (as I understand from the existing code), If this looks like a suitable initial example I can add one that demonstrates how AND conjunctions are analyzed and also OR conjunctions which are currently unsupported due to missing interval arithmetic for the OR operator.

This sounds amazing -- I look forward to it

I also want to extensively document this part with some diagrams and example calculations

Likewise amazing

I am not sure whether this should go in docs/library or docs/contributor-guide ?

I recommend somewhere in https://datafusion.apache.org/library-user-guide/index.html

clflushopt · 2025-02-19T03:18:58Z

Hey @alamb following up on #14688 I made a new pull request in #14735 to add an example that demonstrates how analysis works for AND conjunctions and a placeholder for OR conjunctions (which I plan to think about and add support for next). I also added a short section in the Query Optimizer page to give a brief overview of the analysis API and the AnalysisContext.

alamb · 2025-04-28T20:43:07Z

BTW I wrote some thoughts on join ordering in a blog post (part 2)

alamb · 2025-04-28T21:54:57Z

I am trying to organize a join task force for planning joins / subqueries: [DISCUSSION] JOIN "task force" / project team #15885

Dandandan added enhancement New feature or request performance Make DataFusion faster optimizer Optimizer rules labels Oct 15, 2022

This was referenced Oct 19, 2022

Allow configuring statistics on TPC-H benchmarks #3888

Closed

Allow configuring collection of statistics during TPC-H benchmarks #3889

Merged

[EPIC] Improving cost calculations and cost based optimizations #3929

Open

github-actions bot assigned clflushopt Feb 7, 2025

alamb mentioned this issue Feb 23, 2025

Optimize nested joins #128

Closed

alamb mentioned this issue Apr 28, 2025

[DISCUSSION] JOIN "task force" / project team #15885

Open

22 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement nested join optimization #3843

Implement nested join optimization #3843

Dandandan commented Oct 15, 2022 •

edited

Loading

andygrove commented Oct 27, 2022

cristian-ilies-vasile commented Apr 21, 2023 •

edited

Loading

maruschin commented Nov 4, 2024 •

edited

Loading

clflushopt commented Feb 7, 2025

clflushopt commented Feb 7, 2025

clflushopt commented Feb 9, 2025 •

edited

Loading

clflushopt commented Feb 9, 2025

alamb commented Feb 10, 2025

clflushopt commented Feb 11, 2025

alamb commented Feb 12, 2025

clflushopt commented Feb 16, 2025 •

edited

Loading

alamb commented Feb 16, 2025

clflushopt commented Feb 19, 2025

alamb commented Apr 28, 2025 •

edited

Loading

alamb commented Apr 28, 2025

Implement nested join optimization #3843

Implement nested join optimization #3843

Comments

Dandandan commented Oct 15, 2022 • edited Loading

andygrove commented Oct 27, 2022

cristian-ilies-vasile commented Apr 21, 2023 • edited Loading

maruschin commented Nov 4, 2024 • edited Loading

clflushopt commented Feb 7, 2025

clflushopt commented Feb 7, 2025

clflushopt commented Feb 9, 2025 • edited Loading

clflushopt commented Feb 9, 2025

alamb commented Feb 10, 2025

clflushopt commented Feb 11, 2025

alamb commented Feb 12, 2025

clflushopt commented Feb 16, 2025 • edited Loading

alamb commented Feb 16, 2025

clflushopt commented Feb 19, 2025

alamb commented Apr 28, 2025 • edited Loading

alamb commented Apr 28, 2025

Dandandan commented Oct 15, 2022 •

edited

Loading

cristian-ilies-vasile commented Apr 21, 2023 •

edited

Loading

maruschin commented Nov 4, 2024 •

edited

Loading

clflushopt commented Feb 9, 2025 •

edited

Loading

clflushopt commented Feb 16, 2025 •

edited

Loading

alamb commented Apr 28, 2025 •

edited

Loading