Skip to content

Implement nested join optimization #3843

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Tracked by #15885
Dandandan opened this issue Oct 15, 2022 · 15 comments
Open
Tracked by #15885

Implement nested join optimization #3843

Dandandan opened this issue Oct 15, 2022 · 15 comments
Assignees
Labels
enhancement New feature or request optimizer Optimizer rules performance Make DataFusion faster

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Oct 15, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
For complex queries, like those in TCP-H and TCP-DS it is essential to find a good Join order.
HashBuildProbeOrder implements a rule to optimize the probe / build side of joins, but this only optimizes the joins locally (e.g. swapping the join left / right).

We should implement an algorithm that (tries to) find a (close to) global optimum based on the total estimated cost of the joins.

Describe the solution you'd like
Implement an efficient algorithm for optimizing
I'm not sure what the SOTA is on this. Some material I found with some Googling:

https://db.in.tum.de/teaching/ws1415/queryopt/chapter3.pdf
https://db.in.tum.de/~radke/papers/hugejoins.pdf
https://www.cockroachlabs.com/blog/join-ordering-pt1/
https://www.cockroachlabs.com/blog/join-ordering-ii-the-ikkbz-algorithm/
http://mlwiki.org/index.php/Join_Ordering

Describe alternatives you've considered

Additional context

@andygrove
Copy link
Member

We could also look at DuckDB join reordering: https://www.youtube.com/watch?v=aNRoR0Z3SzU

I filed a duplicate issue before I saw this one, although mine is specifically for the logical plan. #3984

@cristian-ilies-vasile
Copy link

cristian-ilies-vasile commented Apr 21, 2023

Other resources:
Simplicity Done Right for Join Ordering - https://www.cidrdb.org/cidr2021/papers/cidr2021_paper01.pdf
Using the Join-Order-Benchmark (JOB), our simple approach provides better join orderings with significantly less optimization overhead, resulting in a substantially faster response time for all 113 JOB queries compared to state-of-the-art and recent approaches.

The MonetDB Architecture Martin Kersten CWI - https://homepages.cwi.nl/~manegold/teaching/adt/lectures/lecture2.pdf

@maruschin
Copy link
Contributor

maruschin commented Nov 4, 2024

Hi, is there any progress?
I can take the issue for initial development.

@clflushopt
Copy link
Contributor

Hi, I've been doing some reading on the side and I am interested into taking a stab at this if the issue is still open and no one is working on it.

@clflushopt
Copy link
Contributor

take

@clflushopt
Copy link
Contributor

clflushopt commented Feb 9, 2025

Small update; I started looking at the initial join selection rule implemented by backtracking from the reference to HashBuildProbeOrder and went through the PR that introduced the JoinSelection rule in the physical optimizer #4219. I've also been looking at work done for the cost calculations and cost-based optimizations EPIC here #3929.

I've also started revisiting the no-statistics approach of DuckDB to try and get a better intuition for how their cardinality estimator approach works, implementation wise I think I want to get a small set of queries with nested joins running to get a better view of the current way they are handled before I draft an early implementation.

@clflushopt
Copy link
Contributor

@alamb quick question what's considered higher priority here between better join ordering approach (potentially like DuckDB's) vs picking up the couple tickets left in the EPIC for #3929 (implementing interval analysis for AND and OR) ?

@alamb
Copy link
Contributor

alamb commented Feb 10, 2025

@alamb quick question what's considered higher priority here between better join ordering approach (potentially like DuckDB's) vs picking up the couple tickets left in the EPIC for #3929 (implementing interval analysis for AND and OR) ?

Hi @clflushopt

In my opinion getting Cardinality Estimation / interval analysis is higher priority as that feature has many use cases and is needed for almost any more sophisticated join ordering approach.

In terms of a more sophisticated join ordering approach, I have some ideas about this which I will try and write up over the nect week or two

@clflushopt
Copy link
Contributor

Hey @alamb thanks for the clear answer, yes that sounds good ! It's seems that both ticket for interval boundary and selectivity analysis for AND & OR conjunctions seem open, I have an initial idea about how to implement both as I am going through #3845 and #3912 to get a better understanding of the analysis framework that builds AnalysisContext. I can take both tickets from the EPIC (for AND & OR) and start working on them.

@alamb
Copy link
Contributor

alamb commented Feb 12, 2025

Thanks @clflushopt

I don't have a great handle in my head on the current state of Boundary and Selectivity anaylsis. Maybe your first PRs could focus on adding some docs and examples to get warmed up and then we can start adding features (like for AND/OR)

@clflushopt
Copy link
Contributor

clflushopt commented Feb 16, 2025

Hey @alamb I have a small change in #14688 to demo boundary analysis (as I understand from the existing code), If this looks like a suitable initial example I can add one that demonstrates how AND conjunctions are analyzed and also OR conjunctions which are currently unsupported due to missing interval arithmetic for the OR operator. I also want to extensively document this part with some diagrams and example calculations but I am not sure whether this should go in docs/library or docs/contributor-guide ?

@alamb
Copy link
Contributor

alamb commented Feb 16, 2025

Hey @alamb I have a small change in #14688 to demo boundary analysis (as I understand from the existing code), If this looks like a suitable initial example I can add one that demonstrates how AND conjunctions are analyzed and also OR conjunctions which are currently unsupported due to missing interval arithmetic for the OR operator.

This sounds amazing -- I look forward to it

I also want to extensively document this part with some diagrams and example calculations

Likewise amazing

I am not sure whether this should go in docs/library or docs/contributor-guide ?

I recommend somewhere in https://datafusion.apache.org/library-user-guide/index.html

@clflushopt
Copy link
Contributor

Hey @alamb following up on #14688 I made a new pull request in #14735 to add an example that demonstrates how analysis works for AND conjunctions and a placeholder for OR conjunctions (which I plan to think about and add support for next). I also added a short section in the Query Optimizer page to give a brief overview of the analysis API and the AnalysisContext.

@alamb
Copy link
Contributor

alamb commented Apr 28, 2025

BTW I wrote some thoughts on join ordering in a blog post (part 2)

@alamb
Copy link
Contributor

alamb commented Apr 28, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request optimizer Optimizer rules performance Make DataFusion faster
Projects
None yet
Development

No branches or pull requests

6 participants