Skip to content

General framework to decorrelate the subqueries #5492

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Tracked by #5483 ...
mingmwang opened this issue Mar 7, 2023 · 5 comments · May be fixed by #16016
Open
Tracked by #5483 ...

General framework to decorrelate the subqueries #5492

mingmwang opened this issue Mar 7, 2023 · 5 comments · May be fixed by #16016
Labels
enhancement New feature or request

Comments

@mingmwang
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In the current DataFusion, it has very limited support for correlated subqueries. It can only decorrelate the (NOT) IN/Exists predicate subqueries to Semi/Anti Joins. Even in the simplest IN/Exists cases, if the correlated expressions are not in the Filter/Join conditions, the current decorrelate rules will not support them.

In the paper "Unnesting Arbitrary Queries" by T. Neumann; A. Kemper
(http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf). It raise a mechanism to unnest arbitrary queries. This was already implemented by the Hyper DB:

For example:
select * from orders
where 1 in (select 1 from part left join (select l_partkey from lineitem where o_orderkey = 2) lineitem on p_partkey = lineitem.l_partkey)

https://hyper-db.de/interface.html#

Both SparkSQL and PostgreSQL do not support decorrelate such kind of queries.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

@duongcongtoai
Copy link
Contributor

According to the discussions in this issue, i think we can list the following items to support a subqueries decorrelation framework:

  • Unify the optimizor for correlated query, regardless the query type (exists query, scalar query etc)
  • Support flexible decorrelation scheme (simple vs general approach), we can achieve this by following the algorithm mentioned in the 2nd paper. There is a prerequisite to introduce an index algebra during the rewrite. This index requires a pre-traversing over the whole query to detect all non-trivial subqueries, and answer the question whether simple unnesting is sufficient, or should the framework continue with the general approach
  • Implement general purpose + recursive aware subquery decorrelation for the most major operators (projection, filter, group by) using the top-down algorithm mentioned in the 2nd paper
  • Gradually support more complex expression (group by, order, limit, window function)

@alamb
Copy link
Contributor

alamb commented Apr 28, 2025

@suibianwanwank
Copy link
Contributor

suibianwanwank commented Apr 29, 2025

Unify the optimizor for correlated query, regardless the query type (exists query, scalar query etc)

I think a crucial starting point is to transform all correlated exprs into dependent join. (I'm unsure if we need to implement its Execution Plan before completing all decorrelation).

@alamb
Copy link
Contributor

alamb commented May 15, 2025

In case others haven't heard, @irenjj is working on additional subquery support as part of a Google Summer of Code Project (where @jayzhan211 and I are helping mentor).

Perhaps @suibianwanwank and @duongcongtoai would be interested in being involved too

@duongcongtoai
Copy link
Contributor

duongcongtoai commented May 15, 2025

cool, actually in this PR: #16016 i'm trying to introduce a unified structure to do decorrelation for the following usecases:

  • simple decorrelating plan with linear operators (straight forward projection/predicate pull up/ dependent join push down)
  • more complex decorrelation (group by, windows, limit ...)
  • recursive decorrelation/nested decorrelation

The parts that are not implemented are left with unimplemented! errors. I'm trying to make the PR mergable, so everyone can continue to contribute in parallel, but my current goal is to implement enough to satisfy existing slt tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants