[Spark-50873][SQL] Prune column after RewriteSubquery rule for DSV2 #50399
Conversation
Can you add a test that shows an example query benefiting from this change?
It seems that we have generated a new ScanBuilder. Does this affect the filters that have been pushed down through V2ScanRelationPushDown?
Thanks for your question. It will not affect them. This rule checks the Scan by comparing the Scan's columns with the columns required by the Project and Filter: if there are no needless columns, it does nothing; otherwise, it generates a new Scan and re-adds the Project and Filter on top of it if necessary. In summary, it only changes the scan.
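To make that check concrete, here is a minimal sketch (not the actual rule in this PR) of comparing a DSV2 scan's output columns against the columns the Project/Filter above it actually reference; the rule name and the simplified plan shape are assumptions for illustration only.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanRelation

// Hypothetical, simplified sketch of the column check described above.
object PruneDsv2ColumnsSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case p @ Project(projectList, Filter(condition, scan: DataSourceV2ScanRelation)) =>
      // Columns actually required by the Project and the Filter condition.
      val required = AttributeSet(projectList.flatMap(_.references)) ++ condition.references
      if (scan.output.forall(required.contains)) {
        // The scan already reads only what is needed: leave the plan untouched.
        p
      } else {
        // Otherwise a narrower Scan would be rebuilt here and the Project/Filter
        // re-added on top if still needed (omitted in this sketch).
        p
      }
  }
}
```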
It seems that it could also resolve the issue described in SPARK-51831. Can the case described in SPARK-51831 be turned into a test case?
Thank you! @LuciferYang, it really helps.
After running the following test case, I found that no filter is pushed down to the Scan, although columns are pruned. @Akeron-Zhu Could you check this?
Before this PR (DataSource V2): [plan output omitted]
After this PR: [plan output omitted]
Oh, I see, it is my mistake. I thought the pushed filters were the Filter operators in the plan. Thanks for pointing that out; I will check it later.
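For readers following along, a generic way to see that distinction (ordinary Spark usage, not code from this PR): filters pushed into the source appear inside the scan node's details of a formatted plan (many DSV2 file sources print them as "PushedFilters"), while filters Spark still evaluates remain as separate Filter operators above the scan. The table name below is made up.

```scala
// Inspect where a predicate ends up: in the scan's pushed filters or in a Filter node.
val df = spark.sql(
  """
    |SELECT item_id
    |FROM testcat.ns.items
    |WHERE item_id > 10
  """.stripMargin)

df.explain("formatted")
```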
Hello @cloud-fan @jackylee-ch @LuciferYang, I submitted and tested another solution, SPARK-50873-2, today. That solution only addresses the problem described in SPARK-51831, which is caused by EXISTS subqueries: it rewrites "SELECT *" as "SELECT 1" inside WHERE EXISTS during the optimization phase, because all of these problems in TPCDS come from using "SELECT *" in EXISTS. I have submitted the PR, but it currently cannot pass the TPCDSV1_4-PlanStability and TPCDSV1_4-PlanStabilityWithStats tests, because after this rule some columns may be swapped and IDs may change, so the plan differs from the golden plan files in the source tree, although the plan and the answers are correct (see the attached picture). After @jackylee-ch's reminder, I also found that my first solution (this PR) cannot push down operators such as filters and aggregates. To push down all operators, I would need to rewrite the V2ScanRelationPushDown rule completely, because a Scan cannot be modified after it has been generated; the Scan interface in Spark does not provide a way to modify it.
Hello, Jackylee, I found that my first solution cannot push down operators such as filters and aggregates. To push down all operators, I would need to rewrite the V2ScanRelationPushDown rule completely, because a Scan cannot be modified after it has been generated; the Scan interface in Spark does not provide a way to modify it. So I submitted other solutions: SPARK-50873-2 & SPARK-52186.
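For illustration, here is a minimal sketch of the "SELECT *" to "SELECT 1" idea as a standalone catalyst rule; this is not the code from SPARK-50873-2, and the rule name is hypothetical. Since EXISTS only cares whether the subquery returns any row, the subquery's projection can be replaced with a constant so that no real columns are required from it.

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Exists, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical sketch: replace the projection of an EXISTS subquery with a literal.
object RewriteExistsProjectListSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case e: Exists => e.plan match {
      // Only rewrite projections that still reference real columns (e.g. an expanded "*").
      case Project(projectList, child) if projectList.exists(_.references.nonEmpty) =>
        e.copy(plan = Project(Seq(Alias(Literal(1), "one")()), child))
      case _ => e
    }
  }
}
```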
What changes were proposed in this pull request?
This PR adds an optimizer rule to SparkOptimizer that prunes unnecessary columns for DataSource V2 (DSV2) after RewriteSubquery.
Spark 3 uses the V2ScanRelationPushDown rule to prune columns for DSV2. However, if the query contains subqueries, the RewriteSubquery rule generates new predicates after V2ScanRelationPushDown has already run. These predicates could be used to prune more columns, but Spark does not prune columns again, which hurts performance.
See the issue for a more detailed description: SPARK-50873
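For context, a hypothetical query of the shape that motivates this change (the table names are made up, not taken from the TPCDS run mentioned below): with SELECT * inside EXISTS, every column of the inner table is still considered required when V2ScanRelationPushDown builds the scan; only after the later RewriteSubquery batch turns the EXISTS into a left-semi join does it become clear that just the join key is needed, but by then the DSV2 scan has already been built.

```scala
val q = spark.sql(
  """
    |SELECT o.order_id, o.amount
    |FROM testcat.ns.orders o
    |WHERE EXISTS (SELECT * FROM testcat.ns.returns r WHERE r.order_id = o.order_id)
  """.stripMargin)

// After optimization the EXISTS has become a LeftSemi join, but the scan over
// `returns` may still list every column in its output.
println(q.queryExecution.optimizedPlan.treeString)
```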
Why are the changes needed?
Better performance for Spark DSV2.
For example, in a 10 TB TPCDS test on my cluster, the query16 execution time was reduced by about 50%, from 2.5 min to 1.3 min.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
GitHub Actions.
Was this patch authored or co-authored using generative AI tooling?
No.