DatasourceV2 does not prune columns after V2ScanRelationPushDown #9268
Comments
I don't think I'm following the logic here. Is there a case where you're not seeing columns being properly pruned?
Thanks for your response. I am looking at the TPCDS q16 physical plan for Iceberg on EMR. Link to q16: https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql
The physical plan looks like this: https://gist.github.com/akshayakp97/102715c66eee44bc6f72493f427528f8
Line 46 projects only two columns from […]. Upon further digging, I found out that […]
@aokolnychyi are you aware of this issue? It looks like some additional pruning may be done after pushdown happens?
Hi @rdblue and @aokolnychyi
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Hi @IgorBerman @akshayakp97 @rdblue, I also ran into this problem last year. It happens because Spark 3's DSv2 only prunes columns in `V2ScanRelationPushDown`; the later `RewriteSubquery` rule generates new predicates that could be used to prune further columns, but Spark does not prune columns again.
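The second pruning pass described above could, in principle, re-collect the attributes referenced after all optimizer rules have run and intersect them with the scan's read schema. A minimal sketch over a toy plan of nested dicts (these are not Catalyst nodes; all names and the plan encoding are illustrative assumptions, not Spark's actual internals):

```python
# Toy plans are nested dicts: {"op": ..., "columns": [...], "child": ...}.
# Optional "predicates" hold lists of referenced column names.

def collect_referenced(plan):
    """Gather column names referenced by every node above the scan."""
    if plan["op"] == "scan":
        return set()
    refs = set(plan.get("columns", []))
    for pred in plan.get("predicates", []):
        refs.update(pred)
    return refs | collect_referenced(plan["child"])

def reprune_scan(plan):
    """Second pruning pass: shrink the scan to the referenced columns."""
    node = plan
    while node.get("child") is not None:
        node = node["child"]
    assert node["op"] == "scan"
    needed = collect_referenced(plan) & set(node["columns"])
    node["columns"] = sorted(needed)
    return plan

# A scan that pushdown left reading three columns, topped by a Project
# (added by a later rule) that only needs two of them.
plan = {"op": "project", "columns": ["a", "b"],
        "child": {"op": "scan", "columns": ["a", "b", "c"]}}
reprune_scan(plan)
print(plan["child"]["columns"])  # ['a', 'b']
```

This mirrors the fix being discussed: the scan's column set shrinks to match what the final plan actually references.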
Thanks @Akeron-Zhu for the update! This improvement would be valuable for the community, IMO. P.S.: Our problem is more general due to highly nested schemas, which Spark does not handle well in column pruning (think of arrays of structs inside arrays of structs, etc.).
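To make the nested case above concrete, here is a toy pruner over dict-encoded schemas showing what "pruning inside arrays of structs" means. This is only a model: the dict encoding and function are assumptions for illustration, not Spark's actual nested-schema pruning (which lives in its `SchemaPruning`/nested-column-aliasing rules).

```python
# A struct is {"struct": {field: type, ...}}, an array is {"array": elem},
# and any other value is a leaf type name.

def prune_schema(schema, paths):
    """Keep only the leaf fields named by dotted paths like 'items.sku'."""
    if isinstance(schema, dict) and "array" in schema:
        return {"array": prune_schema(schema["array"], paths)}
    if isinstance(schema, dict) and "struct" in schema:
        kept = {}
        heads = {p.split(".", 1)[0] for p in paths}
        for name, typ in schema["struct"].items():
            if name in heads:
                rest = [p.split(".", 1)[1] for p in paths
                        if p.startswith(name + ".")]
                kept[name] = prune_schema(typ, rest) if rest else typ
        return {"struct": kept}
    return schema  # leaf type, nothing to prune

schema = {"struct": {
    "items": {"array": {"struct": {"sku": "string",
                                   "qty": "int",
                                   "tags": {"array": "string"}}}},
    "id": "long"}}
print(prune_schema(schema, ["items.sku", "id"]))
# {'struct': {'items': {'array': {'struct': {'sku': 'string'}}}, 'id': 'long'}}
```

The hard part in a real engine is doing this *after* late rules have narrowed the required nested fields, which is exactly the gap this thread describes.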
Query engine
Query Engine: Spark 3.5.0
Apache Iceberg: 1.4.2
Question
Hi,
My understanding is that the Spark optimizer can add a new `Project` operator even after the V2 relation is created. For example, it looks like the `ColumnPruning` optimizer rule triggers after `V2ScanRelationPushDown` here. If that's the case, then I would expect the columns projected by the newly added `Project` operator to prune the scan's schema (for example, the way `V2ScanRelationPushDown#pruneColumns` does). But I don't see schema pruning happening after `V2ScanRelationPushDown` for DataSourceV2. For DataSourceV1, by contrast, I can see the schema being pruned in the `FileSourceStrategy#apply` method before the `FileSourceScanExec` physical node is created. I don't see similar logic in `DataSourceV2Strategy` to prune the relation's schema using the latest `Attribute`s from the `Project`s and `Filter`s before `BatchScanExec` is created.

Is there a known gap with `DataSourceV2`?

Thanks in advance!
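The behavior described above can be sketched with a toy model. This is plain Python, not Spark: `Scan`, `Project`, and `prune` are illustrative stand-ins for the DSv2 scan relation, a late `Project` node, and `V2ScanRelationPushDown#pruneColumns`, used only to show why a one-shot pruning pass leaves the scan reading too much.

```python
class Scan:
    """Stand-in for a DSv2 scan: tracks which columns the source will read."""
    def __init__(self, table_columns):
        self.table_columns = list(table_columns)
        self.read_columns = list(table_columns)

    def prune(self, needed):
        # Mimics pushdown-time pruning: runs once, never revisited.
        self.read_columns = [c for c in self.read_columns if c in needed]

class Project:
    """Stand-in for a Project node added by a later optimizer rule."""
    def __init__(self, child, columns):
        self.child, self.columns = child, columns

# Initial plan: the query needs 3 of 4 columns, so pushdown prunes the scan.
scan = Scan(["a", "b", "c", "d"])
scan.prune({"a", "b", "c"})

# A later rule (e.g. ColumnPruning after a subquery rewrite) adds a
# narrower Project on top, but nothing re-runs pruning against the scan.
plan = Project(scan, ["a", "b"])

print(plan.columns)       # ['a', 'b']      <- what the query needs
print(scan.read_columns)  # ['a', 'b', 'c'] <- what the source still reads
```

The mismatch between the final two lines is the gap being asked about: the scan's read schema is frozen at pushdown time, while the plan above it keeps getting narrower.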