DatasourceV2 does not prune columns after V2ScanRelationPushDown #9268

Open
akshayakp97 opened this issue Dec 11, 2023 · 9 comments

@akshayakp97
Contributor

akshayakp97 commented Dec 11, 2023

Query engine

Query Engine: Spark 3.5.0
Apache Iceberg: 1.4.2

Question

Hi,

My understanding is that the Spark optimizer can add a new Project operator even after the V2 relation has been created. For example, it looks like the ColumnPruning optimizer rule triggers after V2ScanRelationPushDown here.

If that's the case, I would expect the columns projected by the newly added Project operator to prune the scan schema (e.g., like V2ScanRelationPushDown#pruneColumns does). But I don't see any schema pruning happening after V2ScanRelationPushDown for DataSourceV2. For DataSourceV1, by contrast, I can see the schema being pruned in the FileSourceStrategy#apply method before the FileSourceScanExec physical node is created.

I don't see similar logic in DataSourceV2Strategy to prune the relation's schema using the latest Attributes from the Projects and Filters before BatchScanExec is created.
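
For context, here is a minimal way to see what the V2 scan will actually read (a sketch; it assumes a SparkSession with an Iceberg catalog configured, and db.catalog_sales is a placeholder table name):

```scala
// Sketch: compare what the plan projects with what the V2 scan will actually read.
// `spark`, the catalog, and the table/column names are placeholders.
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanRelation

val df = spark.sql("SELECT cs_warehouse_sk, cs_order_number FROM db.catalog_sales")
df.queryExecution.optimizedPlan.foreach {
  case r: DataSourceV2ScanRelation =>
    // If pruning applied, readSchema() lists only the projected columns.
    println(r.scan.readSchema().fieldNames.mkString(", "))
  case _ =>
}
```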

Is there a known gap with DataSourceV2?

Thanks in advance!

@rdblue
Contributor

rdblue commented Dec 11, 2023

I don't think I'm following the logic here. Is there a case where you're not seeing columns being properly pruned?

@akshayakp97
Contributor Author

Thanks for your response.

I am looking at the TPCDS q16 physical plan for Iceberg on EMR.

Link to q16 - https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql

The physical plan looks like - https://gist.github.com/akshayakp97/102715c66eee44bc6f72493f427528f8

Line 46 projects only two columns, Project [cs_warehouse_sk#54840, cs_order_number#54843L]; however, it looks like Iceberg is scanning all columns of the catalog_sales table in line 47.

Upon further digging, I found that the ColumnPruning rule adds the new Project [cs_warehouse_sk#54840, cs_order_number#54843L] operator, but we still see all columns read by the corresponding BatchScanExec.
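
One way to watch this ordering (on Spark 3.1+) is the plan-change log, which logs the plan before and after each listed rule fires:

```scala
// Log plan changes made by the two rules of interest: the output shows
// V2ScanRelationPushDown converting the relation first, and ColumnPruning
// adding the narrower Project on top of the already-converted scan afterwards.
spark.conf.set("spark.sql.planChangeLog.level", "WARN")
spark.conf.set("spark.sql.planChangeLog.rules",
  "org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown," +
  "org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
```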

@akshayakp97
Contributor Author

After ColumnPruning adds the new Project [cs_warehouse_sk#54840, cs_order_number#54843L], the V2ScanRelationPushDown rule no longer matches the ScanOperation pattern when it triggers, because the scan operator for catalog_sales in the logical plan was already rewritten in an earlier run of V2ScanRelationPushDown, which converted the ScanBuilderHolder into a DataSourceV2ScanRelation.
To summarize, it looks like a new Project was added after the DataSourceV2ScanRelation had been created in the logical plan, so its columns are never pruned.
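
That shape can be matched directly in the optimized plan (a sketch; df here would be the q16 DataFrame):

```scala
// After pushdown the leaf is a DataSourceV2ScanRelation, and the Project that
// ColumnPruning adds later sits on top of it; no rule re-runs pruneColumns for
// this pattern. This snippet only detects that shape.
import org.apache.spark.sql.catalyst.plans.logical.Project
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanRelation

df.queryExecution.optimizedPlan.foreach {
  case p @ Project(_, scan: DataSourceV2ScanRelation)
      if p.references.size < scan.output.size =>
    println(s"Project uses ${p.references.size} of ${scan.output.size} " +
      s"columns from ${scan.relation.name}")
  case _ =>
}
```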

@akshayakp97
Contributor Author

In general, if a Project is added after the V2ScanRelationPushDown rule has already run, how do the columns get pruned? Or do we not expect any new Projects?

@rdblue
Contributor

rdblue commented Dec 12, 2023

@aokolnychyi are you aware of this issue? It looks like some additional pruning may be done after pushdown happens?

@IgorBerman

Hi @rdblue and @aokolnychyi,
Do you have any new ideas regarding this issue? More generally, could you provide pointers on whether Iceberg implements column pruning for highly nested schemas?
From my initial tests, it seems there are cases where it could improve; maybe that depends on Spark, where pruning isn't perfect either.

@github-actions

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Mar 24, 2025
@Akeron-Zhu

Akeron-Zhu commented Mar 27, 2025

Hi @IgorBerman @akshayakp97 @rdblue, I also ran into this problem last year. It happens because Spark 3 DSV2 only prunes columns in V2ScanRelationPushDown; the later RewriteSubquery rule generates new predicates that could be used to prune columns, but Spark does not prune columns again.
I wrote an optimizer rule to solve this problem, which noticeably improves Spark 3 performance (e.g., ~50% improvement for query 16 on 10 TB TPCDS in my tests), and I created a PR for Spark after it passed Spark's tests. The URL is: apache/spark#50399
Hope this helps.
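
If you want to check whether your workload is affected before the PR lands, here is a rough diagnostic-only sketch (simplified, not the actual code in my PR); it only flags the pattern and does not re-prune, since the real fix has to rebuild the underlying Scan:

```scala
// Diagnostic rule: warn when a Project uses fewer columns than the V2 scan
// relation below it outputs. It does not change the plan.
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanRelation

object DetectMissedPruning extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    plan.foreach {
      case p @ Project(_, scan: DataSourceV2ScanRelation)
          if p.references.size < scan.output.size =>
        logWarning(s"Scan of ${scan.relation.name} outputs ${scan.output.size} " +
          s"columns but only ${p.references.size} are used")
      case _ =>
    }
    plan
  }
}

// If I read the batch ordering right, rules added via
// spark.experimental.extraOptimizations run after the built-in batches,
// so they see the already-converted scan relations:
// spark.experimental.extraOptimizations = Seq(DetectMissedPruning)
```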

@IgorBerman

Thanks @Akeron-Zhu for the update! This improvement will be valuable for the community, IMO.

PS: Our problem is more general, due to highly nested schemas, which Spark does not handle well in column pruning (think of arrays of structs inside arrays of structs, etc.).

@github-actions github-actions bot removed the stale label Mar 28, 2025