DatasourceV2 does not prune columns after V2ScanRelationPushDown #9268

Open
akshayakp97 opened this issue Dec 11, 2023 · 9 comments

@akshayakp97
Contributor

akshayakp97 commented Dec 11, 2023

Query engine

Query Engine: Spark 3.5.0
Apache Iceberg: 1.4.2

Question

Hi,

My understanding is that the Spark optimizer can add a new Project operator even after the V2 relation has been created. For example, it looks like the ColumnPruning optimizer rule triggers after V2ScanRelationPushDown here.

If that's the case, I would expect the columns projected by the newly added Project operator to prune the scan schema (e.g., like V2ScanRelationPushDown#pruneColumns does). But I don't see any schema pruning happening after V2ScanRelationPushDown for DataSourceV2. For DataSourceV1, by contrast, I can see the schema being pruned in the FileSourceStrategy#apply method before the FileSourceScanExec physical node is created.

I don't see similar logic in DataSourceV2Strategy to prune the relation's schema using the latest Attributes from the Projects and Filters before BatchScanExec is created.
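
For context, here is a minimal way to see what the V2 scan will actually read (a sketch; it assumes a SparkSession with an Iceberg catalog configured, and db.catalog_sales is a placeholder table name):

```scala
// Sketch: compare what the plan projects with what the V2 scan will actually read.
// `spark`, the catalog, and the table/column names are placeholders.
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanRelation

val df = spark.sql("SELECT cs_warehouse_sk, cs_order_number FROM db.catalog_sales")
df.queryExecution.optimizedPlan.foreach {
  case r: DataSourceV2ScanRelation =>
    // If pruning applied, readSchema() lists only the projected columns.
    println(r.scan.readSchema().fieldNames.mkString(", "))
  case _ =>
}
```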

Is there a known gap with DataSourceV2?

Thanks in advance!

@rdblue
Contributor

rdblue commented Dec 11, 2023

I don't think I'm following the logic here. Is there a case where you're not seeing columns being properly pruned?

@akshayakp97
Contributor Author

Thanks for your response.

I am looking at the TPCDS q16 physical plan for Iceberg on EMR.

Link to q16 - https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql

The physical plan looks like - https://gist.github.com/akshayakp97/102715c66eee44bc6f72493f427528f8

Line 46 projects only two columns, Project [cs_warehouse_sk#54840, cs_order_number#54843L]; however, it looks like Iceberg is scanning all columns of the catalog_sales table in line 47.

Upon further digging, I found that the ColumnPruning rule adds the new Project [cs_warehouse_sk#54840, cs_order_number#54843L] operator, but we still see all columns read by the corresponding BatchScanExec.
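
One way to watch this ordering (on Spark 3.1+) is the plan-change log, which logs the plan before and after each listed rule fires:

```scala
// Log plan changes made by the two rules of interest: the output shows
// V2ScanRelationPushDown converting the relation first, and ColumnPruning
// adding the narrower Project on top of the already-converted scan afterwards.
spark.conf.set("spark.sql.planChangeLog.level", "WARN")
spark.conf.set("spark.sql.planChangeLog.rules",
  "org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown," +
  "org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
```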

@akshayakp97
Contributor Author

After ColumnPruning adds the new Project [cs_warehouse_sk#54840, cs_order_number#54843L], the V2ScanRelationPushDown rule no longer matches the ScanOperation pattern when it triggers, because the scan operator for catalog_sales in the logical plan was already rewritten in an earlier run of V2ScanRelationPushDown, which converted the ScanBuilderHolder into a DataSourceV2ScanRelation.
To summarize, it looks like a new Project was added after the DataSourceV2ScanRelation had been created in the logical plan, so its columns are never pruned.
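
That shape can be matched directly in the optimized plan (a sketch; df here would be the q16 DataFrame):

```scala
// After pushdown the leaf is a DataSourceV2ScanRelation, and the Project that
// ColumnPruning adds later sits on top of it; no rule re-runs pruneColumns for
// this pattern. This snippet only detects that shape.
import org.apache.spark.sql.catalyst.plans.logical.Project
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanRelation

df.queryExecution.optimizedPlan.foreach {
  case p @ Project(_, scan: DataSourceV2ScanRelation)
      if p.references.size < scan.output.size =>
    println(s"Project uses ${p.references.size} of ${scan.output.size} " +
      s"columns from ${scan.relation.name}")
  case _ =>
}
```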

@akshayakp97
Contributor Author

In general, if a Project is added after the V2ScanRelationPushDown rule has already run, how do the columns get pruned? Or do we not expect any new Projects?

@rdblue
Contributor

rdblue commented Dec 12, 2023

@aokolnychyi are you aware of this issue? It looks like some additional pruning may be done after pushdown happens?

@IgorBerman

Hi @rdblue and @aokolnychyi,
Do you have any new ideas regarding this issue? More generally, could you provide pointers on whether Iceberg implements column pruning for highly nested schemas?
From my initial tests, it seems there are cases where it could improve; maybe that depends on Spark, where pruning isn't perfect either.

@github-actions

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Mar 24, 2025
@Akeron-Zhu

Akeron-Zhu commented Mar 27, 2025

Hi @IgorBerman @akshayakp97 @rdblue, I also ran into this problem last year. It happens because Spark 3 DSV2 only prunes columns in V2ScanRelationPushDown; the later RewriteSubquery rule generates new predicates that could be used to prune columns, but Spark does not prune columns again.
I wrote an optimizer rule to solve this problem, which noticeably improves Spark 3 performance (e.g., ~50% improvement for query 16 on 10 TB TPCDS in my tests), and I created a PR for Spark after it passed Spark's tests. The URL is: apache/spark#50399
Hope this helps.
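
If you want to check whether your workload is affected before the PR lands, here is a rough diagnostic-only sketch (simplified, not the actual code in my PR); it only flags the pattern and does not re-prune, since the real fix has to rebuild the underlying Scan:

```scala
// Diagnostic rule: warn when a Project uses fewer columns than the V2 scan
// relation below it outputs. It does not change the plan.
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanRelation

object DetectMissedPruning extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    plan.foreach {
      case p @ Project(_, scan: DataSourceV2ScanRelation)
          if p.references.size < scan.output.size =>
        logWarning(s"Scan of ${scan.relation.name} outputs ${scan.output.size} " +
          s"columns but only ${p.references.size} are used")
      case _ =>
    }
    plan
  }
}

// If I read the batch ordering right, rules added via
// spark.experimental.extraOptimizations run after the built-in batches,
// so they see the already-converted scan relations:
// spark.experimental.extraOptimizations = Seq(DetectMissedPruning)
```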

@IgorBerman

Thanks @Akeron-Zhu for the update! This improvement will be valuable for the community, IMO.

PS: Our problem is more general, due to highly nested schemas, which Spark does not handle well in column pruning (think of arrays of structs inside arrays of structs, etc.).

@github-actions github-actions bot removed the stale label Mar 28, 2025