
[SPARK-50873][SQL] Prune column after RewriteSubquery rule for DSV2 #50399


Closed
wants to merge 2 commits

Conversation


@Akeron-Zhu commented Mar 26, 2025

What changes were proposed in this pull request?

This PR adds an optimizer rule to SparkOptimizer that prunes unnecessary columns for DataSource V2 (DSV2) after the RewriteSubquery rule.
Spark 3 uses the V2ScanRelationPushDown rule to prune columns for DSV2. However, if there are subqueries in the query SQL, the RewriteSubquery rule generates new predicates after V2ScanRelationPushDown has already executed. These predicates could be used to prune more columns, but Spark does not prune columns again, which causes lower performance.
See the issue for a more detailed description: SPARK-50873
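
As an illustration (with hypothetical table names, not taken from the PR), RewriteSubquery turns an EXISTS predicate into a left-semi join whose condition references only a few of the subquery's columns, so a scan that was planned while the subquery still read "SELECT *" could be narrowed afterwards:

    // Illustrative only: a query shape where RewriteSubquery creates a pruning
    // opportunity that V2ScanRelationPushDown (which runs earlier) misses.
    spark.sql(
      """
        |SELECT o.id
        |FROM orders o
        |WHERE EXISTS (SELECT * FROM items i WHERE i.order_id = o.id)
        |""".stripMargin)
    // After RewriteSubquery, the EXISTS becomes a LEFT SEMI JOIN on
    // i.order_id = o.id, so only items.order_id is needed, but the DSV2
    // scan for items may already have been planned to read every column.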

Why are the changes needed?

Better performance for Spark DSV2.
For example, in a 10 TB TPC-DS test, the execution time of query16 was reduced by about 50%, from 2.5 min to 1.3 min, on my cluster.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

GitHub Actions.

Was this patch authored or co-authored using generative AI tooling?

No.

@cloud-fan
Contributor

Can you add a test to show an example query that benefits from this change?

@jackylee-ch
Contributor

It seems that we have generated a new ScanBuilder. Does this affect the filters that have been pushed down through V2ScanRelationPushDown?

@Akeron-Zhu
Author

It seems that we have generated a new ScanBuilder. Does this affect the filters that have been pushed down through V2ScanRelationPushDown?

Thanks for your question. It will not affect them, because this rule checks the Scan by comparing the Scan's columns with the columns required by the Project and Filter. If there are no needless columns, it does nothing; otherwise, it generates a new Scan and, if necessary, adds the Project and Filter back on top of it. In summary, it only changes the scan.
I learned from the pruneColumns function of V2ScanRelationPushDown to write this rule; the only difference is the logic for pruning the unnecessary columns generated by the RewriteSubquery rule.
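
A minimal sketch of the rule's shape (hypothetical names, not the PR's actual code; a real rule must create a new Scan through the ScanBuilder, as V2ScanRelationPushDown's pruneColumns does, rather than merely shrinking the relation's output):

    import org.apache.spark.sql.catalyst.expressions.AttributeSet
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanRelation

    object PruneColumnsAfterRewriteSubquery extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
        case p @ Project(projectList, f @ Filter(cond, scan: DataSourceV2ScanRelation)) =>
          // Columns actually required by the Project and Filter above the scan.
          val required: AttributeSet = p.references ++ f.references
          if (scan.output.forall(required.contains)) {
            p // no needless columns: do nothing, pushed-down filters stay intact
          } else {
            // Narrow the scan to the required columns, then re-add the Filter
            // and Project on top (sketch only; see the caveat above).
            val pruned = scan.copy(output = scan.output.filter(required.contains))
            Project(projectList, Filter(cond, pruned))
          }
      }
    }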

@LuciferYang
Contributor

It seems that it could also resolve the issue described in SPARK-51831. Can the case described in SPARK-51831 be modified into a test case?

@Akeron-Zhu
Author

It seems that it could also resolve the issue described in SPARK-51831. Can the case described in SPARK-51831 be modified into a test case?

Thank you, @LuciferYang, that really helps.
Yes, I wrote this PR because I encountered the same problem as SPARK-51831. After checking the column pruning of DSV1 in the code, I found that DSV1 computes the required columns and performs column pruning when generating the scan at the end. So this PR follows the same approach as DSV1: it checks the plan and prunes the unnecessary columns.
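
A hedged sketch of that DSV1 behavior (simplified, not copied from Spark's source): at physical planning time the Project/Filter stack above a relation is collapsed, so the required columns are derived once, at the very end:

    import org.apache.spark.sql.catalyst.expressions.AttributeSet
    import org.apache.spark.sql.catalyst.planning.PhysicalOperation
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

    // Compute the columns a relation must produce for the operators above it.
    def requiredColumns(plan: LogicalPlan): Option[AttributeSet] = plan match {
      // PhysicalOperation collapses the whole Project/Filter stack above a
      // relation, which is why DSV1 sees the final set of needed columns.
      case PhysicalOperation(projects, filters, _) =>
        Some(AttributeSet(projects.flatMap(_.references)) ++
          AttributeSet(filters.flatMap(_.references)))
      case _ => None
    }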

@jackylee-ch
Contributor

After running the following test case, I found that no filter is pushed down to the Scan, although columns are pruned. @Akeron-Zhu Could you check it?

test("Test exist join with v2 source plan") {
    import org.apache.spark.sql.functions._
    withTempPath { dir =>
      spark.range(100)
        .withColumn("col1", col("id") + 1)
        .withColumn("col2", col("id") + 2)
        .withColumn("col3", col("id") + 3)
        .withColumn("col4", col("id") + 4)
        .withColumn("col5", col("id") + 5)
        .withColumn("col6", col("id") + 6)
        .withColumn("col7", col("id") + 7)
        .withColumn("col8", col("id") + 8)
        .withColumn("col9", col("id") + 9)
        .write
        .mode("overwrite")
        .parquet(dir.getCanonicalPath + "/t1")
      spark.range(10).write.mode("overwrite").parquet(dir.getCanonicalPath + "/t2")
      Seq("parquet", "").foreach { v1SourceList =>
        withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> v1SourceList) {
          spark.read.parquet(dir.getCanonicalPath + "/t1").createOrReplaceTempView("t1")
          spark.read.parquet(dir.getCanonicalPath + "/t2").createOrReplaceTempView("t2")
          spark.sql(
            """
              |select sum(t1.id) as sum_id
              |from t1, t2
              |where t1.id == t2.id
              |      and exists(select * from t1 where t1.id == t2.id  and t1.col1>5)
              |""".stripMargin).explain()
        }
      }
    }
  } 

Before this PR:
DataSource V1:

FileScan parquet [id#32L,col1#33L] Batched: true, DataFilters: [isnotnull(col1#33L), (col1#33L > 5)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-40..., PartitionFilters: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], ReadSchema: struct<id:bigint,col1:bigint>

DataSource V2:

BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-c0b5b1ad-138c-4a87-a39b-12c50600f061/t1[id#58L, col1#59L, col2#60L, col3#61L, col4#62L, col5#63L, col6#64L, col7#65L, col8#66L, col9#67L] ParquetScan DataFilters: [isnotnull(col1#59L), (col1#59L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-c0..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint,col2:bigint,col3:bigint,col4:bigint,col5:bigint,col6:bigint,col7:big... RuntimeFilters: []

After this PR:
DataSource V2:

BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-4030df5b-b0de-423d-b548-07b85390bade/t1[id#58L, col1#59L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-40..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint> RuntimeFilters: []

@Akeron-Zhu
Author

After running the following test case, I found that no filter is pushed down to the Scan, although columns are pruned. @Akeron-Zhu Could you check it? [quoted test case and plans omitted; see the comment above]

Oh, I see. It is my mistake; I thought the pushed filters were the filters in the plan. Thanks for pointing that out. I will check it later.

@Akeron-Zhu
Author

Akeron-Zhu commented May 16, 2025

Hello teachers @cloud-fan @jackylee-ch @LuciferYang, today I submitted and tested another solution, SPARK-50873-2. That solution only addresses the problem in SPARK-51831 caused by EXISTS subqueries: it rewrites "SELECT *" as "SELECT 1" inside WHERE EXISTS during the optimization phase, because all of these problems in TPC-DS are caused by using "SELECT *" in EXISTS. I have submitted the PR, but it currently cannot pass the TPCDSV1_4-PlanStability and TPCDSV1_4-PlanStabilityWithStats tests: after this rule, some columns may be swapped and IDs may change, so the plan differs from the golden plan files in the source tree, although the plan and the answers are correct. As the picture shows:
[Image: Plan Compare]
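
A minimal sketch of that rewrite's idea (hypothetical rule name and deliberately simplified matching, not the actual SPARK-50873-2 code):

    import org.apache.spark.sql.catalyst.expressions.{Alias, Exists, Literal}
    import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
    import org.apache.spark.sql.catalyst.rules.Rule

    object RewriteExistsStarAsLiteral extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
        case e: Exists =>
          e.plan match {
            // Crude proxy for "SELECT *": a projection of more than one column.
            case Project(projectList, child) if projectList.size > 1 =>
              // EXISTS only tests whether a row exists, so its select list is
              // irrelevant; project a constant instead of the expanded star.
              e.withNewPlan(Project(Seq(Alias(Literal(1), "one")()), child))
            case _ => e
          }
      }
    }

The size check is only for illustration; a real rule would have to recognize a genuinely expanded star and skip subqueries whose projection matters.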

After @jackylee-ch's reminder, I found that my first solution (this PR) cannot push down operators such as filters and aggregates. To push down all operators, I would need to rewrite the V2ScanRelationPushDown rule completely, because a Scan cannot be modified after it is generated; the Scan interface in Spark does not provide a way to modify it.
I wrote the first solution hoping to compute the required columns at the end, like FileScan, which can effectively avoid unnecessary columns in any situation. So I have an idea: could V2ScanRelationPushDown be executed last in SparkOptimizer? As this involves many parts and I am a new learner of Spark, I leave it to all teachers.

@Akeron-Zhu closed this May 16, 2025
@Akeron-Zhu
Author

After running the following test case, I found that no filter is pushed down to the Scan, although columns are pruned. @Akeron-Zhu Could you check it? [quoted test case and plans omitted; see the comment above]

Hello Jackylee, I found that my first solution cannot push down operators such as filters and aggregates. To push down all operators, I would need to rewrite the V2ScanRelationPushDown rule completely, because a Scan cannot be modified after it is generated; the Scan interface in Spark does not provide a way to modify it. So I submitted other solutions: SPARK-50873-2 & SPARK-52186.
Thanks again for your reminder.
