[SPARK-52104][CONNECT][SCALA] Validate column name eagerly in Spark Connect Scala Client #50873

Open · wants to merge 3 commits into master
Conversation

@xi-db (Contributor) commented May 13, 2025

What changes were proposed in this pull request?

Currently, calling DataFrame.col(colName) with a non-existent column in the Scala Spark Connect client does not raise an error. In contrast, both PySpark (in either Spark Connect or Spark Classic) and Scala in Spark Classic do raise an exception in such cases. This leads to inconsistent behavior between Spark Connect and Spark Classic in Scala.

PySpark on Spark Classic:

> spark.range(10)['nonexistent']
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

PySpark on Spark Connect:

> spark.range(10)['nonexistent']
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

Scala on Spark Classic:

> spark.range(10).col("nonexistent")
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

Scala on Spark Connect:

> spark.range(10).col("nonexistent")
res5: org.apache.spark.sql.Column = unresolved_attribute {
  unparsed_identifier: "nonexistent"
  plan_id: 2
}

No exception is thrown; the call simply returns an unresolved column expression.

This PR implements eager validation of column names in the Scala client's DataFrame.col(colName) method, making its behavior consistent with both Spark Classic and PySpark. The implementation is modeled on PySpark's __getitem__ and verify_col_name methods.

The Scala client on Spark Connect now throws an error:

> spark.range(10).col("nonexistent")
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703
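
For illustration, the core of such eager validation can be sketched as a standalone helper, modeled on PySpark's verify_col_name. The names below (ColNameValidation, verifyColName) and the simplifications (no backtick-quoted identifiers, no map or array access) are illustrative, not the PR's exact code:

import org.apache.spark.sql.types.{DataType, StructType}

object ColNameValidation {
  // Returns true if `name` (possibly nested, e.g. "a.b.c") resolves
  // against `schema` by walking struct fields part by part.
  def verifyColName(name: String, schema: StructType): Boolean = {
    def walk(parts: List[String], dt: DataType): Boolean = parts match {
      case Nil => true
      case head :: tail => dt match {
        case s: StructType =>
          s.fields.find(_.name == head).exists(f => walk(tail, f.dataType))
        case _ => false
      }
    }
    walk(name.split('.').toList, schema)
  }
}

In the Connect client, the schema used for this check has to be fetched from the server, which is what makes the call eager; when the check fails, the client raises the UNRESOLVED_COLUMN.WITH_SUGGESTION error shown above.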

Why are the changes needed?

This PR ensures consistent behavior between Spark Connect and Spark Classic in the scenario described above.

Does this PR introduce any user-facing change?

Yes. Referencing a non-existent column in the Scala client now throws an error.

How was this patch tested?

New test case.
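
A test for this behavior could look roughly like the ScalaTest sketch below (suite placement and assertion details are illustrative, not the PR's exact test):

// Assumes a Connect client test suite with a `spark` session in scope,
// ScalaTest's `intercept`, and org.apache.spark.sql.AnalysisException imported.
test("col with a nonexistent column name fails eagerly") {
  val e = intercept[AnalysisException] {
    spark.range(10).col("nonexistent")
  }
  assert(e.getMessage.contains("UNRESOLVED_COLUMN.WITH_SUGGESTION"))
}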

Was this patch authored or co-authored using generative AI tooling?

No.

@xi-db (Contributor, Author) commented May 13, 2025

Hi @zhengruifeng, I'm fixing the behavior difference when referencing a non-existent column in the Spark Connect Scala client, based on PySpark's __getitem__ and verify_col_name methods. Could you please review this PR?

@zhengruifeng (Contributor) commented
You may need to resolve the test failures; otherwise LGTM.
