[SPARK-52104][CONNECT][SCALA] Validate column name eagerly in Spark Connect Scala Client #50873

Open · wants to merge 3 commits into master
Conversation

@xi-db (Contributor) commented May 13, 2025

What changes were proposed in this pull request?

Currently, calling DataFrame.col(colName) with a non-existent column in the Scala Spark Connect client does not raise an error. In contrast, both PySpark (in either Spark Connect or Spark Classic) and Scala in Spark Classic do raise an exception in such cases. This leads to inconsistent behavior between Spark Connect and Spark Classic in Scala.

PySpark on Spark Classic:

> spark.range(10)['nonexistent']
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

PySpark on Spark Connect:

> spark.range(10)['nonexistent']
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

Scala on Spark Classic:

> spark.range(10).col("nonexistent")
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703

Scala on Spark Connect:

> spark.range(10).col("nonexistent")
res5: org.apache.spark.sql.Column = unresolved_attribute {
  unparsed_identifier: "nonexistent"
  plan_id: 2
}

No exception is thrown; the call simply returns an unresolved column expression.

This PR implements eager validation of column names in the Scala client's DataFrame.col(colName) method, making its behavior consistent with both Spark Classic and PySpark. The implementation is modeled on PySpark's __getitem__ and verify_col_name methods.

The Scala client on Spark Connect now throws an error:

> spark.range(10).col("nonexistent")
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `nonexistent` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703
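
For illustration, the core of such eager validation can be sketched as a standalone helper, modeled on PySpark's verify_col_name. The names below (ColNameValidation, verifyColName) and the simplifications (no backtick-quoted identifiers, no map or array access) are illustrative, not the PR's exact code:

import org.apache.spark.sql.types.{DataType, StructType}

object ColNameValidation {
  // Returns true if `name` (possibly nested, e.g. "a.b.c") resolves
  // against `schema` by walking struct fields part by part.
  def verifyColName(name: String, schema: StructType): Boolean = {
    def walk(parts: List[String], dt: DataType): Boolean = parts match {
      case Nil => true
      case head :: tail => dt match {
        case s: StructType =>
          s.fields.find(_.name == head).exists(f => walk(tail, f.dataType))
        case _ => false
      }
    }
    walk(name.split('.').toList, schema)
  }
}

In the Connect client, the schema used for this check has to be fetched from the server, which is what makes the call eager; when the check fails, the client raises the UNRESOLVED_COLUMN.WITH_SUGGESTION error shown above.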

Why are the changes needed?

This PR ensures consistent behavior between Spark Connect and Spark Classic in the scenario described above.

Does this PR introduce any user-facing change?

Yes. Referencing a non-existent column in the Scala client now throws an error.

How was this patch tested?

New test case.
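
A test for this behavior could look roughly like the ScalaTest sketch below (suite placement and assertion details are illustrative, not the PR's exact test):

// Assumes a Connect client test suite with a `spark` session in scope,
// ScalaTest's `intercept`, and org.apache.spark.sql.AnalysisException imported.
test("col with a nonexistent column name fails eagerly") {
  val e = intercept[AnalysisException] {
    spark.range(10).col("nonexistent")
  }
  assert(e.getMessage.contains("UNRESOLVED_COLUMN.WITH_SUGGESTION"))
}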

Was this patch authored or co-authored using generative AI tooling?

No.

@xi-db (Contributor, Author) commented May 13, 2025

Hi @zhengruifeng, I'm fixing the behavior difference when referencing a non-existent column in the Spark Connect Scala client, based on PySpark's __getitem__ and verify_col_name methods. Could you please review this PR?

@zhengruifeng (Contributor) commented
You may need to resolve the test failures; otherwise LGTM.
