[SPARK-52580][PS] Avoid CAST_INVALID_INPUT of replace in ANSI mode #51297

Open: wants to merge 2 commits into base: master

xinrong-meng
Member

@xinrong-meng xinrong-meng commented Jun 26, 2025

What changes were proposed in this pull request?

Avoid the CAST_INVALID_INPUT error raised by replace in ANSI mode.

Specifically, under ANSI mode:

  • use try_cast() to cast values safely
  • skip F.isnan() in NaN checks on non-numeric types
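To illustrate the two points above without a Spark session, here is a minimal pure-Python sketch of the semantics involved. The helpers try_cast_to_int and is_nan_safe are hypothetical names made up for this illustration. They are not part of PySpark and are not the code in this PR. They only mirror the idea that a "try" cast yields a null instead of raising, and that a NaN test should only run on numeric values.

```python
import math
from typing import Any, Optional


def try_cast_to_int(value: Any) -> Optional[int]:
    """Hypothetical stand-in for try_cast semantics: return None instead of raising."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # A strict cast would surface an error here (analogous to CAST_INVALID_INPUT).
        return None


def is_nan_safe(value: Any) -> bool:
    """Run the NaN test only on floats; math.isnan("a") would raise TypeError."""
    return isinstance(value, float) and math.isnan(value)


print(try_cast_to_int("3"))       # castable string
print(try_cast_to_int("a"))       # invalid input becomes None rather than an error
print(is_nan_safe(float("nan")))  # numeric NaN is detected
print(is_nan_safe("a"))           # non-numeric values are never treated as NaN
```

The type guard in is_nan_safe plays the same role as skipping F.isnan() for non-numeric columns: the check is never evaluated on values it cannot handle.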

For example, the generated Spark column expression differs between ANSI off and ANSI on as follows:

# if the original column is of StringType
# ANSI off
Column<'CASE WHEN in(C, 0, 1, 2, 3, 5, 6) THEN 4 ELSE C END'>

# ANSI on
Column<'CASE WHEN in(C, TRY_CAST(0 AS STRING), TRY_CAST(1 AS STRING), TRY_CAST(2 AS STRING), TRY_CAST(3 AS STRING), TRY_CAST(5 AS STRING), TRY_CAST(6 AS STRING)) THEN TRY_CAST(4 AS STRING) ELSE TRY_CAST(C AS STRING) END'>
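The shape of that rewrite can be reproduced with a small string-rendering sketch. The function render_replace_expr below is a hypothetical helper written for this illustration. It does not build a real Spark Column and is not how the PR constructs the expression. It only shows that, with ANSI on, every literal, the replacement value, and the fallback column are wrapped in TRY_CAST to the column's type.

```python
from typing import Sequence


def render_replace_expr(
    col: str,
    to_replace: Sequence[int],
    value: int,
    ansi: bool,
    dtype: str = "STRING",
) -> str:
    """Render the CASE WHEN text for replace (illustrative only, hypothetical helper)."""

    def wrap(term: object) -> str:
        # Under ANSI mode each term is wrapped in TRY_CAST to the column's type.
        return f"TRY_CAST({term} AS {dtype})" if ansi else str(term)

    candidates = ", ".join(wrap(v) for v in to_replace)
    return f"CASE WHEN in({col}, {candidates}) THEN {wrap(value)} ELSE {wrap(col)} END"


print(render_replace_expr("C", [0, 1, 2, 3, 5, 6], 4, ansi=False))
print(render_replace_expr("C", [0, 1, 2, 3, 5, 6], 4, ansi=True))
```

With these arguments the two printed strings match the ANSI off and ANSI on expressions quoted above.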

Why are the changes needed?

Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52556.

Does this PR introduce any user-facing change?

Yes, replace now works under ANSI mode. For example:

>>> import numpy as np
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> ps.set_option("compute.fail_on_ansi_mode", False)
>>> ps.set_option("compute.ansi_mode_support", True)
>>> pdf = pd.DataFrame(
...     {"A": [0, 1, 2, 3, np.nan], "B": [5, 6, 7, 8, np.nan], "C": ["a", "b", "c", "d", None]},
...     index=np.random.rand(5),
... )
>>> psdf = ps.from_pandas(pdf)
>>> psdf["C"].replace([0, 1, 2, 3, 5, 6], 4)
0.458472       a
0.749773       b
0.222904       c
0.397280       d
0.293933    None
Name: C, dtype: object
>>> psdf.replace([0, 1, 2, 3, 5, 6], [6, 5, 4, 3, 2, 1])
            A    B     C
0.458472  6.0  2.0     a
0.749773  5.0  1.0     b
0.222904  4.0  7.0     c
0.397280  3.0  8.0     d
0.293933  NaN  NaN  None

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No
