Skip to content

[SPARK-51691][CORE] SerializationDebugger should swallow exception when try to find the reason of serialization problem#50573

Closed
summaryzb wants to merge 1 commit intoapache:masterfrom
summaryzb:SPARK-51691_followup
Closed

[SPARK-51691][CORE] SerializationDebugger should swallow exception when try to find the reason of serialization problem#50573
summaryzb wants to merge 1 commit intoapache:masterfrom
summaryzb:SPARK-51691_followup

Conversation

@summaryzb
Copy link
Contributor

@summaryzb summaryzb commented Apr 13, 2025

What changes were proposed in this pull request?

Catch SparkRuntimeException when deep into serialization exception stack during unit test.

Why are the changes needed?

Present a clearer serialization exception stack hierarchy during test, toString implementation of TreeNode may throw exception since SQLConf.get check. It is helpful for debug the real problem

Does this PR introduce any user-facing change?

Yes, but it only take effect in unit test.
User will see the direct serialization exception and the reference chain beyond the root cause.
Before this pr, user will get confuse when unrelated exception is shown

WARN org.apache.spark.serializer.SerializationDebugger: Exception in serialization debugger
org.apache.spark.SparkRuntimeException: Cannot get SQLConf inside scheduler event loop thread.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGetSQLConfInSchedulerEventLoopThreadError(QueryExecutionErrors.scala:2002)
    at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:225)
    at org.apache.spark.sql.execution.ScalarSubquery.toString(subquery.scala:69)
    at java.lang.String.valueOf(String.java:2994)
    at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:203)
    at scala.collection.immutable.Stream.addString(Stream.scala:701)
    at scala.collection.TraversableOnce.mkString(TraversableOnce.scala:377)
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: 
org.apache.spark.SimpleFutureAction

After this pr

org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.SimpleFutureAction
Serialization stack:
	- object not serializable (class: org.apache.spark.SimpleFutureAction, value: org.apache.spark.SimpleFutureAction@4050649d)
	- writeObject data (class: java.util.concurrent.ConcurrentHashMap)
	- object (class java.util.concurrent.ConcurrentHashMap)
	....(not shown)

How was this patch tested?

Pass GitHub Actions

Was this patch authored or co-authored using generative AI tooling?

No

@summaryzb
Copy link
Contributor Author

cc @mridulm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do this even for non testing case, right ? It will be an improvement over earlier behavior (which would have thrown NPE), and minimize side effects of improveException

Copy link
Contributor Author

@summaryzb summaryzb Apr 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mridulm Appreciate your time to review
1.NPE have been checked in line#94
2.SPARK-51691 and this follow main resolve the problem, that toString implementation of TreeNode may throw exception since SQLConf.get check during unit test.

Actually non testing case will not encounter similar problem

Copy link
Contributor

@mridulm mridulm Apr 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NPE I was referring to is what is getting thrown as part of executing toString : what we addressed in #50402

The current reason for NPE might be (2), as you indicated - and so perhaps scoped to Utils.isTesting currently.
But this can happen for any other reason too (including bugs in user udf's for example).

The diff I am proposing is, replace case _: SparkRuntimeException if Utils.isTesting with case _: Exception
I want to make sure this wont cause any other issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for detailed explanation
I got your point, how about make it case NonFatal(_)

@summaryzb summaryzb closed this Apr 18, 2025
@summaryzb summaryzb force-pushed the SPARK-51691_followup branch from b7c58ed to 23c77b7 Compare April 18, 2025 05:27
@summaryzb summaryzb reopened this Apr 18, 2025
@summaryzb summaryzb force-pushed the SPARK-51691_followup branch from 289ebc5 to 4515493 Compare April 18, 2025 05:32
@LuciferYang
Copy link
Contributor

@summaryzb I reopen SPARK-51691, please rebase and submit a complete pr here, thanks

@summaryzb summaryzb force-pushed the SPARK-51691_followup branch from 4515493 to 2e2562d Compare April 18, 2025 10:14
@summaryzb summaryzb changed the title [SPARK-51691][CORE][FOLLOWUP] Make SerializationDebugger work in non-… [SPARK-51691][CORE] SerializationDebugger should swallow exception when try to find the reason of serialization problem Apr 18, 2025
@summaryzb
Copy link
Contributor Author

@LuciferYang done, PTAL

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.
+CC @LuciferYang

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

@LuciferYang
Copy link
Contributor

Merged into master. Thanks @summaryzb and @mridulm

@summaryzb summaryzb deleted the SPARK-51691_followup branch May 29, 2025 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants