Describe the bug
In CI we have been seeing occasional failures related to NDS scale factor 3k when running on a Grace Hopper cluster. It appears to only ever crash when we are running with Parquet data that uses decimals, not floats, for many of the numeric types.
We need someone to go through all of the historical runs and see if we can fully understand what is happening here, before we dig into a single possible explanation.
One of the odd things is that for at least a few of the runs we see errors when trying to deserialize a task:
24/12/09 08:25:06 INFO Executor: Running task 140.0 in stage 34.0 (TID 12218)
24/12/09 08:25:06 INFO TorrentBroadcast: Started reading broadcast variable 41 with 1 pieces (estimated total size 4.0 MiB)
24/12/09 08:25:06 INFO MemoryStore: Block broadcast_41_piece0 stored as bytes in memory (estimated size 35.0 KiB, free 8.4 GiB)
24/12/09 08:25:06 INFO TorrentBroadcast: Reading broadcast variable 41 took 2 ms
24/12/09 08:25:06 INFO MemoryStore: Block broadcast_41 stored as values in memory (estimated size 81.8 KiB, free 8.4 GiB)
24/12/09 08:25:06 ERROR Executor: Exception in task 140.0 in stage 34.0 (TID 12218)
java.io.IOException: unexpected exception type
at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1750)
at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1280)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2222)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
--
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
--
at java.lang.invoke.CallSite.makeSite(CallSite.java:341)
at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
at java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
at org.apache.spark.sql.catalyst.InternalRow$.$deserializeLambda$(InternalRow.scala)
... 337 more
Caused by: java.lang.NullPointerException
at java.lang.invoke.CallSite.makeSite(CallSite.java:325)
... 340 more
24/12/09 08:25:06 INFO CoarseGrainedExecutorBackend: Got assigned task 12226
24/12/09 08:25:06 INFO Executor: Running task 158.0 in stage 34.0 (TID 12226)
24/12/09 08:25:06 INFO TorrentBroadcast: Started reading broadcast variable 40 with 1 pieces (estimated total size 4.0 MiB)
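For context on why this surfaces as java.io.IOException: unexpected exception type: the task closure is deserialized with plain Java serialization, and JVM lambdas round-trip through java.lang.invoke.SerializedLambda. Its readResolve reflectively invokes the capturing class's $deserializeLambda$ (here InternalRow$), and linking that method's call site is what dies with the NullPointerException in CallSite.makeSite; ObjectStreamClass then wraps the checked exception escaping readResolve in that IOException. A minimal standalone sketch of the same round trip (our own illustration, not code from the failing job):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object LambdaRoundTrip {
  def main(args: Array[String]): Unit = {
    // Scala 2.12+ emits lambdas as serializable invokedynamic call sites.
    val f: Int => Int = _ + 1

    // Serialize: writeReplace swaps the lambda for a java.lang.invoke.SerializedLambda.
    val buf = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(buf)
    oos.writeObject(f)
    oos.close()

    // Deserialize: SerializedLambda.readResolve reflectively invokes the capturing
    // class's $deserializeLambda$, whose call-site linking (CallSite.makeSite) is
    // the frame where the NullPointerException in the log above was thrown.
    val ois = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    val g = ois.readObject().asInstanceOf[Int => Int]
    println(g(41)) // 42
  }
}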
When we zoom in on the last part of the call stack, we see:
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:87)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
This is on Spark 3.4.3, so ShuffleMapTask.scala:87 is just deserializing the broadcast task binary into an (RDD[_], ShuffleDependency[_, _, _]) tuple.
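Spark hardwires the closure serializer to JavaSerializer, which matches the JavaSerializerInstance frames above. A runnable stand-in for that step (the payload here is hypothetical; the real task binary holds the RDD/ShuffleDependency tuple; assumes spark-core on the classpath):

import java.nio.ByteBuffer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer

object TaskBinaryRoundTrip {
  def main(args: Array[String]): Unit = {
    // Same serializer the executor uses for task closures.
    val ser = new JavaSerializer(new SparkConf()).newInstance()

    // Stand-in payload; the real task binary holds (RDD[_], ShuffleDependency[_, _, _]).
    val payload: (String, Int => Int) = ("stand-in", _ + 1)

    val bytes: ByteBuffer = ser.serialize(payload)

    // The step that intermittently blows up in the log above:
    // JavaSerializerInstance.deserialize walking the object graph,
    // including any serialized lambdas it contains.
    val back = ser.deserialize[(String, Int => Int)](
      bytes, Thread.currentThread.getContextClassLoader)
    println(back._2(41)) // 42
  }
}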
The odd part is that this appears to pass on retry. Currently I suspect some kind of memory/network corruption, because the Grace Hopper hardware we are running on is pre-production, but the failure is not specific to a single node and it is specific to a single query, which makes it more fun to try to debug.