Describe the bug
In CI we have been seeing occasional failures related to NDS scale factor 3k when running on a Grace Hopper cluster. It appears to only ever crash when we are running with Parquet data that uses decimals, not floats, for many of the numeric types.
We need someone to go through all of the historical runs and see if we can fully understand what is happening here, before we dig into a single possible explanation.
One of the odd things is that for at least a few of the runs we see errors when trying to deserialize a task:
24/12/09 08:25:06 INFO Executor: Running task 140.0 in stage 34.0 (TID 12218)
24/12/09 08:25:06 INFO TorrentBroadcast: Started reading broadcast variable 41 with 1 pieces (estimated total size 4.0 MiB)
24/12/09 08:25:06 INFO MemoryStore: Block broadcast_41_piece0 stored as bytes in memory (estimated size 35.0 KiB, free 8.4 GiB)
24/12/09 08:25:06 INFO TorrentBroadcast: Reading broadcast variable 41 took 2 ms
24/12/09 08:25:06 INFO MemoryStore: Block broadcast_41 stored as values in memory (estimated size 81.8 KiB, free 8.4 GiB)
24/12/09 08:25:06 ERROR Executor: Exception in task 140.0 in stage 34.0 (TID 12218)
java.io.IOException: unexpected exception type
at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1750)
at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1280)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2222)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
--
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
--
at java.lang.invoke.CallSite.makeSite(CallSite.java:341)
at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
at java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
at org.apache.spark.sql.catalyst.InternalRow$.$deserializeLambda$(InternalRow.scala)
... 337 more
Caused by: java.lang.NullPointerException
at java.lang.invoke.CallSite.makeSite(CallSite.java:325)
... 340 more
24/12/09 08:25:06 INFO CoarseGrainedExecutorBackend: Got assigned task 12226
24/12/09 08:25:06 INFO Executor: Running task 158.0 in stage 34.0 (TID 12226)
24/12/09 08:25:06 INFO TorrentBroadcast: Started reading broadcast variable 40 with 1 pieces (estimated total size 4.0 MiB)
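For context on why this surfaces as java.io.IOException: unexpected exception type: the task closure is deserialized with plain Java serialization, and JVM lambdas round-trip through java.lang.invoke.SerializedLambda. Its readResolve reflectively invokes the capturing class's $deserializeLambda$ (here InternalRow$), and linking that method's call site is what dies with the NullPointerException in CallSite.makeSite; ObjectStreamClass then wraps the checked exception escaping readResolve in that IOException. A minimal standalone sketch of the same round trip (our own illustration, not code from the failing job):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object LambdaRoundTrip {
  def main(args: Array[String]): Unit = {
    // Scala 2.12+ emits lambdas as serializable invokedynamic call sites.
    val f: Int => Int = _ + 1

    // Serialize: writeReplace swaps the lambda for a java.lang.invoke.SerializedLambda.
    val buf = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(buf)
    oos.writeObject(f)
    oos.close()

    // Deserialize: SerializedLambda.readResolve reflectively invokes the capturing
    // class's $deserializeLambda$, whose call-site linking (CallSite.makeSite) is
    // the frame where the NullPointerException in the log above was thrown.
    val ois = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    val g = ois.readObject().asInstanceOf[Int => Int]
    println(g(41)) // 42
  }
}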
When we zoom in on the last part of the call stack, we see:
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:87)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
This is on Spark 3.4.3, so ShuffleMapTask.scala:87 is just deserializing the broadcast task binary into an (RDD[_], ShuffleDependency[_, _, _]) tuple.
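Spark hardwires the closure serializer to JavaSerializer, which matches the JavaSerializerInstance frames above. A runnable stand-in for that step (the payload here is hypothetical; the real task binary holds the RDD/ShuffleDependency tuple; assumes spark-core on the classpath):

import java.nio.ByteBuffer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer

object TaskBinaryRoundTrip {
  def main(args: Array[String]): Unit = {
    // Same serializer the executor uses for task closures.
    val ser = new JavaSerializer(new SparkConf()).newInstance()

    // Stand-in payload; the real task binary holds (RDD[_], ShuffleDependency[_, _, _]).
    val payload: (String, Int => Int) = ("stand-in", _ + 1)

    val bytes: ByteBuffer = ser.serialize(payload)

    // The step that intermittently blows up in the log above:
    // JavaSerializerInstance.deserialize walking the object graph,
    // including any serialized lambdas it contains.
    val back = ser.deserialize[(String, Int => Int)](
      bytes, Thread.currentThread.getContextClassLoader)
    println(back._2(41)) // 42
  }
}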
The odd part is that this appears to pass on retry. Currently I suspect some kind of memory/network corruption, because the Grace Hopper hardware we are running on is pre-production, but the failure is not specific to a single node and it is specific to a single query, which makes it more fun to try to debug.