
Add support for dumping write data to try and reproduce error cases #11864

Merged: 7 commits merged into NVIDIA:branch-25.02 on Dec 18, 2024

Conversation

revans2 commented Dec 11, 2024

This fixes #11853

This does not include any C++ code for interpreting the jcudf serialization format, which would make reproducing issues offline simpler, but I will begin working on some example code in spark-rapids-jni to help with this.
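
Until that spark-rapids-jni tooling exists, the dumped jcudf-serialized batches can in principle be read back with the cuDF Java API. A rough, unverified sketch (the helper name, the assumption that batches are simply concatenated in the file, and the exact end-of-stream behavior are all assumptions on my part, not something this PR provides):

```scala
import java.io.{BufferedInputStream, FileInputStream}
import ai.rapids.cudf.JCudfSerialization

// Hypothetical offline reader for a dumped ".debug" file, assuming it holds
// one or more concatenated jcudf-serialized batches. Not part of this PR.
def printBatchRowCounts(path: String): Unit = {
  val in = new BufferedInputStream(new FileInputStream(path))
  try {
    var done = false
    while (!done) {
      val pair = JCudfSerialization.readTableFrom(in)
      try {
        if (pair.getTable == null) {
          done = true // assumed end-of-stream signal
        } else {
          println(s"read a batch with ${pair.getTable.getRowCount} rows")
        }
      } finally {
        pair.close()
      }
    }
  } finally {
    in.close()
  }
}
```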

I also have not added any documentation yet. I am not 100% sure what we want to document here, but we can try it out and see.

revans2 commented Dec 11, 2024

build

revans2 commented Dec 11, 2024

build

jlowe previously approved these changes Dec 11, 2024

jlowe left a comment:

Suggestion for handling double runs better, but otherwise LGTM.

f"-c$fileCounter%03d" + ".debug"
} else {
base + "/" + partDir.mkString("/") + s"/DEBUG_" +
taskAttemptContext.getTaskAttemptID.toString + f"-c$fileCounter%03d" + ".debug"
jlowe:
Do we want to leverage the application ID to help make this more unique? Otherwise if someone runs this twice in a row without updating the base path, we're going to get errors due to files already existing. The reader dump path uses random IDs and retries to make this possible, so it would be nice if it was also supported here.
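
For reference, the random-ID-plus-retry pattern described above for the reader dump path looks roughly like the following. This is an illustrative sketch with hypothetical names, not code from this repository:

```scala
import java.io.IOException
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: pick a random suffix and retry on collision so repeated runs against
// the same base directory never fail with "file already exists".
def createUniqueDumpFile(
    baseDir: String,
    prefix: String,
    conf: Configuration,
    maxAttempts: Int = 10): Path = {
  val fs = FileSystem.get(new URI(baseDir), conf)
  var attempt = 0
  while (attempt < maxAttempts) {
    val candidate =
      new Path(baseDir, f"$prefix-${scala.util.Random.nextInt(Int.MaxValue)}%08x.debug")
    try {
      // overwrite = false: creation fails if the file already exists
      fs.create(candidate, false).close()
      return candidate
    } catch {
      case _: IOException => attempt += 1 // collision, try another random suffix
    }
  }
  throw new IOException(s"could not create a unique dump file under $baseDir")
}
```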

revans2: I was thinking about that, but I didn't know how to get the application id, and I thought you had issues getting it consistently too.

revans2: I'll put in a timestamp to disambiguate.
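
A minimal sketch of that idea, with hypothetical parameter names mirroring the snippet above (the merged change may wire this differently):

```scala
// Fold a timestamp into the debug dump path so running the same job twice
// against the same base directory produces different file names.
def debugDumpPath(
    base: String,
    partDir: Seq[String],
    taskAttemptId: String,
    fileCounter: Int): String = {
  val ts = System.currentTimeMillis() // differs between runs, disambiguating them
  val dir = if (partDir.isEmpty) base else base + "/" + partDir.mkString("/")
  s"$dir/DEBUG_${ts}_$taskAttemptId" + f"-c$fileCounter%03d" + ".debug"
}
```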

revans2 commented Dec 11, 2024

build

revans2 commented Dec 12, 2024

I am getting the following error in Databricks:

[2024-12-12T01:41:29.155Z] E                   Caused by: java.io.NotSerializableException: org.apache.spark.sql.hive.execution.HiveFileFormat
[2024-12-12T01:41:29.155Z] E                   Serialization stack:
[2024-12-12T01:41:29.155Z] E                   	- object not serializable (class: org.apache.spark.sql.hive.execution.HiveFileFormat, value: org.apache.spark.sql.hive.execution.HiveFileFormat@12a5fc4f)
[2024-12-12T01:41:29.155Z] E                   	- field (class: org.apache.spark.sql.execution.datasources.GpuWriteFilesExec, name: fileFormat, type: interface org.apache.spark.sql.execution.datasources.FileFormat)
[2024-12-12T01:41:29.155Z] E                   	- object (class org.apache.spark.sql.execution.datasources.GpuWriteFilesExec, GpuWriteFiles
[2024-12-12T01:41:29.155Z] E                   +- GpuSort [((hive-hash(_c0#602053, _c1#602054, _c2#602055, _c3#602056, _c4#602057L, _c5#602058, _c6#602059, _c7#602060, _c8#602061, _c9#602062) & 2147483647) pmod 4) ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@2fcfd62d, [loreId=4]
[2024-12-12T01:41:29.155Z] E                      +- GpuRowToColumnar targetsize(104857600), [loreId=3]
[2024-12-12T01:41:29.155Z] E                         +- *(1) Scan ExistingRDD[_c0#602053,_c1#602054,_c2#602055,_c3#602056,_c4#602057L,_c5#602058,_c6#602059,_c7#602060,_c8#602061,_c9#602062]
[2024-12-12T01:41:29.155Z] E                   )
[2024-12-12T01:41:29.155Z] E                   	- element of array (index: 0)
[2024-12-12T01:41:29.155Z] E                   	- array (class [Ljava.lang.Object;, size 5)
[2024-12-12T01:41:29.155Z] E                   	- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
[2024-12-12T01:41:29.155Z] E                   	- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.datasources.GpuWriteFilesExec, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/datasources/GpuWriteFilesExec.$anonfun$doExecuteColumnarWrite$1:(Lorg/apache/spark/sql/execution/datasources/GpuWriteFilesExec;Lorg/apache/spark/sql/rapids/GpuWriteJobDescription;Ljava/lang/String;Lorg/apache/spark/internal/io/FileCommitProtocol;Lscala/Option;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=5])
[2024-12-12T01:41:29.155Z] E                   	- writeReplace data (class: java.lang.invoke.SerializedLambda)
[2024-12-12T01:41:29.155Z] E                   	- object (class org.apache.spark.sql.execution.datasources.GpuWriteFilesExec$$Lambda$7125/2116929153, org.apache.spark.sql.execution.datasources.GpuWriteFilesExec$$Lambda$7125/2116929153@7e543e7a)
[2024-12-12T01:41:29.155Z] E                   	- element of array (index: 0)
[2024-12-12T01:41:29.155Z] E                   	- array (class [Ljava.lang.Object;, size 1)
[2024-12-12T01:41:29.155Z] E                   	- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
[2024-12-12T01:41:29.155Z] E                   	- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/RDD.$anonfun$mapPartitionsInternal$2$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=1])
[2024-12-12T01:41:29.155Z] E                   	- writeReplace data (class: java.lang.invoke.SerializedLambda)
[2024-12-12T01:41:29.155Z] E                   	- object (class org.apache.spark.rdd.RDD$$Lambda$3245/1997275976, org.apache.spark.rdd.RDD$$Lambda$3245/1997275976@2e0b8061)
[2024-12-12T01:41:29.155Z] E                   	- field (class: org.apache.spark.rdd.MapPartitionsRDD, name: f, type: interface scala.Function3)
[2024-12-12T01:41:29.155Z] E                   	- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[76674] at runColumnar at GpuDataWritingCommandExec.scala:116)
[2024-12-12T01:41:29.155Z] E                   	- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
[2024-12-12T01:41:29.155Z] E                   	- object (class scala.Tuple2, (MapPartitionsRDD[76674] at runColumnar at GpuDataWritingCommandExec.scala:116,org.apache.spark.sql.rapids.GpuFileFormatWriter$$$Lambda$7126/1474465336@59492c9b))
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1982)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1608)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1550)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3598)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3589)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3577)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)

I'll try to understand why the HiveFileFormat is no longer serializable
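
For context, a common cause of this kind of NotSerializableException is that the task closure captures the whole plan node (here GpuWriteFilesExec), dragging its non-serializable fileFormat field into the serialized task. One generic way to avoid it, shown with purely illustrative names rather than the actual fix in this PR, is to copy the serializable values the closure needs into local vals so the node itself is never captured:

```scala
import org.apache.spark.rdd.RDD

// Illustrative only: an outer class that holds a non-serializable field.
class ExampleWriter(val fileFormat: AnyRef /* pretend this is not serializable */) {
  private val formatName = fileFormat.getClass.getName

  def tag(rdd: RDD[Int]): RDD[String] = {
    // Copy what the closure needs into a local val so the closure captures
    // `name` (a plain String) rather than `this` (which holds `fileFormat`).
    val name = formatName
    rdd.mapPartitions { iter =>
      iter.map(i => s"$name-$i")
    }
  }
}
```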

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 commented Dec 12, 2024

build

revans2 commented Dec 13, 2024

build

revans2 commented Dec 13, 2024

@jlowe please take another look. I had a test failure related to the logging changes that happened in CUDF recently. I upmerged so it should be fixed now.

jlowe previously approved these changes Dec 13, 2024

revans2 commented Dec 17, 2024

build

revans2 commented Dec 17, 2024

Sorry @jlowe, there was a merge conflict, so I need your approval yet again.

revans2 changed the title from "Add support for dumping write data to try and reproduce error cases" to "[DATABRICKS] Add support for dumping write data to try and reproduce error cases" on Dec 17, 2024

revans2 commented Dec 17, 2024

build

revans2 commented Dec 18, 2024

CI timed out on one job. This change is important for debugging, so I am just going to merge it.

revans2 merged commit 3f26d33 into NVIDIA:branch-25.02 on Dec 18, 2024 (49 of 50 checks passed)
Successfully merging this pull request may close these issues.

[FEA] Ability to dump tables on a write