
Add support for dumping write data to try and reproduce error cases #11864

Merged: 7 commits merged into NVIDIA:branch-25.02 on Dec 18, 2024

Conversation

revans2 commented Dec 11, 2024

This fixes #11853

This does not include any C++ code for interpreting the jcudf serialization format, which would make reproducing issues offline simpler, but I will begin working on some example code in spark-rapids-jni to help with this.
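
Until that spark-rapids-jni tooling exists, the dumped jcudf-serialized batches can in principle be read back with the cuDF Java API. A rough, unverified sketch (the helper name, the assumption that batches are simply concatenated in the file, and the exact end-of-stream behavior are all assumptions on my part, not something this PR provides):

```scala
import java.io.{BufferedInputStream, FileInputStream}
import ai.rapids.cudf.JCudfSerialization

// Hypothetical offline reader for a dumped ".debug" file, assuming it holds
// one or more concatenated jcudf-serialized batches. Not part of this PR.
def printBatchRowCounts(path: String): Unit = {
  val in = new BufferedInputStream(new FileInputStream(path))
  try {
    var done = false
    while (!done) {
      val pair = JCudfSerialization.readTableFrom(in)
      try {
        if (pair.getTable == null) {
          done = true // assumed end-of-stream signal
        } else {
          println(s"read a batch with ${pair.getTable.getRowCount} rows")
        }
      } finally {
        pair.close()
      }
    }
  } finally {
    in.close()
  }
}
```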

I also have not added any documentation yet. I am not 100% sure what we want to document here, but we can try it out and see.

revans2 commented Dec 11, 2024

build

revans2 commented Dec 11, 2024

build

jlowe previously approved these changes Dec 11, 2024

jlowe left a comment:

Suggestion for handling double runs better, but otherwise LGTM.

f"-c$fileCounter%03d" + ".debug"
} else {
base + "/" + partDir.mkString("/") + s"/DEBUG_" +
taskAttemptContext.getTaskAttemptID.toString + f"-c$fileCounter%03d" + ".debug"
jlowe:
Do we want to leverage the application ID to help make this more unique? Otherwise if someone runs this twice in a row without updating the base path, we're going to get errors due to files already existing. The reader dump path uses random IDs and retries to make this possible, so it would be nice if it was also supported here.
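
For reference, the random-ID-plus-retry pattern described above for the reader dump path looks roughly like the following. This is an illustrative sketch with hypothetical names, not code from this repository:

```scala
import java.io.IOException
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: pick a random suffix and retry on collision so repeated runs against
// the same base directory never fail with "file already exists".
def createUniqueDumpFile(
    baseDir: String,
    prefix: String,
    conf: Configuration,
    maxAttempts: Int = 10): Path = {
  val fs = FileSystem.get(new URI(baseDir), conf)
  var attempt = 0
  while (attempt < maxAttempts) {
    val candidate =
      new Path(baseDir, f"$prefix-${scala.util.Random.nextInt(Int.MaxValue)}%08x.debug")
    try {
      // overwrite = false: creation fails if the file already exists
      fs.create(candidate, false).close()
      return candidate
    } catch {
      case _: IOException => attempt += 1 // collision, try another random suffix
    }
  }
  throw new IOException(s"could not create a unique dump file under $baseDir")
}
```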

revans2: I was thinking about that, but I didn't know how to get the application id, and I thought you had issues getting it consistently too.

revans2: I'll put in a timestamp to disambiguate.
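
A minimal sketch of that idea, with hypothetical parameter names mirroring the snippet above (the merged change may wire this differently):

```scala
// Fold a timestamp into the debug dump path so running the same job twice
// against the same base directory produces different file names.
def debugDumpPath(
    base: String,
    partDir: Seq[String],
    taskAttemptId: String,
    fileCounter: Int): String = {
  val ts = System.currentTimeMillis() // differs between runs, disambiguating them
  val dir = if (partDir.isEmpty) base else base + "/" + partDir.mkString("/")
  s"$dir/DEBUG_${ts}_$taskAttemptId" + f"-c$fileCounter%03d" + ".debug"
}
```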

revans2 commented Dec 11, 2024

build

revans2 commented Dec 12, 2024

I am getting the following error in Databricks:

[2024-12-12T01:41:29.155Z] E                   Caused by: java.io.NotSerializableException: org.apache.spark.sql.hive.execution.HiveFileFormat
[2024-12-12T01:41:29.155Z] E                   Serialization stack:
[2024-12-12T01:41:29.155Z] E                   	- object not serializable (class: org.apache.spark.sql.hive.execution.HiveFileFormat, value: org.apache.spark.sql.hive.execution.HiveFileFormat@12a5fc4f)
[2024-12-12T01:41:29.155Z] E                   	- field (class: org.apache.spark.sql.execution.datasources.GpuWriteFilesExec, name: fileFormat, type: interface org.apache.spark.sql.execution.datasources.FileFormat)
[2024-12-12T01:41:29.155Z] E                   	- object (class org.apache.spark.sql.execution.datasources.GpuWriteFilesExec, GpuWriteFiles
[2024-12-12T01:41:29.155Z] E                   +- GpuSort [((hive-hash(_c0#602053, _c1#602054, _c2#602055, _c3#602056, _c4#602057L, _c5#602058, _c6#602059, _c7#602060, _c8#602061, _c9#602062) & 2147483647) pmod 4) ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@2fcfd62d, [loreId=4]
[2024-12-12T01:41:29.155Z] E                      +- GpuRowToColumnar targetsize(104857600), [loreId=3]
[2024-12-12T01:41:29.155Z] E                         +- *(1) Scan ExistingRDD[_c0#602053,_c1#602054,_c2#602055,_c3#602056,_c4#602057L,_c5#602058,_c6#602059,_c7#602060,_c8#602061,_c9#602062]
[2024-12-12T01:41:29.155Z] E                   )
[2024-12-12T01:41:29.155Z] E                   	- element of array (index: 0)
[2024-12-12T01:41:29.155Z] E                   	- array (class [Ljava.lang.Object;, size 5)
[2024-12-12T01:41:29.155Z] E                   	- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
[2024-12-12T01:41:29.155Z] E                   	- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.datasources.GpuWriteFilesExec, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/datasources/GpuWriteFilesExec.$anonfun$doExecuteColumnarWrite$1:(Lorg/apache/spark/sql/execution/datasources/GpuWriteFilesExec;Lorg/apache/spark/sql/rapids/GpuWriteJobDescription;Ljava/lang/String;Lorg/apache/spark/internal/io/FileCommitProtocol;Lscala/Option;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=5])
[2024-12-12T01:41:29.155Z] E                   	- writeReplace data (class: java.lang.invoke.SerializedLambda)
[2024-12-12T01:41:29.155Z] E                   	- object (class org.apache.spark.sql.execution.datasources.GpuWriteFilesExec$$Lambda$7125/2116929153, org.apache.spark.sql.execution.datasources.GpuWriteFilesExec$$Lambda$7125/2116929153@7e543e7a)
[2024-12-12T01:41:29.155Z] E                   	- element of array (index: 0)
[2024-12-12T01:41:29.155Z] E                   	- array (class [Ljava.lang.Object;, size 1)
[2024-12-12T01:41:29.155Z] E                   	- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
[2024-12-12T01:41:29.155Z] E                   	- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/rdd/RDD.$anonfun$mapPartitionsInternal$2$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=1])
[2024-12-12T01:41:29.155Z] E                   	- writeReplace data (class: java.lang.invoke.SerializedLambda)
[2024-12-12T01:41:29.155Z] E                   	- object (class org.apache.spark.rdd.RDD$$Lambda$3245/1997275976, org.apache.spark.rdd.RDD$$Lambda$3245/1997275976@2e0b8061)
[2024-12-12T01:41:29.155Z] E                   	- field (class: org.apache.spark.rdd.MapPartitionsRDD, name: f, type: interface scala.Function3)
[2024-12-12T01:41:29.155Z] E                   	- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[76674] at runColumnar at GpuDataWritingCommandExec.scala:116)
[2024-12-12T01:41:29.155Z] E                   	- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
[2024-12-12T01:41:29.155Z] E                   	- object (class scala.Tuple2, (MapPartitionsRDD[76674] at runColumnar at GpuDataWritingCommandExec.scala:116,org.apache.spark.sql.rapids.GpuFileFormatWriter$$$Lambda$7126/1474465336@59492c9b))
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1982)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1608)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1550)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3598)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3589)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3577)
[2024-12-12T01:41:29.155Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)

I'll try to understand why the HiveFileFormat is no longer serializable
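
For context, a common cause of this kind of NotSerializableException is that the task closure captures the whole plan node (here GpuWriteFilesExec), dragging its non-serializable fileFormat field into the serialized task. One generic way to avoid it, shown with purely illustrative names rather than the actual fix in this PR, is to copy the serializable values the closure needs into local vals so the node itself is never captured:

```scala
import org.apache.spark.rdd.RDD

// Illustrative only: an outer class that holds a non-serializable field.
class ExampleWriter(val fileFormat: AnyRef /* pretend this is not serializable */) {
  private val formatName = fileFormat.getClass.getName

  def tag(rdd: RDD[Int]): RDD[String] = {
    // Copy what the closure needs into a local val so the closure captures
    // `name` (a plain String) rather than `this` (which holds `fileFormat`).
    val name = formatName
    rdd.mapPartitions { iter =>
      iter.map(i => s"$name-$i")
    }
  }
}
```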

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 commented Dec 12, 2024

build

revans2 commented Dec 13, 2024

build

revans2 commented Dec 13, 2024

@jlowe please take another look. I had a test failure related to the logging changes that happened in CUDF recently. I upmerged so it should be fixed now.

jlowe previously approved these changes Dec 13, 2024

revans2 commented Dec 17, 2024

build

revans2 commented Dec 17, 2024

Sorry @jlowe, there was a merge conflict, so I need your approval yet again.

revans2 changed the title from "Add support for dumping write data to try and reproduce error cases" to "[DATABRICKS] Add support for dumping write data to try and reproduce error cases" on Dec 17, 2024

revans2 commented Dec 17, 2024

build

revans2 commented Dec 18, 2024

CI timed out on one job. This change is important for debugging, so I am just going to merge it.

revans2 merged commit 3f26d33 into NVIDIA:branch-25.02 on Dec 18, 2024 (49 of 50 checks passed)
Successfully merging this pull request may close these issues.

[FEA] Ability to dump tables on a write