[SPARK-52185][CORE] Automate the thread dump collection for Spark applications #50919
What changes were proposed in this pull request?
When a Java program runs for a long time without producing any feedback or output, how do you determine what it might be doing and whether it is stuck? Thread dumps can help in such cases. A thread dump shows the status of each thread (whether it is running, waiting, or blocked) and which part of the code each thread is executing, which makes it valuable for detecting deadlocks and understanding which part of the program is actually running.
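For illustration (this is not the PR's code, just the standard JVM API the idea rests on), a thread dump can be captured in-process with `ThreadMXBean`; each `ThreadInfo` carries the thread's name, its state, and its stack trace:

```scala
import java.lang.management.ManagementFactory

// Capture a single thread dump of the current JVM and print, for each
// thread, its name, its state (RUNNABLE, WAITING, BLOCKED, ...) and its
// stack trace. Passing (true, true) also reports locked monitors and
// synchronizers, which is what makes deadlocks visible in the dump.
object SingleThreadDump {
  def main(args: Array[String]): Unit = {
    val infos = ManagementFactory.getThreadMXBean.dumpAllThreads(true, true)
    infos.foreach { info =>
      println(s""""${info.getThreadName}" ${info.getThreadState}""")
      info.getStackTrace.foreach(frame => println(s"    at $frame"))
    }
  }
}
```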
The purpose of this pull request is to collect thread dumps at regular intervals. Why? A single thread dump only shows one snapshot of the threads; collecting several allows us to see whether threads are making progress by comparing their states over time. A minimal sketch of that idea follows.
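This sketch uses a plain `ScheduledExecutorService` purely to illustrate interval-based collection; the PR's actual collector is wired into Spark and driven by the new configs:

```scala
import java.lang.management.ManagementFactory
import java.time.LocalDateTime
import java.util.concurrent.{Executors, TimeUnit}

// Dump a compact thread summary every 30 seconds. Comparing the states of
// the same thread across successive snapshots shows whether it is making
// progress or stuck in the same frame.
object PeriodicThreadDumps {
  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val dump: Runnable = () => {
      println(s"=== Thread dump at ${LocalDateTime.now()} ===")
      ManagementFactory.getThreadMXBean.dumpAllThreads(false, false).foreach {
        info => println(s""""${info.getThreadName}" ${info.getThreadState}""")
      }
    }
    scheduler.scheduleAtFixedRate(dump, 0, 30, TimeUnit.SECONDS)
  }
}
```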
Collecting thread dump samples from slow Spark executors or drivers can be challenging, especially in YARN or Kubernetes environments.
Existing options for this kind of debugging, such as running jstack manually against the JVM or pulling a thread dump from the Spark UI, require manual, per-process intervention and are hard to automate.
Why are the changes needed?
The purpose of this feature is to automate thread dump collection at regular intervals. New Spark parameters have been introduced to enable and configure the collection; an illustrative configuration is sketched below.
Example commands
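As a hedged sketch of how these settings might be supplied (expressed here as `SparkConf` settings rather than a full spark-submit invocation): only `spark.threadDumpCollector.include.regex` is named explicitly in this description; the other keys and all values are illustrative assumptions.

```scala
import org.apache.spark.SparkConf

object ThreadDumpCollectorConf {
  // Hypothetical keys except include.regex, which this PR description names.
  val conf: SparkConf = new SparkConf()
    .set("spark.threadDumpCollector.enabled", "true")                         // assumed key
    .set("spark.threadDumpCollector.interval", "60s")                         // assumed key
    .set("spark.threadDumpCollector.dir", "hdfs:///user/example/jstack_test") // assumed key; path from the example below
    .set("spark.threadDumpCollector.include.regex", "Executor task launch.*") // key named in this PR
}
```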
When writing to files, the thread dumps are saved into hdfs:///user/example/jstack_test; example file names are app-20250516161130-0000-driver-2025-05-16_16_12_50.txt (driver) and app-20250516161130-0000-0-2025-05-16_16_12_51.txt (executor 0).
When writing to logs, the thread dumps are appended to the log messages instead.
Only threads whose names match the given regular expression (spark.threadDumpCollector.include.regex) are included in the captured dumps, as in the sketch below.
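A sketch of how such an include regex could filter by thread name (illustrative only; it does not reproduce the PR's exact matching semantics, and the pattern is just an example):

```scala
import java.lang.management.ManagementFactory

// Keep only the threads whose names match the include pattern, then print
// a one-line summary per matching thread.
object FilteredThreadDump {
  def main(args: Array[String]): Unit = {
    val includeRegex = "Executor task launch.*" // example pattern
    ManagementFactory.getThreadMXBean
      .dumpAllThreads(false, false)
      .filter(_.getThreadName.matches(includeRegex))
      .foreach(info => println(s""""${info.getThreadName}" ${info.getThreadState}"""))
  }
}
```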
Does this PR introduce any user-facing change?
Yes, see above
How was this patch tested?
New unit tests have been added, and the feature has been tested manually as well.
Was this patch authored or co-authored using generative AI tooling?
No