
[SPARK-52185][CORE] Automate the thread dump collection for Spark applications #50919

Open · roczei wants to merge 1 commit into master from SPARK-52185

Conversation


@roczei roczei commented May 16, 2025

What changes were proposed in this pull request?

When a Java program runs for a long time without giving any feedback or output, how do you determine what it might be doing and whether it is stuck? Thread dumps can help in such cases. A thread dump shows the status of each thread (running/waiting/blocked) and which part of the code each thread is executing, which makes it possible to detect deadlocks and to see which part of the program is running.

The purpose of this pull request is to collect thread dumps at regular intervals. Why? A single thread dump only shows a snapshot of the threads; taking several allows us to see whether threads are progressing by comparing their states.

Collecting thread dump samples from slow Spark executors or drivers can be challenging, especially in YARN or Kubernetes environments.

Existing options for collecting thread dumps (a sketch of the manual polling this requires follows the list):

  1. Find out where the Java Virtual Machine (JVM) is running, then run the jstack command against it manually.
  2. Download the thread dumps from the Spark UI. For example: http://localhost:4040/executors/threadDump/?executorId=driver
  3. Download the thread dumps via the Spark REST API. For example:
curl "http://localhost:4040/api/v1/applications/local-1747400853731/executors/driver/threads"

Why are the changes needed?

The purpose of this feature request is to automate thread dump collection at regular intervals. The following new Spark parameters have been introduced (a programmatic configuration sketch follows the list):

  • spark.driver.threadDumpCollector.enabled
  • spark.executor.threadDumpCollector.enabled
  • spark.threadDumpCollector.interval
  • spark.threadDumpCollector.dir
  • spark.threadDumpCollector.output.type
  • spark.threadDumpCollector.include.regex
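
For illustration, the same settings could also be applied programmatically; a minimal sketch using SparkConf (parameter names as introduced by this PR; the values and HDFS path are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: enable the thread dump collector via SparkConf instead of
// --conf flags. The HDFS path is a placeholder.
val conf = new SparkConf()
  .set("spark.driver.threadDumpCollector.enabled", "true")
  .set("spark.executor.threadDumpCollector.enabled", "true")
  .set("spark.threadDumpCollector.interval", "15s")
  .set("spark.threadDumpCollector.output.type", "FILE")
  .set("spark.threadDumpCollector.dir", "hdfs:///user/example/jstack_test")

val spark = SparkSession.builder().config(conf).getOrCreate()
```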

Example commands:

1)

spark-shell --master local-cluster[2,1,1050] \
  --conf spark.driver.threadDumpCollector.enabled=true \
  --conf spark.executor.threadDumpCollector.enabled=true \
  --conf spark.threadDumpCollector.interval=15s \
  --conf spark.threadDumpCollector.output.type=FILE \
  --conf spark.threadDumpCollector.dir=hdfs:///user/example/jstack_test

The thread dumps will be saved to hdfs:///user/example/jstack_test; example file names: app-20250516161130-0000-driver-2025-05-16_16_12_50.txt, app-20250516161130-0000-0-2025-05-16_16_12_51.txt
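
To inspect the collected dumps afterwards, one could list them with the Hadoop FileSystem API; a sketch, assuming the Hadoop client is on the classpath and using the directory from the example:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: list the thread dump files written to the directory that was
// passed via spark.threadDumpCollector.dir.
val fs = FileSystem.get(new URI("hdfs:///user/example/jstack_test"), new Configuration())
fs.listStatus(new Path("/user/example/jstack_test"))
  .filter(_.getPath.getName.endsWith(".txt"))
  .foreach(status => println(status.getPath.getName))
```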

2)

spark-shell --master local-cluster[2,1,1050] \
  --conf spark.driver.threadDumpCollector.enabled=true \
  --conf spark.executor.threadDumpCollector.enabled=true \
  --conf spark.threadDumpCollector.interval=15s \
  --conf spark.threadDumpCollector.output.type=LOG

The thread dumps will be written to the log messages.

3)

spark-shell --master local-cluster[2,1,1050] \
  --conf spark.driver.threadDumpCollector.enabled=true \
  --conf spark.executor.threadDumpCollector.enabled=true \
  --conf spark.threadDumpCollector.interval=15s \
  --conf spark.threadDumpCollector.output.type=LOG \
  --conf spark.threadDumpCollector.include.regex=something

Only thread dumps that match the given regular expression (spark.threadDumpCollector.include.regex) will be captured.
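
To picture the filtering semantics, here is a hypothetical sketch (not the PR's actual implementation; the blank-line block splitting and helper name are illustrative) that keeps only the per-thread sections of a dump matching the pattern:

```scala
import scala.util.matching.Regex

// Hypothetical sketch of include-regex filtering: keep only the
// per-thread sections of a dump whose text matches the pattern.
def filterThreadDump(dump: String, pattern: Regex): String =
  dump.split("\n\n")
    .filter(block => pattern.findFirstIn(block).isDefined)
    .mkString("\n\n")

val sampleDump =
  """"main" RUNNABLE
    |  at com.example.Something.work(Something.scala:42)""".stripMargin + "\n\n" +
  """"shuffle-client-1" WAITING
    |  at java.lang.Object.wait(Native Method)""".stripMargin

println(filterThreadDump(sampleDump, "Something".r)) // keeps only the "main" block
```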

Does this PR introduce any user-facing change?

Yes, see above

How was this patch tested?

New unit tests have been added, and the change has been tested manually as well.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label May 16, 2025
@roczei roczei force-pushed the SPARK-52185 branch 3 times, most recently from 70eedbf to 76cdbac on May 17, 2025 at 05:07