[BUG] self-profiler hangs on spark-submit when trying to profile an entire job #11871

Open
thirtiseven opened this issue Dec 13, 2024 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@thirtiseven
Collaborator

Describe the bug
I would like to profile a PySpark job with the self-profiler. Here is the command I use:

spark-submit --master local[*] \
  --jars ${SPARK_RAPIDS_PLUGIN_JAR} \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.profile.pathPrefix=file:///home/haoyangl/rapids-nsys \
  --conf spark.rapids.profile.executors=0,driver \
  --conf spark.rapids.profile.compression=zstd \
  --conf spark.executor.extraJavaOptions="-Dai.rapids.cudf.nvtx.enabled=true" \
  --conf spark.driver.extraJavaOptions="-Dai.rapids.cudf.nvtx.enabled=true" \
  test_profiler.py

Here is the PySpark job, though I think any kind of job will hit this issue:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test").getOrCreate()
df = spark.createDataFrame([("a",)], ["a"])

df.write.parquet("TEST")

df = spark.read.parquet("TEST")

df.selectExpr("ascii(a) AS ascii_value").show()

The job itself finishes, but spark-submit then hangs after printing:

24/12/13 16:45:59 INFO DAGScheduler: Job 2 finished: showString at NativeMethodAccessorImpl.java:0, took 0.081151 s
24/12/13 16:45:59 INFO CodeGenerator: Code generated in 13.786791 ms
+-----------+
|ascii_value|
+-----------+
|         97|
+-----------+

Pressing Ctrl-C at that point lets the profiler finish and write its output:

^C24/12/13 16:46:22 INFO RapidsBufferCatalog: Closing storage
24/12/13 16:46:24 WARN ProfileWriter: Profiling completed, output written to file:/home/haoyangl/rapids-nsys/[email protected]
24/12/13 16:46:24 WARN ProfilerOnDriver: Profiling: Executor driver ended profiling, profile written to file:/home/haoyangl/rapids-nsys/[email protected]
24/12/13 16:46:24 INFO AwsStorageExecutorPlugin: Shutting down S3 Plugin ... 
24/12/13 16:46:24 INFO SparkContext: Invoking stop() from shutdown hook
24/12/13 16:46:24 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/12/13 16:46:24 INFO SparkUI: Stopped Spark web UI at http://spark-haoyang:4040

After that, the remaining shutdown steps complete normally.
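To check for the hang without watching the terminal, the command can be wrapped in a timeout. This is only a minimal sketch: the helper name run_and_detect_hang and both time limits are hypothetical, and it assumes a POSIX system where SIGINT behaves like Ctrl-C, so the shutdown hook still gets a chance to write the profile.

```python
import signal
import subprocess


def run_and_detect_hang(cmd, timeout_s, grace_s=10.0):
    """Run cmd and return True if it was still running after timeout_s seconds.

    Hypothetical helper: cmd is an argument list such as
    ["spark-submit", ..., "test_profiler.py"].
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        return False  # the process exited on its own
    except subprocess.TimeoutExpired:
        pass

    # Still running: send SIGINT (the signal Ctrl-C delivers) so that
    # shutdown hooks can run before the process exits.
    proc.send_signal(signal.SIGINT)
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        # Last resort if SIGINT is ignored; shutdown hooks will not run.
        proc.kill()
        proc.wait()
    return True
```

Calling it with the spark-submit argument list and a timeout comfortably longer than the job's normal runtime should return True for the hanging configuration and False otherwise.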

However, when a stage limit is added via spark.rapids.profile.stages, it does not hang:

spark-submit --master local[*] \
  --jars ${SPARK_RAPIDS_PLUGIN_JAR} \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.profile.pathPrefix=file:///home/haoyangl/rapids-nsys \
  --conf spark.rapids.profile.executors=0,driver \
  --conf spark.rapids.profile.stages=1,2,3,4 \
  --conf spark.rapids.profile.compression=zstd \
  --conf spark.executor.extraJavaOptions="-Dai.rapids.cudf.nvtx.enabled=true" \
  --conf spark.driver.extraJavaOptions="-Dai.rapids.cudf.nvtx.enabled=true" \
  test_profiler.py
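
The two commands are long enough that the difference is easy to miss. A minimal sketch that parses both and compares their --conf settings (the helper name conf_settings is hypothetical; the command strings are copied from this report):

```python
import shlex


def conf_settings(command):
    """Return {key: value} for every `--conf key=value` pair in a command line."""
    tokens = shlex.split(command)
    settings = {}
    for flag, arg in zip(tokens, tokens[1:]):
        if flag == "--conf":
            key, _, value = arg.partition("=")
            settings[key] = value
    return settings


HANGING_CMD = (
    'spark-submit --master local[*] --jars ${SPARK_RAPIDS_PLUGIN_JAR} '
    '--conf spark.plugins=com.nvidia.spark.SQLPlugin '
    '--conf spark.rapids.sql.enabled=true '
    '--conf spark.rapids.sql.explain=ALL '
    '--conf spark.rapids.profile.pathPrefix=file:///home/haoyangl/rapids-nsys '
    '--conf spark.rapids.profile.executors=0,driver '
    '--conf spark.rapids.profile.compression=zstd '
    '--conf spark.executor.extraJavaOptions="-Dai.rapids.cudf.nvtx.enabled=true" '
    '--conf spark.driver.extraJavaOptions="-Dai.rapids.cudf.nvtx.enabled=true" '
    'test_profiler.py'
)

WORKING_CMD = (
    'spark-submit --master local[*] --jars ${SPARK_RAPIDS_PLUGIN_JAR} '
    '--conf spark.plugins=com.nvidia.spark.SQLPlugin '
    '--conf spark.rapids.sql.enabled=true '
    '--conf spark.rapids.sql.explain=ALL '
    '--conf spark.rapids.profile.pathPrefix=file:///home/haoyangl/rapids-nsys '
    '--conf spark.rapids.profile.executors=0,driver '
    '--conf spark.rapids.profile.stages=1,2,3,4 '
    '--conf spark.rapids.profile.compression=zstd '
    '--conf spark.executor.extraJavaOptions="-Dai.rapids.cudf.nvtx.enabled=true" '
    '--conf spark.driver.extraJavaOptions="-Dai.rapids.cudf.nvtx.enabled=true" '
    'test_profiler.py'
)

hanging = conf_settings(HANGING_CMD)
working = conf_settings(WORKING_CMD)

# Settings present (or different) only in the command that does not hang.
added = {k: v for k, v in working.items() if hanging.get(k) != v}
# Settings present only in the command that hangs.
removed = {k: v for k, v in hanging.items() if k not in working}

print("added:", added)
print("removed:", removed)
```

The only setting that differs is spark.rapids.profile.stages=1,2,3,4, so the hang appears tied to profiling without a stage restriction.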
thirtiseven added the "? - Needs Triage" and "bug" labels on Dec 13, 2024
mattahrens removed the "? - Needs Triage" label on Dec 17, 2024

@mattahrens
Collaborator

Have you tried reverting the stage-limits PR (#11708) to see whether the problem still reproduces?

@thirtiseven
Collaborator Author

> Have you tried reverting the PR related to stage limits to see if the problem is not reproducible? #11708

Yes, it can still be reproduced without #11708. I will look into it.
