
Profiling tool: Add support for driver log as input to generate unsupported operators report #654

Merged (8 commits) Nov 15, 2023

Conversation

@nartal1 (Collaborator) commented Nov 8, 2023

This fixes #580

This adds support for reading a driver log as input and generating a report of unsupported operators as a CSV file.
A new argument, driverlog, is added, and an eventlog is no longer mandatory: either driverlog or eventlogs can be passed on the command line. The tool throws an error if neither driverlog nor eventlogs is provided.

The driver log output is written to a separate directory: rapids_4_spark_profile/driver/unsupported_operators_in_driver_log.csv
Sample csv output:

operatorName,count,reason
"CollectLimitExec",2,"the Exec CollectLimitExec has been disabled, and is disabled by default"
"ProjectExec",2,"not all expressions can be replaced"
"FormatNumber",1,"format_number with floating point types on the GPU returns results that have a different precision than the default results of Spark. To enable this operation on the GPU, set spark.rapids.sql.formatNumberFloat.enabled to true."
"Hex",1,"GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.Hex"

Added a unit test that reads a driver log file with the driverlog argument.
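The driver-log scan described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the PR's actual code: the object name `DriverLogParser`, the regex, and the assumed log-line shape (`... <SomeExec> cannot run on GPU because <reason>`) are all assumptions for illustration.

```scala
// Hypothetical sketch: turn driver-log lines into (operatorName, count, reason)
// rows matching the CSV columns shown above. The log-line shape assumed here is:
//   ... <SomeExec> cannot run on GPU because <reason>
object DriverLogParser {
  // Captures the operator name inside <...> and the free-form reason after "because".
  private val NotOnGpu = """.*<([A-Za-z0-9_$.]+)> cannot run on GPU because (.+)""".r

  // Group identical (operator, reason) pairs and count occurrences,
  // most frequent first.
  def unsupportedOps(lines: Iterator[String]): Seq[(String, Int, String)] =
    lines.collect { case NotOnGpu(op, reason) => (op, reason.trim) }
      .toSeq
      .groupBy(identity)
      .map { case ((op, reason), hits) => (op, hits.size, reason) }
      .toSeq
      .sortBy { case (op, count, _) => (-count, op) }

  // Render rows in the same quoted-CSV layout as the sample output above.
  def toCsv(rows: Seq[(String, Int, String)]): String =
    ("operatorName,count,reason" +: rows.map { case (op, n, why) =>
      s""""$op",$n,"$why""""
    }).mkString("\n")
}
```

In practice the lines would come from `scala.io.Source.fromFile(path).getLines()`; an `Iterator[String]` keeps the sketch testable without touching the filesystem.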

@nartal1 nartal1 added the core_tools Scope the core module (scala) label Nov 8, 2023
@nartal1 nartal1 self-assigned this Nov 8, 2023
@nartal1 nartal1 changed the title from "Profiler tool: Add support for driver log as input to generate unsupported operators report" to "Profiling tool: Add support for driver log as input to generate unsupported operators report" Nov 8, 2023
try {
  // Process each line in the file
  for (line <- source.getLines()) {
    if (line.contains("cannot run on GPU") &&
Collaborator commented:

We should protect against case sensitivity. This check is also very sensitive to us changing the wording here. Can you check with Bobby for ideas on how to make this more resilient and to make sure this covers all the cases?
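One way to address the case-sensitivity concern could look like the sketch below. This is illustrative only: the marker list is an assumption, not an exhaustive set of the plugin's actual wordings.

```scala
// Hypothetical sketch of a less brittle line check: normalize case and
// accept any of a set of known fallback phrasings rather than one exact string.
object FallbackMarkers {
  // Assumed marker phrases; the real plugin output may use other wordings.
  private val markers = Seq("cannot run on gpu", "not all expressions can be replaced")

  // Case-insensitive match against any known fallback phrasing.
  def mentionsGpuFallback(line: String): Boolean = {
    val lower = line.toLowerCase
    markers.exists(lower.contains)
  }
}
```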

Collaborator replied:

That code was not written with the goal of parsing the output. "cannot run on GPU" is a clue that something went bad, but there is no guarantee that it will show up. Each expression has the ability to override how it shows up in the output. That was to allow attribute references and literals to not show up unless they are falling back to the CPU even in the case when you say to show ALL. If it does not show up I would consider it a bug, but just grepping through for that phrase I see it in multiple places. Especially with deltalake, which makes me wonder if it could show up in other places, or if it is guaranteed to show up.

https://github.com/NVIDIA/spark-rapids/blob/25ffa873ad6211f14d5e402805e537b52d60d2dc/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsMeta.scala#L336

makes me think it will always show up, but then it is possible that it will not.

But "not all expressions can be replaced" is a free-form message. And there can be multiple reasons why something is not on the GPU. We could have a project that is not on the GPU because a child expression cannot be replaced AND a datatype that we do not support AND.... That is not robust at all.

https://docs.nvidia.com/spark-rapids/user-guide/23.10/faq.html#how-can-i-tell-what-will-run-on-the-gpu-and-what-will-not-run-on-it

kind of explains the format, but it is not perfect. It does not say the explanation is a ;-separated list of explanations of why this didn't work, and we don't actually restrict ; from showing up in an explanation, or in the operator section. I don't know how to make parsing this robust without changes to the format to make it less ambiguous, and if we are going to do that, then let's put it into the history file instead.

Collaborator replied:

I think we should change the format going forward if we are planning on having tools parse it. Maybe something more official, like a tag [NOT ON GPU], that is easily parseable and unique. Obviously we will still have to keep the old logic for anything up until that gets implemented.
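If such a tag were adopted, parsing could collapse to a single unambiguous match, sketched below. The `[NOT ON GPU]` tag and the `operator: reason` line layout are the reviewer's suggestion plus an assumption here, not an implemented format.

```scala
// Hypothetical sketch of parsing a fixed, unambiguous fallback tag.
// Assumed line layout: "... [NOT ON GPU] <operator>: <reason>"
object TaggedFallback {
  private val Tagged = """.*\[NOT ON GPU\]\s+(\S+):\s*(.+)""".r

  // Returns Some((operator, reason)) when the tag is present, None otherwise.
  def parse(line: String): Option[(String, String)] = line match {
    case Tagged(op, reason) => Some((op, reason))
    case _                  => None
  }
}
```

A fixed tag like this sidesteps the ambiguity of free-form ;-separated explanations discussed above, since only the tagged portion of the line needs a stable format.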

Author replied:

Filed an issue in spark-rapids - NVIDIA/spark-rapids#9728

@tgravescs (Collaborator) left a comment:

how large of an eventlog have you tested this with? I'm curious how long it takes in an app that takes a really long time.

@nartal1 (Collaborator, author) commented Nov 15, 2023

> how large of an eventlog have you tested this with? I'm curious how long it takes in an app that takes a really long time.

I tested it with an eventlog of size 734MB and a driver log of 88MB. It took about a minute to complete.

Driver log only: 88M
real	0m1.404s
user	0m3.463s
sys	0m0.297s

eventlog:734M and driverlog: 88M
real	1m17.245s
user	1m7.288s
sys	0m49.898s

nartal1 and others added 3 commits November 14, 2023 22:33
Co-authored-by: Thomas Graves <[email protected]>
Co-authored-by: Thomas Graves <[email protected]>
Labels
core_tools Scope the core module (scala)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add support for driver log as input to profiler to generate unsupported operations detailed report
3 participants