Report all operators in the output file #1444

nartal1 · 2024-12-03T21:58:10Z

This fixes #1325. This is a follow-on PR to capture the expressions and save them to the output file.

This PR supersedes #1431. Thanks @amahussein for resolving and fixing some of the issues in the previous PR.

In this PR, we print all the operators per app and per sqlID in a new file. This makes it possible to get the count of operators in an application, covering both supported and unsupported operators.

Sample output:

App ID,SQL ID,Operator Type,Operator Name,Count,Supported,Stages
"app-1",0,"Exec","Execute AddJarCommand",1,false,""
"app-1",1,"Exec","Execute AddJarCommand",1,false,""
"app-1",2,"Exec","Execute AddJarCommand",1,false,""
"app-1",3,"Exec","HashAggregate",2,true,"2:0"
"app-1",3,"Expr","count",2,true,"2:0"
"app-1",3,"Exec","AdaptiveSparkPlan",1,false,"2"
"app-1",3,"Expr","EqualTo",1,true,"0"
"app-1",3,"Exec","Exchange",1,true,"0"
"app-1",3,"Exec","Filter",1,true,"0"
"app-1",3,"Exec","Project",1,true,"0"
"app-1",3,"Exec","Scan jdbc",1,false,"0"
"app-1",4,"Exec","CollectLimit",1,false,"3"
"app-1",26,"Exec","Project",2,true,"26"
"app-1",26,"Expr","explode",2,true,"26"
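
As a rough illustration of how the new file could be consumed, the sketch below sums the Count column per operator type and support status, using a few rows copied from the sample above. It assumes only the column names shown in the sample header; the helper name `count_by_support` is hypothetical and is not part of the tools.

```python
import csv
import io
from collections import Counter

# A few rows copied from the sample output above.
SAMPLE = '''App ID,SQL ID,Operator Type,Operator Name,Count,Supported,Stages
"app-1",0,"Exec","Execute AddJarCommand",1,false,""
"app-1",3,"Exec","HashAggregate",2,true,"2:0"
"app-1",3,"Expr","count",2,true,"2:0"
"app-1",3,"Exec","AdaptiveSparkPlan",1,false,"2"
"app-1",3,"Exec","Scan jdbc",1,false,"0"
'''


def count_by_support(csv_text):
    """Sum the Count column per (Operator Type, Supported) pair."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Supported is written as the literal text "true" or "false".
        key = (row["Operator Type"], row["Supported"] == "true")
        totals[key] += int(row["Count"])
    return totals


totals = count_by_support(SAMPLE)
```

Here `totals` maps pairs such as `("Exec", False)` to the summed count, which gives a quick view of how many unsupported Execs an application contains.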

Some of the changes in this PR:

1. Introduced the OperatorRef class, which holds information about both Execs and Expressions.
2. Updated the ExecInfo object to store references.
3. Added separate maps to capture supported and unsupported operators.
4. Used a regex with replaceFirst on the node name in the remaining parser classes.
5. FileSourceScanExec:
   a. It was using nodeName as the execName, which made the node look like Scan JDBCRelation()[hfsdhfjhkhf. After the fix it is Scan jdbc.
   b. If the read format is unknown, we put node.desc in its place to help us understand why the read format could not be extracted.
6. BatchScan:
   a. It was not setting the correct OpType: it used OpType.Exec instead of OpType.ReadExec.
   b. Applied the same naming logic as in FileSourceScanExec.
7. WholeStageCodeGen:
   a. It was setting expressions/unsupportedExpressions to the union of its children's values. Those values are now empty because they are already reported by the children.
   b. Set the execName to WholeStageCodeGen or PhotonResultStage instead of WholeStageCodeGen ({nodeID}).
   c. The expression is set to NodeName (nodeID).
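
The node-name cleanup described above can be sketched as follows. This is an illustration only: the PR does not show the regex it uses, so the pattern and the helper name `strip_node_id` are assumptions. The sketch removes one trailing parenthesized numeric ID, similar in spirit to a replaceFirst call.

```python
import re

# Assumed pattern: optional whitespace, then "(digits)" at the end of the name.
NODE_ID_SUFFIX = re.compile(r"\s*\(\d+\)\s*$")


def strip_node_id(node_name):
    """Drop a trailing "(nodeID)" suffix, replacing at most one match."""
    return NODE_ID_SUFFIX.sub("", node_name, count=1)


cleaned = strip_node_id("WholeStageCodegen (12)")
unchanged = strip_node_id("Project")
```

Names without a numeric suffix pass through untouched, so the helper is safe to apply to every node.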

This pull request includes several updates to improve the parsing and handling of execution nodes in the RAPIDS Accelerator for Apache Spark. The changes focus on refining the parsing logic, handling unsupported expressions, and enhancing the formatting and readability of the code.

Improvements to Execution Node Parsing:

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/BatchScanExecParser.scala: Updated the BatchScanExecParser to use a more concise node name and improved the logic for setting execution expressions based on the read format.
core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/FileSourceScanExecParser.scala: Enhanced the FileSourceScanExecParser to handle node names more accurately and set execution expressions based on the read format, improving troubleshooting capabilities.

Handling Unsupported Expressions:

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/ExecParser.scala: Modified the ExecParser trait to use UnsupportedExprOpRef instead of UnsupportedExpr for unsupported expression reasons.
core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/GenericExecParser.scala: Updated the GenericExecParser to utilize UnsupportedExprOpRef and include expressions in the createExecInfo method.

Code Formatting and Readability:

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/SQLPlanParser.scala: Refactored the ExecInfo case class to use OpRef and UnsupportedExprOpRef, and added methods to improve readability and consistency.

These changes collectively enhance the robustness and clarity of the code, making it easier to maintain and extend in the future.

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Signed-off-by: Niranjan Artal <[email protected]>

amahussein

Thanks @nartal1 !
Let's get @leewyang feedback on the changes in the python file.

user_tools/src/spark_rapids_tools/tools/qualx/preprocess.py

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/ops/OpRef.scala

amahussein

Can we add the fix to the logic that loops over the graph nodes to build DSV1, which we discussed offline?

The filters in the code below should be swapped.

spark-rapids-tools/core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala

Lines 381 to 385 in 091af08

    
      val scanNode = allNodes.filter(node => {
        // Get ReadSchema of each Node and sanitize it for comparison
        val trimmedNode = AppBase.trimSchema(ReadParser.parseReadNode(node).schema)
        readSchema.contains(trimmedNode)
      }).filter(ReadParser.isScanNode(_))

It should become:

      val scanNode = allNodes.filter(ReadParser.isScanNode(_)).filter(node => {
        // Get ReadSchema of each Node and sanitize it for comparison
        val trimmedNode = AppBase.trimSchema(ReadParser.parseReadNode(node).schema)
        readSchema.contains(trimmedNode)
      })
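
One plausible motivation for the swap, sketched in Python with hypothetical stand-ins for the Scala helpers: when both predicates are side-effect free, the selected nodes are the same either way, but checking for scan nodes first means only scan nodes ever reach the schema-parsing step. The `Node` class, `is_scan_node`, and the sample data below are assumptions made for illustration and are not taken from AppBase.scala.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    schema: str


def is_scan_node(node):
    # Hypothetical stand-in for ReadParser.isScanNode.
    return node.name.startswith("Scan")


def select_scan_nodes(nodes, read_schema, scan_check_first):
    parsed = []  # records every node handed to the schema-parsing step

    def schema_matches(node):
        # Stand-in for parsing the read node and trimming its schema.
        parsed.append(node.name)
        return node.schema.strip() in read_schema

    if scan_check_first:
        result = [n for n in nodes if is_scan_node(n) and schema_matches(n)]
    else:
        result = [n for n in nodes if schema_matches(n) and is_scan_node(n)]
    return result, parsed


nodes = [
    Node("Scan parquet", " a:int "),
    Node("Project", "a:int"),
    Node("Filter", ""),
]
old_result, old_parsed = select_scan_nodes(nodes, "a:int,b:string", False)
new_result, new_parsed = select_scan_nodes(nodes, "a:int,b:string", True)
```

In this toy run both orderings select the same single scan node, while the schema-first ordering parses all three nodes and the scan-first ordering parses only one.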

amahussein

Thanks @nartal1 !
I approve the Scala-side changes since I will not be available in a couple of hours.
We should merge the PR once Lee approves the Python-side changes.
Thanks!

core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala

amahussein and others added 4 commits December 3, 2024 13:39

first iterations all the UTs pass

2bb2643

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

running end-to-end

344bfa9

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

cleaned-up exec parsers

d862838

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

update documentation and fix

1063058

Signed-off-by: Niranjan Artal <[email protected]>

nartal1 added the feature request, core_tools, and api_change labels Dec 3, 2024

nartal1 requested a review from amahussein December 3, 2024 21:58

nartal1 self-assigned this Dec 3, 2024

nartal1 mentioned this pull request Dec 3, 2024

Report all operators in the output file #1431

Closed

nartal1 added 3 commits December 3, 2024 15:33

qualx related changes

1d1f0fb

update qualx changes

1b3f5dd

Fix for Scan OneRowRelation

eb919da

Signed-off-by: Niranjan Artal <[email protected]>

amahussein reviewed Dec 4, 2024

user_tools/src/spark_rapids_tools/tools/qualx/preprocess.py

user_tools/src/spark_rapids_tools/tools/qualx/preprocess.py

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/ops/OpRef.scala

amahussein requested a review from leewyang December 4, 2024 14:53

amahussein requested changes Dec 4, 2024

addressed review comments

a2cc670

amahussein approved these changes Dec 4, 2024

core/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala

nartal1 merged commit 993bc8f into NVIDIA:dev Dec 4, 2024
15 checks passed

amahussein mentioned this pull request Dec 5, 2024

[BUG] Count expressions within execs in SqlPlanParser #1447

Closed

leewyang mentioned this pull request Dec 5, 2024

Update models for latest tools code #1448

Merged

nartal1 mentioned this pull request Dec 6, 2024

[FEA] Qualification tool: Add operators stats output csv file #1157

Closed
