Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report all operators in the output file #1444

Merged
merged 8 commits into from
Dec 4, 2024

Conversation

nartal1
Copy link
Collaborator

@nartal1 nartal1 commented Dec 3, 2024

This fixes #1325 . This is a follow-on PR to capture the expressions and save it to output file.

This PR supercedes #1431. Thanks @amahussein for resolving and fixing some of the issues in the previous PR.

In this PR, we print all the operators per app and per sqlID in a new file. This helps to get the count of operators in an application. It has count of both supported and unsupported operators.

Sample output:

App ID,SQL ID,Operator Type,Operator Name,Count,Supported,Stages
"app-1",0,"Exec","Execute AddJarCommand",1,false,""
"app-1",1,"Exec","Execute AddJarCommand",1,false,""
"app-1",2,"Exec","Execute AddJarCommand",1,false,""
"app-1",3,"Exec","HashAggregate",2,true,"2:0"
"app-1",3,"Expr","count",2,true,"2:0"
"app-1",3,"Exec","AdaptiveSparkPlan",1,false,"2"
"app-1",3,"Expr","EqualTo",1,true,"0"
"app-1",3,"Exec","Exchange",1,true,"0"
"app-1",3,"Exec","Filter",1,true,"0"
"app-1",3,"Exec","Project",1,true,"0"
"app-1",3,"Exec","Scan jdbc",1,false,"0"
"app-1",4,"Exec","CollectLimit",1,false,"3"
"app-1",26,"Exec","Project",2,true,"26"
"app-1",26,"Expr","explode",2,true,"26"

Some of the changes in this PR:

  1. Introduced OperatorRef class which includes information about both Execs and Expressions.
  2. Updated object ExecInfo to store references.
  3. Separate maps to capture supported and unsupported operators.
  4. Used regex to replaceFirst node name in remaining parser classes.
  5. FilesourcescanExec
    a. It was using nodeName as an execNAme which causes the node to look like Scan JDBCRelation()[hfsdhfjhkhf -> after the fix it is Scan jdbc
    b. If the readformat is unknown, we will put the node.desc to help us understand why we cannot extract the readformat
  6. BatchScan
    a. It was not setting correct OpType. It was OpType.Exec instead of OpType.ReadExec.
    b. Applied the same naming logic in FileSourceScanExec.
  7. WholeStageCodeGen:
    a. It was setting expressions/unsupportedExpressions as the union of its children. Now those values are empty because they are part of the children.
    b. set the execName to be WholeStageCodeGen or PhotonResultStage instead of WholeStageCodeGen ({nodeID})
    c. The expression will be set to NodeName (nodeID)

This pull request includes several updates to improve the parsing and handling of execution nodes in the RAPIDS Accelerator for Apache Spark. The changes focus on refining the parsing logic, handling unsupported expressions, and enhancing the formatting and readability of the code.

Improvements to Execution Node Parsing:

Handling Unsupported Expressions:

Code Formatting and Readability:

These changes collectively enhance the robustness and clarity of the code, making it easier to maintain and extend in the future.

amahussein and others added 4 commits December 3, 2024 13:39
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Signed-off-by: Niranjan Artal <[email protected]>
@nartal1 nartal1 added feature request New feature or request core_tools Scope the core module (scala) affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) labels Dec 3, 2024
@nartal1 nartal1 requested a review from amahussein December 3, 2024 21:58
@nartal1 nartal1 self-assigned this Dec 3, 2024
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nartal1 !
Let's get @leewyang feedback on the changes in the python file.

@amahussein amahussein requested a review from leewyang December 4, 2024 14:53
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the fix the logic in looping on the graph nodes to build DSV1 that we discussed offline?

The filters in the code below should be swapped.

val scanNode = allNodes.filter(node => {
// Get ReadSchema of each Node and sanitize it for comparison
val trimmedNode = AppBase.trimSchema(ReadParser.parseReadNode(node).schema)
readSchema.contains(trimmedNode)
}).filter(ReadParser.isScanNode(_))

It should become:

      val scanNode = allNodes.filter(ReadParser.isScanNode(_)).filter(node => {
        // Get ReadSchema of each Node and sanitize it for comparison
        val trimmedNode = AppBase.trimSchema(ReadParser.parseReadNode(node).schema)
        readSchema.contains(trimmedNode)
      })

Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nartal1 !
I approve the scala side changes since I am not available in a couple of hours.
We should merge the PR once Lee approves the python side changes.
Thanks!

@nartal1 nartal1 merged commit 993bc8f into NVIDIA:dev Dec 4, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) core_tools Scope the core module (scala) feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Qualification tool: Report supported expressions in the output file
3 participants