Qualification tool: Add output stats file for Execs(operators) #1225

Merged: 16 commits merged into NVIDIA:dev on Aug 12, 2024

Conversation

@nartal1 (Collaborator) commented Jul 24, 2024

This PR contributes to #1157.

This PR has the following changes:

  1. It creates a new CSV file (qualification_statistics.csv) with qualification statistics at the per-SQL level for all Execs present in rapids_4_spark_qualification_output_execs.csv.

Changes made:

  1. Changed the Python user-tools to read the qualification output files.
  2. Changed qualification.py to create a SparkQualificationStats object, passing in the context. The logic for creating the statistics and generating the output CSV file is in SparkQualificationStats.
  3. The merge_dataframes function in SparkQualificationStats has the code to generate the stats. After reading the CSV files (stages, execs, and unsupported operators) into dataframes, we preprocess the dataframes, e.g. filtering out WholeStageCodegen and exploding the Exec Stages column (see the sketch after this list).
  4. The report is generated per App per SQL ID. Example of calculating the stats:
    For a given Exec (operator), determine all the stages in that SQL where this Exec is present, and sum the StageTaskDurations of those stages to get the StageTaskExecDuration. We can also determine the total StageTaskDuration corresponding to that SQL ID from execs.csv. Note that these stats rely on the stage-to-SQL mapping derived from the execs.csv file. There could be some Execs to which we cannot map stages; those will be missed in the calculation/output. This is consistent with the current CSV output files.
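A minimal sketch of the preprocessing and aggregation described above, assuming pandas and simplified file/column names; the actual merge_dataframes implementation may differ:

```python
import pandas as pd

# Stand-ins for the qualification output CSVs (file and column names assumed).
execs_df = pd.read_csv('rapids_4_spark_qualification_output_execs.csv')
stages_df = pd.read_csv('rapids_4_spark_qualification_output_stages.csv')

# Drop WholeStageCodegen wrappers; their child Execs carry the real operators.
execs_df = execs_df[~execs_df['Exec Name'].str.startswith('WholeStageCodegen')]

# 'Exec Stages' holds a delimited list of stage IDs (delimiter assumed here);
# explode it to one row per stage.
execs_df['Exec Stages'] = execs_df['Exec Stages'].astype(str).str.split(':')
execs_df = execs_df.explode('Exec Stages').rename(columns={'Exec Stages': 'Stage ID'})
execs_df['Stage ID'] = pd.to_numeric(execs_df['Stage ID'], errors='coerce')

# Attach each stage's task duration to its (App, SQL, Stage, Exec) row.
merged = execs_df.merge(stages_df[['App ID', 'Stage ID', 'Stage Task Duration']],
                        on=['App ID', 'Stage ID'], how='left')

# Per-operator stage task time. NOTE: the real code also de-duplicates repeated
# stages per operator (see the review discussion below); omitted here for brevity.
stats = (merged.groupby(['App ID', 'SQL ID', 'Exec Name'])
               .agg(Count=('Exec Name', 'size'),
                    StageTaskExecDuration=('Stage Task Duration', 'sum')))

# Per-SQL total over unique stages, then the percentage column.
total = (merged.drop_duplicates(['App ID', 'SQL ID', 'Stage ID'])
               .groupby(['App ID', 'SQL ID'])['Stage Task Duration']
               .sum().rename('TotalSQLTaskDuration'))
stats = stats.join(total)
stats['% of Total SQL Task Duration'] = (
    100 * stats['StageTaskExecDuration'] / stats['TotalSQLTaskDuration']).round(2)
```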

Output:

  1. Generate the output file from the user-tools cmd:
spark_rapids qualification --platform=onprem  --eventlogs=<EVENTLOGPATH>

Sample output:

App ID,SQL ID,Operator,Count,Stage Task Exec Duration(s),Total SQL Task Duration(s),% of Total SQL Task Duration,Supported
app-20240220043231-0000,1,BroadcastHashJoin,1,577418.50,5278814.00,10.94,True
app-20240220043231-0000,1,ColumnarToRow,5,1334460.475,1200.855,23.665,False
app-20240220043231-0000,5,Scan ExistingRDD,4,35.14,35.296,0.696,False
app-20240220043231-0000,1,Scan parquet,5,1334460.48,5278814.00,25.28,True
app-20240220043231-0000,15,ObjectHashAggregate,5,15.236,4.502,0.089,False
app-20240220043231-0000,15,Scan ExistingRDD,4,14.852,4.072,0.08,False
app-20240220043231-0000,15,filesizehistogramagg,5,15.236,4.502,0.089,False
app-20240220043231-0000,20,Project,2,856774.69,865275.43,99.02,True

Output columns description:

Operator - name of the Exec.
Count - number of times the operator is executed in the SQL. It also depends on whether the Exec is supported or not supported in each of the stages in that SQL.
Stage Task Exec Duration(s) - task duration, in seconds, of the stages in which this Exec appears.
Total SQL Task Duration(s) - total of all task durations for all stages in that SQL ID.
% of Total SQL Task Duration - (Stage Task Exec Duration / Total SQL Task Duration) * 100 (see the quick check below).
Supported - whether the operator is supported or not.
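As a quick sanity check of the percentage column against the first sample row above (the values come from the sample; the variable names are ours):

```python
stage_task_exec_duration = 577_418.50   # seconds, BroadcastHashJoin row above
total_sql_task_duration = 5_278_814.00  # seconds, same row
pct = stage_task_exec_duration / total_sql_task_duration * 100
print(f'{pct:.2f}')  # prints 10.94, matching '% of Total SQL Task Duration'
```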
  2. Generate the output file using the Python cmd (assuming a qual run was completed earlier).

1. With the `qual_output` argument:
    spark_rapids_dev stats --qual_output=/home/test/<QUAL_OUTPUT_DIRECTORY>/rapids_4_spark_qualification_output

2. With the `output_folder` argument, where the results are to be stored:
    spark_rapids stats --qual_output=/home/test/<QUAL_OUTPUT_DIRECTORY>/rapids_4_spark_qualification_output --output_folder=/home/test/output_directory
  

Follow-on

  1. Discuss how to report the operators which do not map to any stages. In unsupported_operators.csv, we mark those stages as -1. One approach: display the totalStageTaskDuration of the SQL ID to which this operator maps, and keep the StageTaskExecDuration and % of Total SQL Duration columns as zero (see the sketch below).
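A minimal pandas sketch of that proposed handling; the dataframe and column names here are assumptions for illustration, not the final design:

```python
import pandas as pd

# Toy inputs shaped like the discussion above.
unsupported_df = pd.DataFrame({
    'App ID': ['app-1', 'app-1'],
    'SQL ID': [1, 1],
    'Stage ID': [5, -1],  # -1 marks an operator that maps to no stage
    'Operator': ['Project', 'Filter'],
})
total_duration_df = pd.DataFrame({
    'App ID': ['app-1'], 'SQL ID': [1], 'TotalSQLTaskDuration': [230],
})

# Keep the unmapped rows, zero out the stage-level columns, and attach the
# per-SQL total so the row still carries some context.
unmapped = unsupported_df[unsupported_df['Stage ID'] == -1].copy()
unmapped['StageTaskExecDuration'] = 0.0
unmapped['% of Total SQL Task Duration'] = 0.0
unmapped = unmapped.merge(total_duration_df, on=['App ID', 'SQL ID'], how='left')
print(unmapped)
```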

@nartal1 added labels on Jul 24, 2024: feature request (New feature or request), user_tools (Scope the wrapper module running CSP, QualX, and reports (python))
@nartal1 self-assigned this on Jul 24, 2024
@parthosa (Collaborator) left a comment

Thanks @nartal1 for this feature.

@amahussein (Collaborator) left a comment

Thanks @nartal1!
Do we create this as a separate tool, similar to what we did for "Predict"?
If we treat it as an independent tool, it can have its own YAML file, and it can be easier to configure and extend without affecting the qualification cmd.

We can also allow this post-phase processing to be enabled/disabled.

It would be nice if we can get initial quick feedback from @viadea to incorporate his input in the tuning of this PR.

@nartal1 (Collaborator, Author) commented Jul 25, 2024

> It would be nice if we can get initial quick feedback from @viadea to incorporate his input in the tuning of this PR.

Discussed with @viadea offline. His feedback is that having it as part of the Qualification tool should be sufficient, so that the stats are produced along with the qual tool results. Asking users to run a CLI just to generate the statistics file (given the qual tool was run) wouldn't be a good experience.
@amahussein - Should we keep it as part of the Qual tool run? Or would it help to create a framework to run this as a separate tool? Wanted to check if you have any preference.

@amahussein (Collaborator) commented

> > It would be nice if we can get initial quick feedback from @viadea to incorporate his input in the tuning of this PR.
>
> Discussed with @viadea offline. His feedback is that having it as part of the Qualification tool should be sufficient, so that the stats are produced along with the qual tool results. Asking users to run a CLI just to generate the statistics file (given the qual tool was run) wouldn't be a good experience. @amahussein - Should we keep it as part of the Qual tool run? Or would it help to create a framework to run this as a separate tool? Wanted to check if you have any preference.

Yes, I meant that it runs as part of the Qualification tool (enabled by default).
But there is a command to allow running the stats report on an existing directory path. This is what we did for "prediction". The latter is part of the qualification processing, but it is also possible to run prediction on existing tools folders. This allows getting reports on older runs and iterating on the stats without having to go through the entire cycle.

One last thing: how do we keep compatibility? For example, the stats step might scan a folder that was generated by an older tools version; in that case, it might crash.
Any thoughts about that?

@nartal1 (Collaborator, Author) commented Jul 29, 2024

Thanks @parthosa for the review. I have addressed all the review comments. PTAL.

@nartal1 requested a review from parthosa on July 29, 2024
@tgravescs (Collaborator) commented

> Impacted Stage duration is the sum of all stage durations for the SQL ID.

Stage durations here seem like they would be very hard to get accurate. I think we should think about just doing task time, and potentially % of total task time per SQL.

@nartal1 (Collaborator, Author) commented Aug 1, 2024

> Stage durations here seem like they would be very hard to get accurate. I think we should think about just doing task time, and potentially % of total task time per SQL.

Looking into whether we can get % of total task time per SQL.

@nartal1 changed the title from "Qualification tool: Add output stats file for unsupported operators" to "Qualification tool: Add output stats file for Execs(operators)" on Aug 6, 2024
@nartal1 (Collaborator, Author) commented Aug 7, 2024

> Stage durations here seem like they would be very hard to get accurate. I think we should think about just doing task time, and potentially % of total task time per SQL.

@tgravescs - Updated the PR to report % of total task time.
@parthosa, @amahussein, @cindyyuanjiang - This includes all the operators for which we can map stage IDs to Execs. This required reading the execs.csv file and some changes while merging the dataframes. PTAL.

@nartal1 requested a review from tgravescs on August 7, 2024
@cindyyuanjiang self-requested a review on August 8, 2024
@tgravescs (Collaborator) commented

Can you update the sample output in the description?

@nartal1 (Collaborator, Author) commented Aug 8, 2024

> Can you update the sample output in the description?

Thanks @tgravescs! I had updated the sample output in a dropdown section earlier. Now it's in the description itself. PTAL.

parthosa previously approved these changes Aug 8, 2024

@parthosa (Collaborator) left a comment

Thanks @nartal1. The final output columns are quite meaningful.

@cindyyuanjiang self-requested a review on August 8, 2024
cindyyuanjiang previously approved these changes Aug 8, 2024

@cindyyuanjiang (Collaborator) left a comment
Thanks @nartal1 for this feature!

@tgravescs (Collaborator) commented

Can you please describe each of these columns?

Stage Task Exec Duration(s),Total SQL Duration(s),% of Total SQL Duration,Supported
577418.50,5278814.00,10.94,True

  • Stage Task Exec Duration - is this the task duration for the stage, in seconds?
  • Total SQL Duration - is this the total of all task durations for all stages in that SQL ID?
  • % of Total SQL Duration - I assume just the percent based on the above two numbers.

I want to make sure the columns are clear to the user. From just reading it, I would have expected the total SQL duration to be the wall-clock time, but from previous discussions that isn't what we were talking about, so I want to make sure, and then maybe come up with a name that is more obvious to the user.

@tgravescs (Collaborator) commented

> > > It would be nice if we can get initial quick feedback from @viadea to incorporate his input in the tuning of this PR.
> >
> > Discussed with @viadea offline. His feedback is that having it as part of the Qualification tool should be sufficient, so that the stats are produced along with the qual tool results. Asking users to run a CLI just to generate the statistics file (given the qual tool was run) wouldn't be a good experience. @amahussein - Should we keep it as part of the Qual tool run? Or would it help to create a framework to run this as a separate tool? Wanted to check if you have any preference.
>
> Yes, I meant that it runs as part of the Qualification tool (enabled by default). But there is a command to allow running the stats report on an existing directory path. This is what we did for "prediction". The latter is part of the qualification processing, but it is also possible to run prediction on existing tools folders. This allows getting reports on older runs and iterating on the stats without having to go through the entire cycle.
>
> One last thing: how do we keep compatibility? For example, the stats step might scan a folder that was generated by an older tools version; in that case, it might crash. Any thoughts about that?

OK, if we want to have a separate CLI for users to run after the fact, I guess I'm OK with it. It adds more things, documentation and testing mainly, and more commands are always a potential source of user confusion. Please make sure we have follow-ups to add docs/tests.

I thought this requirement was mostly from Felix and that we would just get this with the normal qualification runs. Let me talk to Felix and Hao.

(Resolved review thread on user_tools/src/spark_rapids_tools/cmdli/tools_cli.py, outdated)
merged_df = merged_df.merge(total_duration_df, on=['App ID', 'SQL ID'], how='left')

# Mark unique stage task durations
merged_df['Unique StageTaskDuration'] = ~merged_df.duplicated(
A reviewer (Collaborator) commented:
So for each operator we get all stages, and then possibly separate rows if it has both supported and unsupported occurrences, and we just want to make sure those are unique so we aren't double-counting time, correct?

So does this mean that if you have two operators in a stage, one supported and one not supported, the time will be in there twice, correct? Which is fine, I just want to make sure I'm understanding properly.

The author (Collaborator) replied:

Thanks! I updated it to handle this case. Until now, it was only taking into account the same operators within a SQL ID, and the assumption was that an operator would be either supported or unsupported within a stage. But there is a scenario where the same operator can be both supported and unsupported within a stage (due to an underlying Expression not being supported).

Sample output: there are 4 entries of Project for SQL ID=1 (Stage IDs 5 and 6) in execs.csv, and 3 of them are unsupported, as below:
unsupported_operators_df

    App ID  SQL ID  Stage ID       Operator
0       1       3         7  HashAggregate
1       1       1         6        Project
2       2       1        -1        Project
3       1       1        -1         Filter
4       1       1         5        Project
5       1       1         5        Project

stages_df:
   App ID  Stage ID  StageTaskDuration
0       1         5                 50
2       1         4                 20
3       1         3                100
4       1         6                 60
5       1         7                100

Final dataframe output (before renaming columns):

      App ID  SQL ID     Operator      Count  StageTaskDuration  TotalSQLDuration  % of Total SQL Duration  Supported
0       1       1         Filter         2                70                230                30.434783       True
1       1       1        Project         3                110               230                47.826087      False
2       1       1        Project         1                 50               230                21.739130       True
3       1       1           Sort         3                170               230                73.913043       True
4       1       3  HashAggregate         1                100               100               100.000000      False

The reviewer (Collaborator) commented:
So I assume the SQL ID 1 example above has some Execs, likely the Projects, that are unsupported in the same stages as another Exec that is supported, correct? Because if you add up the four of those, 70 + 110 + 50 + 170, it's more than the 230 total.

The author (Collaborator) replied:

Yes, that's correct. In the above example, for SQL ID=1, there are 4 Projects in total:
StageID=5: Unsupported=2, Supported=1, StageTaskDuration=50
StageID=6: Unsupported=1, StageTaskDuration=60

So we have 2 rows in the output for Project:
StageTaskDuration=110 (50 + 60) for Unsupported
StageTaskDuration=50 for Supported
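A minimal pandas sketch of the de-duplication being described, built from the toy numbers above; the column names are simplified assumptions, and the actual merge_dataframes code may differ:

```python
import pandas as pd

# One row per (App, SQL, Stage, Operator, Supported) occurrence, as in the example.
merged_df = pd.DataFrame({
    'App ID':            [1, 1, 1, 1],
    'SQL ID':            [1, 1, 1, 1],
    'Stage ID':          [5, 5, 5, 6],
    'Operator':          ['Project'] * 4,
    'Supported':         [False, False, True, False],
    'StageTaskDuration': [50, 50, 50, 60],
})

# A stage's task duration should count once per (operator, supported) group,
# even if the operator occurs several times within that stage.
merged_df['Unique StageTaskDuration'] = ~merged_df.duplicated(
    subset=['App ID', 'SQL ID', 'Stage ID', 'Operator', 'Supported'])

keys = ['App ID', 'SQL ID', 'Operator', 'Supported']
counts = merged_df.groupby(keys).size()  # feeds the Count column: False -> 3, True -> 1
unique_rows = merged_df[merged_df['Unique StageTaskDuration']]
durations = unique_rows.groupby(keys)['StageTaskDuration'].sum()
print(counts)
print(durations)  # Project/False -> 110 (50 + 60), Project/True -> 50, as in the reply above
```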

@amahussein (Collaborator) commented

> OK, if we want to have a separate CLI for users to run after the fact, I guess I'm OK with it. It adds more things, documentation and testing mainly, and more commands are always a potential source of user confusion. Please make sure we have follow-ups to add docs/tests.
>
> I thought this requirement was mostly from Felix and that we would just get this with the normal qualification runs. Let me talk to Felix and Hao.

That's a good point @tgravescs.
In fact, I agree that this is better as an internal CLI, rapids_tools_dev (it should be added to dev_cli.py). The purpose of the new CLI is for the internal team to be able to run the report on existing output folders.
It is unlikely that users will need to run that CLI.

@tgravescs (Collaborator) commented

Yeah, I'm good with a dev command for it. Talked to Hao and Felix, and they agreed the public/external command is not needed.

@nartal1 merged commit 32239bf into NVIDIA:dev on Aug 12, 2024
14 checks passed