
Qualification tool: Add penalty for row conversions #471

Merged: 25 commits merged into NVIDIA:dev on Oct 16, 2023

Conversation

@nartal1 (Collaborator) commented Aug 2, 2023

This fixes #385

Until now, the qualification tool didn't account for the time taken by CPU fallbacks (ColumnarToRow conversions) caused by Execs not supported on the GPU, nor by RowToColumnar conversions. This PR adds these to the total durations so that the qualification tool's speedup estimate is closer to the actual speedup for Spark jobs.

There can be multiple transitions within a stage, i.e. some Execs in a stage may be supported on the GPU while others are not. However, we cannot get per-Exec durations from the Spark metrics, since there is no mapping between tasks and Execs. We do know the input size of each stage, whether it is read from an external source or produced by a shuffle. So the current implementation uses each stage's total input size and total number of transitions to estimate the total time taken by the transitions.
We still have to run benchmarks to get an accurate transfer speed between CPU and GPU and vice versa. For now, the transfer rate is based on the speedups observed on a couple of event logs.
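
For illustration, here is a minimal sketch of the per-stage estimate described above. The constant values and helper names are assumptions for the example, not the exact values or identifiers used by the tool:

```scala
object TransitionPenaltySketch {
  // Assumed effective CPU<->GPU transfer rate in bytes/second (PCIe Gen3 ballpark);
  // the real value is tuned against observed event logs.
  val CPU_GPU_TRANSFER_RATE: Double = 1.0e9
  val SECONDS_TO_MILLISECONDS: Long = 1000L

  // Estimate the time (in ms, to match Spark metrics) spent on row<->columnar
  // conversions for a stage, from its total input bytes and transition count.
  def transitionPenaltyMs(totalBytesRead: Long, numTransitions: Int): Long = {
    if (totalBytesRead <= 0 || numTransitions <= 0) 0L
    else ((totalBytesRead / CPU_GPU_TRANSFER_RATE) *
      SECONDS_TO_MILLISECONDS * numTransitions).toLong
  }
}
```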

@nartal1 nartal1 added the core_tools Scope the core module (scala) label Aug 2, 2023
@nartal1 nartal1 self-assigned this Aug 2, 2023
@nartal1 nartal1 marked this pull request as draft August 2, 2023 00:06
@nartal1 nartal1 marked this pull request as ready for review August 2, 2023 21:21
@tgravescs (Collaborator) left a comment

Is there any other information in the execs, like the number of rows, in the examples we are trying to hit?

I would definitely like to see what the output here is on various other workloads. I guess as long as we are more conservative it's ok. I'm worried about the cases where we read a ton of data but then filter it down very quickly; this estimate may be very off there.

val totalBytesRead =
  stageIdToTaskEndSum.get(stageId).map(_.totalbytesRead).getOrElse(0L).toDouble
if (totalBytesRead > 0) {
  val fallback_duration = (totalBytesRead / QualificationAppInfo.CPU_GPU_TRANSFER_RATE) *
    QualificationAppInfo.SECONDS_TO_MILLISECONDS * gpuCpuTransitions
Collaborator:

There is a toMillis (and others) in java.util.concurrent.TimeUnit.

Collaborator (Author):

done
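
For reference, a small sketch of the TimeUnit-based conversion the reviewer is pointing at; the variable names are illustrative:

```scala
import java.util.concurrent.TimeUnit

// Convert an estimated duration from seconds to milliseconds without a
// hand-rolled multiplier constant.
val estimatedSeconds: Long = 3L
val estimatedMillis: Long = TimeUnit.SECONDS.toMillis(estimatedSeconds) // 3000
```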

// Assuming it's a PCI-E Gen3, but also assuming that some of the result could be
// spilled to disk.
// Duration in Spark metrics is in millisecond, so multiply this by 1000 to make
// it consistent
Collaborator:

Expand the comment as it's kind of left hanging: make it consistent between what two things?

Collaborator (Author):

Updated the comment.

val topLevelExecs = execs.filterNot(x => x.exec.startsWith("WholeStage"))
val childrenExecs = execs.flatMap(_.children).flatten
val allExecs = topLevelExecs ++ childrenExecs
val transitions = allExecs.zip(allExecs.drop(1)).count {
Collaborator:

Add a comment about what this is doing. This definitely relies on the order, and I'm not sure we are great about making sure that is guaranteed. If it's required, we need to make sure it's documented that it has to be in the right order. It would also be nice to have a test that the order is correct by the time it gets here, so someone doesn't break it. I was originally thinking this would be in the plan parser, but if this is easier I'm ok with it as long as it's not brittle.

nartal1 (Collaborator, Author) commented Aug 8, 2023

Added a comment. You are right that the order needs to be preserved. Will add a test.

Collaborator (Author):

The order is not preserved in all cases. Need to fix this.

val childrenExecs = execs.flatMap(_.children).flatten
val allExecs = topLevelExecs ++ childrenExecs
val transitions = allExecs.zip(allExecs.drop(1)).count {
case (exec1, exec2) =>
Collaborator:

Rename to something like currExec and nextExec; it might help readability.

Collaborator (Author):

done
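
For context, a self-contained sketch of the adjacent-pair transition counting under discussion, with the renamed variables. ExecInfo and its isSupported flag are stand-ins here; the real code operates on the plan parser's exec results and relies on them being in plan order:

```scala
case class ExecInfo(exec: String, isSupported: Boolean)

// Count CPU<->GPU transitions by pairing each exec with its successor and
// counting every place where GPU support flips. Only meaningful if allExecs
// is in plan order.
def countTransitions(allExecs: Seq[ExecInfo]): Int =
  allExecs.zip(allExecs.drop(1)).count {
    case (currExec, nextExec) => currExec.isSupported != nextExec.isSupported
  }

// Example: Scan (GPU) -> UDF (CPU) -> Project (GPU) has two transitions.
val execsInPlanOrder = Seq(
  ExecInfo("Scan parquet", isSupported = true),
  ExecInfo("CustomUDFExec", isSupported = false),
  ExecInfo("Project", isSupported = true))
assert(countTransitions(execsInPlanOrder) == 2)
```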

  StageQualSummaryInfo(stageId, allSpeedupFactorAvg, stageTaskTime,
-   eachStageUnsupported, estimated)
+   eachStageUnsupported + transitionsTime, estimated)
Collaborator:

It might be nice to report the number of transitions we expect in the qual tool stage output; the idea behind that output was to be able to debug or figure out why we came up with a certain recommendation number.

Collaborator (Author):

Thanks for the suggestion. Added a column for the number of transitions in the qual tool stage output.
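
Roughly, the per-stage summary gains one more field so the recommendation can be traced back to how many conversions were assumed; the case class below is only an illustrative shape, not the PR's exact definition:

```scala
case class StageOutputRowSketch(
    stageId: Int,
    averageSpeedupFactor: Double,
    stageTaskTime: Long,            // total task duration for the stage (ms)
    unsupportedTaskDuration: Long,  // includes the estimated transitions time (ms)
    numTransitions: Int)            // new column: CPU<->GPU transitions in the stage
```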

@@ -241,13 +262,74 @@ class QualificationAppInfo(
stages.map { stageId =>
val stageTaskTime = stageIdToTaskEndSum.get(stageId)
.map(_.totalTaskDuration).getOrElse(0L)
val numTransitions = stageIdToGpuCpuTransitions.getOrElse(stageId, 0)
Collaborator:

If we have a flag that forces numTransitions to be 0, then theoretically we can disable the fall-back penalty, right?

Collaborator (Author):

Updated the PR to add a config for transitions. The default is true; we can disable the penalty by setting the config to false. Please let me know if it's fine.
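
A rough sketch of how such a toggle can zero out the penalty; the flag name and the sample map contents are made up for the example:

```scala
// Hypothetical flag: when false, no transitions are counted, so the
// row-conversion penalty drops out of the estimate entirely.
val penalizeTransitions: Boolean = true

val stageIdToGpuCpuTransitions: Map[Int, Int] = Map(1 -> 3, 2 -> 0) // sample data
val stageId = 1

val numTransitions =
  if (penalizeTransitions) stageIdToGpuCpuTransitions.getOrElse(stageId, 0)
  else 0
```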

@amahussein (Collaborator):

Needs an up-merge to get rid of the python failures

amahussein previously approved these changes Aug 23, 2023

@amahussein (Collaborator) left a comment

Thanks @nartal1 !

@@ -174,6 +179,23 @@ class QualificationAppInfo(
res
}

private def calculateNoExecsStageDurations(all: Seq[StageQualSummaryInfo]): Long = {
Collaborator:

Nit: rename to calculateExecsNoStageDurations. Or is this actually durations due to transitions? Then it should have something like that in the name.

Collaborator (Author):

Removed this method. Filed a follow-on issue to add penalties for Execs not associated with any stages: #514

case false => stageIdToGpuCpuTransitions.getOrElse(stageId, 0)
case true => 0
}
// val numTransitions = stageIdToGpuCpuTransitions.getOrElse(stageId, 0)
Collaborator:

remove commented out line

Collaborator (Author):

done

// Update totaltaskduration of stageIdToTaskEndSum to include transitions time
val stageIdToTasksMetrics = stageIdToTaskEndSum.get(stageId).orElse(None)
if (stageIdToTasksMetrics.isDefined) {
stageIdToTasksMetrics.get.totalTaskDuration += transitionsTime
Collaborator:

I'm confused by this. Why are we changing the task durations here? This has traditionally been the real task durations, and then we add/remove things later. Is this because supported + unsupported is now longer due to the transition time? It seems odd to change it in this data structure, which holds the real values from the file.

Collaborator (Author):

This is done because we are adding transitionsTime to unsupportedTaskDuration, i.e. unsupportedTaskDuration = eachStageUnsupported + transitionsTime. So the totalTaskDuration should also include transitionsTime, otherwise we would end up in a case where unsupportedTaskDuration > totalTaskDuration (which would be incorrect).

nartal1 (Collaborator, Author) commented Aug 28, 2023

Updated this code. Now we consider transitionsTime in the unsupported durations only. stageTaskTime is the totalTaskDuration from the event log. transitionTime is returned from StageQualSummaryInfo so that it can be used in the calculation of calculateNonSQLTaskDataframeDuration.
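
A rough sketch of how the final bookkeeping described here fits together; the field names and numbers are illustrative, not the PR's exact ones:

```scala
// The transition penalty is folded into the unsupported duration, while the
// stage task time stays the raw value from the event log. The transition time
// is also carried separately for downstream calculations.
case class StageSummarySketch(
    stageId: Int,
    stageTaskTime: Long,       // raw total task duration from the event log (ms)
    unsupportedTaskDur: Long,  // includes the estimated transitions time (ms)
    transitionTime: Long)      // estimated row<->columnar conversion time (ms)

val eachStageUnsupported = 6000L
val transitionsTime = 1500L
val summary = StageSummarySketch(
  stageId = 3,
  stageTaskTime = 10000L,
  unsupportedTaskDur = eachStageUnsupported + transitionsTime,
  transitionTime = transitionsTime)
```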

@@ -161,11 +166,11 @@ class QualificationAppInfo(
}

private def calculateSQLSupportedTaskDuration(all: Seq[StageQualSummaryInfo]): Long = {
-    all.map(s => s.stageTaskTime - s.unsupportedTaskDur).sum
+    all.map(s => s.stageTaskTime - s.unsupportedTaskDur).sum - calculateNoExecsStageDurations(all)
Collaborator:

So I'm a bit unclear how this works with the job overhead we add later and/or the mapping we already try to do for execs without stages.

Is this counting it twice?

Collaborator (Author):

Removed this.

private def calculateNoExecsStageDurations(all: Seq[StageQualSummaryInfo]): Long = {
// If there are Execs not associated with any stage, then some of the Execs may not be
// supported on GPU. We need to estimate the duration of these Execs and add it to the
// unsupportedTaskDur. We estimate the duration by taking the average of the unsupportedTaskDur
Collaborator:

I'm not sure I follow this estimation. We are trying to give some penalty for execs that have transitions but don't map to a stage (i.e. we don't have a duration), correct?

I'm wondering if we are already calculating this with either the stages with no execs or the job overhead time.

@nartal1 (Collaborator, Author) commented Sep 6, 2023

Marking this as a draft to do more tests and get a better bandwidth number.

@nartal1 nartal1 marked this pull request as draft September 6, 2023 19:08
@nartal1 nartal1 marked this pull request as ready for review October 9, 2023 18:24
@nartal1 nartal1 requested a review from tgravescs October 10, 2023 02:03
@nartal1 (Collaborator, Author) commented Oct 16, 2023

Thanks for the review @tgravescs and @mattahrens! Merging this.

@nartal1 nartal1 merged commit 89361b1 into NVIDIA:dev Oct 16, 2023
8 checks passed
Labels: core_tools Scope the core module (scala)

Linked issue: [BUG] Qualification tool recommends jobs with many expensive row conversions