Improve implementation of finding median in StatisticsMetrics #1474

amahussein · 2024-12-19T18:57:16Z

Adds in-place median finding to improve the performance of the metric aggregates.
We used to sort a sequence to create StatisticsMetrics, which turned out to be very expensive for large eventlogs.

It was found that there is a bottleneck in generateStageLevelAccums and generateSQLAccums due to the cost of sorting and using sequences of Long on metrics aggregates.

Impact on performance:

The PR improves the Qualification runtime. For example:

generateSQLAccums is down from 21,550 ms to 3,480 ms (~84% improvement)
generateStageLevelAccums is down from 115,580 ms to 61,611 ms (~47% improvement)

Main optimization

Median finding: we used to convert a map to a sequence, sort it, pick the median, max, and min, and then call seq.sum.

This turned out to be expensive for large eventlogs.
This PR adds median finding that runs in linear time on average, compared to O(n log n) for sorting.

This pull request includes several changes to the AppSQLPlanAnalyzer and AppSparkMetricsAnalyzer classes to improve performance and simplify the codebase by using the breakOut method for collection transformations and introducing new methods in StatisticsMetrics. Additionally, it removes an unused object and refactors the AccumInfo class.
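The idea behind average linear-time median finding can be sketched with quickselect. The following is a minimal illustration and not the code in this PR: the names `MedianSketch`, `Stats`, `select`, and `stats` are hypothetical, the random pivot and three-way partition are one possible choice, and treating index `n / 2` as the median of an even-length array is an assumption.

```scala
import scala.util.Random

object MedianSketch {
  /** Hypothetical holder mirroring the CSV columns min, median, max, total. */
  final case class Stats(min: Long, median: Long, max: Long, total: Long)

  private def swap(arr: Array[Long], a: Int, b: Int): Unit = {
    val t = arr(a); arr(a) = arr(b); arr(b) = t
  }

  /** Returns the k-th smallest element (0-based). Reorders arr in place. */
  def select(arr: Array[Long], k: Int, rnd: Random): Long = {
    require(k >= 0 && k < arr.length, "k out of range")
    var lo = 0
    var hi = arr.length - 1
    while (lo < hi) {
      val pivot = arr(lo + rnd.nextInt(hi - lo + 1))
      // Three-way partition: [lo, lt) < pivot, [lt, gt] == pivot, (gt, hi] > pivot.
      // Grouping equal values keeps arrays with many duplicates from degrading.
      var lt = lo
      var i = lo
      var gt = hi
      while (i <= gt) {
        if (arr(i) < pivot) { swap(arr, lt, i); lt += 1; i += 1 }
        else if (arr(i) > pivot) { swap(arr, i, gt); gt -= 1 }
        else i += 1
      }
      if (k < lt) hi = lt - 1        // answer is among the smaller values
      else if (k > gt) lo = gt + 1   // answer is among the larger values
      else { lo = k; hi = k }        // arr(k) equals the pivot: done
    }
    arr(k)
  }

  /** One pass for min, max, and total, then quickselect for the median. */
  def stats(values: Array[Long]): Option[Stats] = {
    if (values.isEmpty) None
    else {
      var min = Long.MaxValue
      var max = Long.MinValue
      var total = 0L
      var i = 0
      while (i < values.length) {
        val v = values(i)
        if (v < min) min = v
        if (v > max) max = v
        total += v
        i += 1
      }
      // Assumption: the element at index n / 2 is taken as the median.
      val median = select(values, values.length / 2, new Random(42))
      Some(Stats(min, median, max, total))
    }
  }
}
```

For a single update of value 2, `MedianSketch.stats(Array(2L))` yields `Some(Stats(2, 2, 2, 2))`. Because the input array is reordered in place, callers that need the original order would have to pass a copy.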

Impact on output

A diff of the output folder shows that sql_plan_metrics_for_application.csv is different. The new output looks more correct.

In output generated by the dev branch:

appIndex,sqlID,nodeID,nodeName,accumulatorId,name,min,median,max,total,metricType,stageIds
1,236,0,"Execute InsertIntoHadoopFsRelationCommand",0,"number of written files",0,0,0,2,"sum","2"

The above seems incorrect because the total is non-zero while min, max and median are zeros.

Vs the new output

appIndex,sqlID,nodeID,nodeName,accumulatorId,name,min,median,max,total,metricType,stageIds
1,236,0,"Execute InsertIntoHadoopFsRelationCommand",0,"number of written files",2,2,2,2,"sum","2"

Improvements to Collection Transformations:

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala: Replaced intermediate collection variables with direct transformations using breakOut to improve performance.
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala: Applied breakOut to streamline collection processing and reduce memory overhead.

Introduction of New Methods in `StatisticsMetrics`:

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/StatisticsMetrics.scala: Added createFromArr and createOptionalFromArr methods to compute statistics directly from arrays, improving code reuse and readability.

Refactoring and Cleanup:

core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumInfo.scala: Refactored the calculateAccStats method to use the new createFromArr method from StatisticsMetrics, simplifying the code.
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala: Removed the unused AppSparkMetricsAnalyzer object and its getStatistics method.

Fixes NVIDIA#1461

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

parthosa

Thanks @amahussein for these improvements. Minor question.

parthosa · 2024-12-19T20:41:23Z

core/src/main/scala/org/apache/spark/sql/rapids/tool/util/InPlaceMedianArrView.scala

+  }
+
+  override def toString = {
+    arr mkString ("ArraySize(", ", ", ")")


Q: Should this be bounded by until - from?

parthosa · 2024-12-19T20:52:07Z

core/src/main/scala/org/apache/spark/sql/rapids/tool/util/InPlaceMedianArrView.scala

+   * @return the median of the array.
+   */
+  def findMedianInPlace(
+    arr: Array[Long])(implicit choosePivot: InPlaceMedianArrView => Long): Long = {


This is an interesting use of the implicit mechanism.
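For readers unfamiliar with the pattern in the signature above, here is a minimal sketch of how a pivot strategy can be supplied through an implicit second parameter list. Everything in it is hypothetical: `ArrView` is a simplified stand-in for `InPlaceMedianArrView`, and `firstElement`, `middleElement`, `pivotFor`, and `demo` are illustrative names that are not taken from the PR.

```scala
object ImplicitPivotSketch {
  // Hypothetical stand-in for InPlaceMedianArrView: a window [from, until) over an array.
  final case class ArrView(arr: Array[Long], from: Int, until: Int) {
    def size: Int = until - from
    def apply(i: Int): Long = arr(from + i)
  }

  // Two illustrative pivot strategies expressed as plain functions.
  val firstElement: ArrView => Long = v => v(0)
  val middleElement: ArrView => Long = v => v(v.size / 2)

  // The strategy lives in an implicit parameter list, so callers can omit it
  // when an implicit value of type ArrView => Long is in scope.
  def pivotFor(view: ArrView)(implicit choosePivot: ArrView => Long): Long =
    choosePivot(view)

  def demo(): (Long, Long) = {
    val view = ArrView(Array(4L, 8L, 15L, 16L, 23L), from = 1, until = 4)
    implicit val default: ArrView => Long = middleElement
    // First call resolves the implicit; second call overrides it explicitly.
    (pivotFor(view), pivotFor(view)(firstElement))
  }
}
```

Here `ImplicitPivotSketch.demo()` returns `(15, 8)`: the window covers 8, 15, 16, so the middle-element strategy picks 15 and the explicitly passed first-element strategy picks 8. The appeal of the design is that the default strategy stays out of most call sites while tests can still swap in a deterministic one.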

cindyyuanjiang

Thanks @amahussein for this refactor!

amahussein added the core_tools Scope the core module (scala) label Dec 19, 2024

amahussein self-assigned this Dec 19, 2024

amahussein requested review from nartal1, parthosa and cindyyuanjiang December 19, 2024 18:57

amahussein mentioned this pull request Dec 19, 2024

[BUG] Investigate long execution time of eventlogs #1461

Closed

remove empty lines

c7a4dbb

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

parthosa approved these changes Dec 19, 2024


cindyyuanjiang approved these changes Dec 20, 2024


cindyyuanjiang merged commit 3db52ef into NVIDIA:dev Dec 20, 2024
15 checks passed
