
[BUG] AutoTuner recommendation for spark.sql.shuffle.partitions is not accurate #575

Closed
mattahrens opened this issue Sep 19, 2023 · 2 comments · Fixed by #722
Labels: bug (Something isn't working), core_tools (Scope the core module (scala))

Comments

@mattahrens (Collaborator)

The current logic for the AutoTuner recommendation for spark.sql.shuffle.partitions is a simple heuristic based on spill metrics. Code ref: https://github.com/NVIDIA/spark-rapids-tools/blob/dev/core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/AutoTuner.scala#L796-L817.

We should enhance the recommendation logic to include more factors to improve accuracy. Options for additional factors include GC time, data size, data skew, etc.
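A rough sketch of what a multi-factor heuristic could look like (all names here, such as `StageMetrics`, `recommendShufflePartitions`, and the thresholds, are illustrative assumptions rather than the existing AutoTuner API): gate on shuffle stages, combine spill volume with GC time, and size partitions from shuffle read bytes instead of a fixed bump.

```scala
// Hypothetical sketch, not the actual AutoTuner code: combine spill and GC time
// instead of relying on spill metrics alone when recommending
// spark.sql.shuffle.partitions.
object ShufflePartitionHeuristicSketch {

  // Simplified per-stage aggregates that the heuristic would consume.
  case class StageMetrics(
      stageId: Int,
      hasShuffle: Boolean,
      spillBytes: Long,
      gcTimeMs: Long,
      taskTimeMs: Long,
      shuffleReadBytes: Long)

  // Assumed tunables; real thresholds would need to be validated.
  val SpillThresholdBytes: Long = 1L * 1024 * 1024 * 1024 // 1 GB of spill
  val GcTimeFractionThreshold: Double = 0.25              // 25% of task time in GC
  val TargetBytesPerPartition: Long = 128L * 1024 * 1024  // ~128 MB per shuffle partition

  def recommendShufflePartitions(
      current: Int,
      stages: Seq[StageMetrics]): Option[Int] = {
    // Only consider stages that actually shuffle; spill in scan-only stages
    // should not drive this config (see the discussion below).
    val shuffleStages = stages.filter(_.hasShuffle)
    if (shuffleStages.isEmpty) return None

    val spilled = shuffleStages.exists(_.spillBytes > SpillThresholdBytes)
    val gcBound = shuffleStages.exists { s =>
      s.taskTimeMs > 0 && s.gcTimeMs.toDouble / s.taskTimeMs > GcTimeFractionThreshold
    }

    if (spilled || gcBound) {
      // Size partitions off the largest shuffle read rather than a fixed multiplier.
      val maxShuffleRead = shuffleStages.map(_.shuffleReadBytes).max
      val suggested = math.max(current, (maxShuffleRead / TargetBytesPerPartition).toInt)
      if (suggested > current) Some(suggested) else None
    } else {
      None
    }
  }
}
```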

@mattahrens added the bug (Something isn't working), ? - Needs Triage, and core_tools (Scope the core module (scala)) labels and removed the ? - Needs Triage label on Sep 19, 2023
@revans2 (Collaborator) commented Oct 24, 2023

We should also look at the stage that the spill happened in. Mainly, if the spill only happened on a stage that does a Parquet read, then increasing the number of shuffle partitions will have no impact, except to slow down the processing. We would want to potentially reduce the max partition bytes in those cases.
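A minimal sketch of that idea, assuming hypothetical `StageInfoLite` records and an `isScanStage` flag (not the real AutoTuner types): if spill is confined to scan stages, the suggestion would shift to lowering spark.sql.files.maxPartitionBytes rather than raising spark.sql.shuffle.partitions.

```scala
// Hypothetical sketch: classify where the spill happened and pick the config
// to adjust accordingly.
object SpillLocationSketch {

  case class StageInfoLite(stageId: Int, isScanStage: Boolean, spillBytes: Long)

  sealed trait Recommendation
  case object IncreaseShufflePartitions extends Recommendation
  case object ReduceMaxPartitionBytes extends Recommendation
  case object NoChange extends Recommendation

  def classifySpill(stages: Seq[StageInfoLite]): Recommendation = {
    val spillingStages = stages.filter(_.spillBytes > 0)
    if (spillingStages.isEmpty) NoChange
    // Spill only in scan stages: more shuffle partitions will not help,
    // so suggest reducing spark.sql.files.maxPartitionBytes instead.
    else if (spillingStages.forall(_.isScanStage)) ReduceMaxPartitionBytes
    else IncreaseShufflePartitions
  }
}
```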

@kuhushukla (Collaborator)

Yes, I think the first step is to make the partition recommendation only when shuffle stages are involved, because otherwise the recommendation is misleading. For GC time, we should look at it as a percentage of total task time. For example, if we are spending > 25% of the task time in GC (this is just a placeholder number), then we can use that as an indicator for a config change, again only if shuffle stages are involved. In my opinion we should tackle data skew by highlighting it rather than just making a config recommendation, because sometimes there is not much you can do when skew is present. Additionally, increasing shuffle partitions will help only the stages that dominate the time spent. A sketch of this gating follows below.
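For illustration only, a sketch of the gating described above (`TaskTimeByStage` and both thresholds are assumed placeholders, not the AutoTuner's types): recommend a change only when shuffle stages dominate total task time and GC exceeds the 25% fraction.

```scala
// Hypothetical sketch of the gating logic from this comment.
object GcAndDominanceGateSketch {

  case class TaskTimeByStage(stageId: Int, hasShuffle: Boolean, taskTimeMs: Long, gcTimeMs: Long)

  val GcFraction = 0.25       // "> 25% of task time in GC"; placeholder number
  val DominanceFraction = 0.5 // assumed: shuffle stages must be > 50% of task time

  def shouldRecommendChange(stages: Seq[TaskTimeByStage]): Boolean = {
    val totalTime = stages.map(_.taskTimeMs).sum
    if (totalTime == 0) return false

    val shuffleStages = stages.filter(_.hasShuffle)
    val shuffleTime = shuffleStages.map(_.taskTimeMs).sum
    val shuffleGc = shuffleStages.map(_.gcTimeMs).sum

    // Only flag a config change when shuffle stages dominate the task time
    // and GC within those stages exceeds the threshold fraction.
    val shuffleDominates = shuffleTime.toDouble / totalTime > DominanceFraction
    val gcHeavy = shuffleTime > 0 && shuffleGc.toDouble / shuffleTime > GcFraction

    shuffleDominates && gcHeavy
  }
}
```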
