The current logic for the AutoTuner recommendation for spark.sql.shuffle.partitions is a simple heuristic based on spill metrics. Code ref: https://github.com/NVIDIA/spark-rapids-tools/blob/dev/core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/AutoTuner.scala#L796-L817.

We should enhance the recommendation logic to include more factors to improve accuracy. Options for additional factors include GC time, data size, and data skew.
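For context, a spill-only rule has roughly the following shape. This is an illustrative sketch only, not the actual AutoTuner code linked above; the multiplier and the "any spill at all" trigger are placeholders:

```scala
// Illustrative sketch of a spill-only heuristic (not the real AutoTuner logic).
// It recommends more shuffle partitions whenever any task spilled, regardless
// of which stage the spill happened in -- which is exactly the limitation
// this issue is about.
def recommendShufflePartitions(currentPartitions: Int,
                               totalSpilledBytes: Long): Option[Int] = {
  if (totalSpilledBytes > 0) Some(currentPartitions * 2) else None
}
```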
We should also look at the stage that the spill happened in. If the spill only happened in a stage that does a Parquet read, then increasing the number of shuffle partitions will have no impact other than slowing down the processing. In those cases we would want to potentially reduce the max partition bytes (spark.sql.files.maxPartitionBytes) instead.
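A minimal sketch of that stage-aware check, assuming a per-stage spill summary is available; the case class and field names here are hypothetical, not types from the tools codebase:

```scala
// Hypothetical per-stage spill summary; field names are illustrative only.
case class StageSpill(stageId: Int, hasShuffle: Boolean, isParquetScan: Boolean, spilledBytes: Long)

// Pick which config to target based on where the spill actually occurred.
def spillRecommendation(stages: Seq[StageSpill]): Option[String] = {
  val spilled = stages.filter(_.spilledBytes > 0)
  if (spilled.isEmpty) {
    None
  } else if (spilled.forall(s => s.isParquetScan && !s.hasShuffle)) {
    // Spill is confined to scan stages: more shuffle partitions will not help,
    // so suggest shrinking the input splits instead.
    Some("reduce spark.sql.files.maxPartitionBytes")
  } else {
    Some("increase spark.sql.shuffle.partitions")
  }
}
```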
Yes, I think the first step is to make the partition recommendations only when shuffle stages are involved, because otherwise the recommendation is misleading. For GC time we should look at the percentage of total task time: for example, if we are spending more than 25% of the task time in GC (this is just a number), we can use that as an indicator for a config change, again only if shuffle stages are involved. In my opinion we should tackle data skew by highlighting it rather than just making a config recommendation, because sometimes there is not much you can do when skew is present. Additionally, increasing shuffle partitions will only help the stages that dominate based on time spent.
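A sketch of the GC-time indicator described above, gated on shuffle stages. The metrics type and the 25% threshold are placeholders taken from the comment, not values or types from the codebase:

```scala
// Hypothetical per-stage metrics; names are illustrative only.
case class StageMetrics(hasShuffle: Boolean, taskTimeMs: Long, gcTimeMs: Long)

// Flag a config change only when GC time in shuffle stages exceeds a
// placeholder threshold (25% of total task time, per the discussion above).
def gcIndicatesTuning(stages: Seq[StageMetrics], gcThreshold: Double = 0.25): Boolean = {
  val shuffleStages = stages.filter(_.hasShuffle)
  val totalTask = shuffleStages.map(_.taskTimeMs).sum
  val totalGc = shuffleStages.map(_.gcTimeMs).sum
  totalTask > 0 && totalGc.toDouble / totalTask > gcThreshold
}
```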