Improve shuffle manager recommendation in AutoTuner with version validation #1483
Conversation
In case an unsupported Spark version is detected then ShuffleManagerResolver returns an appropriate error comment.

"Unsupported Spark version" — I believe we are talking here about the spark-version extracted from the eventlog. Does this mean we are going to show an error whenever an eventlog is parsed with a Spark version LT 3.2.0? Keeping in mind that it is common to run Qualification on old eventlogs (e.g., Spark 2.x), that would mean we show no recommendations at all.
The AutoTuner needs to maintain 2 different spark-versions:
- actual spark-version: which is extracted from eventlog info.
- target spark-version: this should be the target spark-version that will run the GPU.
If the eventlog has an old Spark version, the AutoTuner should map it to a supported version. A discussion is needed here to decide how this is done: for example, should we always map to the most recent version for all platforms, or only when the spark-version is not supported?
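A minimal sketch of the two-version idea above, assuming the plugin's minimum supported version is 3.2.0 (the object name, method name, and mapping policy are illustrative, not from this PR):

```scala
// Illustrative sketch only: map the eventlog's actual spark-version to a
// target spark-version that the GPU plugin supports.
object TargetSparkVersion {
  // Assumption from this thread: the plugin no longer supports Spark < 3.2.0.
  private val minSupported = (3, 2, 0)

  private def parse(v: String): (Int, Int, Int) = {
    val p = v.split("\\.").map(_.toInt)
    (p(0), if (p.length > 1) p(1) else 0, if (p.length > 2) p(2) else 0)
  }

  /** Return a supported target version for the given eventlog version. */
  def targetFor(actual: String): String = {
    import scala.math.Ordering.Implicits._
    if (parse(actual) < minSupported) "3.2.0" else actual
  }
}
```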
```scala
 * - This can be extended to support more version mappings (e.g. Cloudera).
 */
// scalastyle:on line.size.limit
object ShuffleManagerResolver {
```
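The quoted snippet cuts off at the object header. Based on the PR description (version-to-class-name mappings, with an `Either` for the error case), a resolver of this shape might look roughly like the following — a sketch, not the merged code; the version keys and method name are assumptions, and only the class-name pattern is the plugin's:

```scala
// Sketch of the resolver shape described in this PR.
object ShuffleManagerResolverSketch {
  // Plugin naming pattern: com.nvidia.spark.rapids.spark<build>.RapidsShuffleManager
  private val sparkVersionToBuild = Map(
    "3.2.0" -> "320",
    "3.3.0" -> "330",
    "3.4.1" -> "341") // illustrative subset of versions

  def resolve(sparkVersion: String): Either[String, String] = {
    sparkVersionToBuild.get(sparkVersion) match {
      case Some(build) =>
        Right(s"com.nvidia.spark.rapids.spark$build.RapidsShuffleManager")
      case None =>
        Left(s"Cannot recommend RapidsShuffleManager for unsupported Spark " +
          s"version: $sparkVersion")
    }
  }
}
```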
This is kind of a platform-specific thing, right? Why can't we have it in the Platform class?
Yes. I will refactor this to make it platform-specific.

Yes, since the plugin no longer supports Spark versions LT 3.2.0, the latest versions of the plugin will not provide a RAPIDS Shuffle Manager for these older Spark versions.

I think my description was incorrect. This change does not remove all recommendations. In the tuning file, the recommendation for `spark.shuffle.manager` now displays an error comment instead.

I think we should recommend upgrading the Spark version only when it is not supported. It may be disruptive to always recommend upgrading to the latest Spark version.

I agree with that. We should confirm with @mattahrens and Felix which behavior they are expecting from the AutoTuner.
I think instead of recommending a specific upgraded version of Spark, we should allow the user to choose the Spark version themselves (as there may be other concerns when upgrading their Spark version). I have updated the comment to specify the configuration required to enable the RAPIDS Shuffle Manager when an unsupported Spark version is detected. This gives users an actionable next step.
In general, I agree that we want to provide recommendations that are as detailed as possible. In this case, since they have an old version of Spark that we don't support and we aren't sure which specific version of Spark they will move to, a comment is fine that says we can't recommend the version they have. The comment would expand to include an example setting for the shuffle manager for a supported version (e.g. Spark 3.4.1), so they would know how to do it themselves when they upgrade their Spark version.
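For instance, the expanded comment could point at a line like the one below. The 3.4.1 class name follows the plugin's `com.nvidia.spark.rapids.spark<build>.RapidsShuffleManager` naming pattern; treat it as an example, not the PR's exact wording:

```
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark341.RapidsShuffleManager
```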
This reverts commit 2a84eda.
Thanks @mattahrens and @amahussein for the feedback. I also discussed this with Felix. He agreed that the primary goal should be to provide users with an initial configuration to get something running, and he agreed with Matt's idea to expand the comment to include an example configuration for a supported version.

Follow-up questions from our discussion:

Proposed actions:
For Non-Databricks:
For Databricks:
Thanks @parthosa!

The Qual's AutoTuner is supposed to combine the configurations and show them in a single file. That was a feature request by customers.
Yes, it will be missing from the final config file with the combined configurations. I observed that there are some variables in the combined configurations file. Can we include our shuffle manager configuration in that file using a similar approach?

I am doubtful about recommending the exact version, as upgrading Spark involves several factors, and I think there are concerns in both cases.

Also, I think the users might not be able to use the combined file directly for other reasons as well: in the combined configuration file, we are not showing the expected plugin JAR. This seems to be a major issue, since none of the configurations would work if the JAR is not provided. Should we file an issue to fix this?
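For context, supplying the plugin JAR referenced above typically involves settings along these lines (a sketch; the path and version are placeholders):

```
spark.plugins=com.nvidia.spark.SQLPlugin
spark.driver.extraClassPath=/path/to/rapids-4-spark_2.12-<version>.jar
spark.executor.extraClassPath=/path/to/rapids-4-spark_2.12-<version>.jar
```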
Good point. The combined configuration was released as a POC, and we had plans to iterate on it to improve it.
Thanks, @amahussein, for sharing your thoughts. The discussion above introduced new ideas to improve the AutoTuner output, and we should iterate to improve the combined configs output. For this PR, we could recommend configurations for the closest supported Spark version. However, we should explicitly highlight that the generated AutoTuner configurations are invalid until Spark is upgraded. One idea could be to use a label to mark them as such.

Proposed AutoTuner's Output:
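A sketch of the label idea (the label text and wording are assumptions, not the actual AutoTuner output):

```
# [Pending Spark upgrade] The recommendation below assumes an upgrade to a
# supported Spark version (e.g. 3.4.1) and is not valid for the current one.
spark.shuffle.manager=com.nvidia.spark.rapids.spark341.RapidsShuffleManager
```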
Thanks @parthosa!
Thanks @parthosa

We can proceed with this PR for now and leave the handling of missing features as a separate task/follow-up.

The objective is to make the output of the AutoTuner almost ready for initial migrated jobs. At the least, it should be easy to pull all the missing information from the combined config. Otherwise, users have to look at two different files and figure out whether any feature was omitted and how that impacts their job.
Yes, @amahussein, that makes sense. I have created issue #1489 to improve the AutoTuner to generate functional configuration files that are ready for direct use.
Fixes #1402
This PR improves the `AutoTuner` module by adding a new `ShuffleManagerResolver` object to better handle the resolution of the appropriate `RapidsShuffleManager` class name based on the Spark or Databricks version. In case an unsupported Spark version is detected, `ShuffleManagerResolver` returns an appropriate error comment.

Output Example for Unsupported Spark Version
CMD:
Output:
File: `qual_2025xxx/rapids_4_spark_qualification_output/tuning/application_xxx.log`
Contents:
For Non-Databricks:
For Databricks:
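In rough shape, the two variants differ only in the shuffle manager class-name pattern; the examples below follow the plugin's naming scheme and are illustrative, not the tool's verbatim output:

```
# Non-Databricks (e.g. targeting Spark 3.4.1):
spark.shuffle.manager=com.nvidia.spark.rapids.spark341.RapidsShuffleManager
# Databricks (e.g. targeting a 3.3.0-based runtime):
spark.shuffle.manager=com.nvidia.spark.rapids.spark330db.RapidsShuffleManager
```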
Changes

Enhancements to Shuffle Manager Resolution:
- Added a `ShuffleManagerResolver` object to determine the appropriate `RapidsShuffleManager` class name based on Spark or Databricks version mappings. (`core/src/main/scala/com/nvidia/spark/rapids/tool/tuning/AutoTuner.scala`)

Updates to `AutoTuner` Class:
- Updated the `getShuffleManagerClassName` method to use `ShuffleManagerResolver` for resolving the class name, returning either the class name or an error message. (`core/src/main/scala/com/nvidia/spark/rapids/tool/tuning/AutoTuner.scala`)
- Updated the `AutoTuner` class to handle the new `Either` type returned by `getShuffleManagerClassName`. (`core/src/main/scala/com/nvidia/spark/rapids/tool/tuning/AutoTuner.scala`)

Tests:
- Added tests in `ProfilingAutoTunerSuite` to verify the recommended shuffle manager version for various supported and unsupported Spark and Databricks versions. (`core/src/test/scala/com/nvidia/spark/rapids/tool/tuning/ProfilingAutoTunerSuite.scala`)
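As a rough illustration of what such checks look like (written against the `ShuffleManagerResolverSketch` shown earlier in this thread, not the suite's actual code):

```scala
import org.scalatest.funsuite.AnyFunSuite

// Illustrative only: exercises the resolver sketch, not the real
// ProfilingAutoTunerSuite cases.
class ShuffleManagerResolverSketchSuite extends AnyFunSuite {
  test("supported Spark version resolves to a RapidsShuffleManager class") {
    assert(ShuffleManagerResolverSketch.resolve("3.4.1") ===
      Right("com.nvidia.spark.rapids.spark341.RapidsShuffleManager"))
  }

  test("unsupported Spark version yields an error comment") {
    assert(ShuffleManagerResolverSketch.resolve("2.4.8").isLeft)
  }
}
```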