
Improve shuffle manager recommendation in AutoTuner with version validation #1483

Merged
merged 9 commits into NVIDIA:dev from spark-rapids-tools-1402 on Jan 7, 2025

Conversation


@parthosa parthosa commented Jan 2, 2025

Fixes #1402

This PR improves the AutoTuner module by adding a new ShuffleManagerResolver object that resolves the appropriate RapidsShuffleManager class name based on the Spark or Databricks version. If an unsupported Spark version is detected, ShuffleManagerResolver returns an appropriate error comment instead.
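
A rough Scala sketch of the resolver idea (the version tables and method signature below are illustrative, not the exact ones in AutoTuner.scala):

// Minimal sketch: map version prefixes to RapidsShuffleManager package
// suffixes and return Either an error comment or the class name.
object ShuffleManagerResolver {
  // Illustrative entries only (e.g., Spark 3.5.1 -> spark351).
  private val sparkVersionMap = Seq(
    "3.2.0" -> "320",
    "3.4.1" -> "341",
    "3.5.1" -> "351")
  // Illustrative Databricks runtime mapping (e.g., 13.3 -> spark341db).
  private val databricksVersionMap = Seq(
    "13.3" -> "341db")

  // Returns the fully qualified class name, or an error comment on the Left.
  def resolve(isDatabricks: Boolean, version: String): Either[String, String] = {
    val table = if (isDatabricks) databricksVersionMap else sparkVersionMap
    table.collectFirst {
      case (ver, suffix) if version.startsWith(ver) =>
        s"com.nvidia.spark.rapids.spark$suffix.RapidsShuffleManager"
    }.toRight(
      s"Cannot recommend RAPIDS Shuffle Manager for unsupported version: '$version'.")
  }
}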

Output Example for Unsupported Spark Version

CMD:

spark_rapids qualification --platform <platform> --eventlogs <any_spark_313_eventlog> --tools_jar <tools_jar>

Output:
File: qual_2025xxx/rapids_4_spark_qualification_output/tuning/application_xxx.log
Contents:

For Non-Databricks:

- Cannot recommend RAPIDS Shuffle Manager for unsupported Spark version: '3.1.3'.
  To enable RAPIDS Shuffle Manager, use a supported Spark version (e.g., '3.5.1')
  and set: '--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark351.RapidsShuffleManager'.
  See supported versions: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/rapids-shuffle.html#rapids-shuffle-manager.

For Databricks:

- Cannot recommend RAPIDS Shuffle Manager for unsupported Databricks runtime: '10.4.x-scala2.12'.
  To enable RAPIDS Shuffle Manager, use a supported Databricks runtime (e.g., '13.3')
  and set: '--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark341db.RapidsShuffleManager'.
  See supported versions: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/rapids-shuffle.html#rapids-shuffle-manager.

Changes

Enhancements to Shuffle Manager Resolution:

  • Added ShuffleManagerResolver object to determine the appropriate RapidsShuffleManager class name based on Spark or Databricks version mappings. (core/src/main/scala/com/nvidia/spark/rapids/tool/tuning/AutoTuner.scala)

Updates to AutoTuner Class:

  • Modified getShuffleManagerClassName method to use ShuffleManagerResolver for resolving the class name and returning either the class name or an error message. (core/src/main/scala/com/nvidia/spark/rapids/tool/tuning/AutoTuner.scala)
  • Updated the logic in the AutoTuner class to handle the new Either type returned by getShuffleManagerClassName (see the sketch below). (core/src/main/scala/com/nvidia/spark/rapids/tool/tuning/AutoTuner.scala)
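
A minimal, self-contained Scala sketch of how that Either can be consumed (applyShuffleManagerResult and the println calls are illustrative stand-ins for the actual AutoTuner helpers):

// Illustrative stand-in for the actual AutoTuner handling.
def applyShuffleManagerResult(result: Either[String, String]): Unit = result match {
  case Right(className) =>
    // Supported version: recommend the shuffle manager config.
    println(s"--conf spark.shuffle.manager=$className")
  case Left(errorComment) =>
    // Unsupported version: surface the explanatory comment instead.
    println(s"- $errorComment")
}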

Tests:

  • Added helper methods and new test cases in ProfilingAutoTunerSuite to verify the recommended shuffle manager version for various supported and unsupported Spark and Databricks versions. (core/src/test/scala/com/nvidia/spark/rapids/tool/tuning/ProfilingAutoTunerSuite.scala) [1] [2] [3]

@parthosa parthosa added bug Something isn't working core_tools Scope the core module (scala) labels Jan 2, 2025
@parthosa parthosa self-assigned this Jan 2, 2025
Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa marked this pull request as ready for review January 3, 2025 16:50
@amahussein amahussein left a comment

In case an unsupported Spark version is detected then ShuffleManagerResolver returns an appropriate error comment.

"unsupported Spark version" I believe we talk here about the spark-version extracted from the eventlog.

Does this mean that we are going to show an error whenever an eventlog is parsed with a spark version LT 3.2.0 ? Keeping in mind that it is common to run Qualification on old eventlogs (i.e., Spark2.x), then we are not going to show any recommendations.

The AutoTuner needs to maintain 2 different spark-versions:

  1. actual spark-version: extracted from the eventlog info.
  2. target spark-version: the version that will actually run on the GPU.

If the eventlog has an old Spark version, AutoTuner should map it to a supported version. A discussion is needed here to decide how this is done. For example, should we always map to the most recent version for all platforms, or should we do that only if the spark-version is not supported?

* - This can be extended to support more version mappings (e.g. Cloudera).
*/
// scalastyle:on line.size.limit
object ShuffleManagerResolver {
Collaborator

This is kind of a platform-specific thing, right? Why can't we have it in the Platform class?

Collaborator Author

Yes. I will refactor this to make it platform specific.


parthosa commented Jan 3, 2025

Does this mean that we are going to show an error whenever an eventlog is parsed with a Spark version below 3.2.0?

Yes, since the plugin no longer supports Spark versions below 3.2.0, the latest versions of the plugin will not provide a RAPIDS Shuffle Manager for these older Spark versions.

Keeping in mind that it is common to run Qualification on old eventlogs (e.g., Spark 2.x), that would mean we show no recommendations for them.

I think my description was incorrect. This change does not remove all recommendations. In the tuning file, the recommendation for spark.shuffle.manager is simply replaced. Instead of:

--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark311.RapidsShuffleManager

it now displays:

- Cannot recommend RAPIDS Shuffle Manager for unsupported '3.1.1' version.
  See supported versions: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/rapids-shuffle.html#rapids-shuffle-manager.

If the eventlog has an old spark version, AutoTuner should map it to a supported version. A discussion here is needed to decide how this done. For example should we always map to the most recent version for all the platforms, or should we do that only if the spark-version is not supported.

I think we should recommend upgrading the Spark version only when it is not supported. It may be disruptive to always recommend upgrading to the latest Spark version.


amahussein commented Jan 3, 2025

I think we should recommend upgrading the Spark version only when it is not supported. It may be disruptive to always recommend upgrading to the latest Spark version.

I agree with that.
However, IMHO the main focus of the AutoTuner is to provide customers with an initial config to get something running. This is an important requirement for the success of the process.
Omitting those recommendations means that users have to reach back out to resolve the blanks. That makes the process more difficult.

We should confirm with @mattahrens and Felix which behavior they are expecting from the AutoTuner.


parthosa commented Jan 3, 2025

However, IMHO the main focus of the AutoTuner is to provide customers with an initial config to get something running. This is an important requirement for the success of the process.

I think instead of recommending a specific upgraded version of Spark, we should allow the user to choose the Spark version themselves (as there may be other concerns when upgrading their Spark version).

I have updated the comment to specify the configuration required to enable the RAPIDS Shuffle Manager when an unsupported Spark version is detected. This gives users an actionable next step:

- Cannot recommend RAPIDS Shuffle Manager for unsupported '3.1.3' version.
  To enable RAPIDS Shuffle Manager, use a supported Spark version and set
  'spark.shuffle.manager' to a valid RAPIDS Shuffle Manager version.
  See supported versions: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/rapids-shuffle.html#rapids-shuffle-manager.

Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa requested a review from amahussein January 3, 2025 23:38

mattahrens commented Jan 6, 2025

In general, I agree that we want to provide recommendations that are as detailed as possible. In this case, since they have an old version of Spark that we don't support and we aren't sure which specific version of Spark they will move to, a comment saying we can't make a recommendation for the version they have is fine. The comment should expand to show an example shuffle manager setting for a supported version (e.g., Spark 3.4.1) so they would know how to set it themselves when they upgrade their Spark version.


parthosa commented Jan 6, 2025

Thanks @mattahrens and @amahussein for the feedback.

I also discussed this with Felix. He agreed that the primary goal should be to provide users with an initial configuration to get something running. He agreed with Matt’s idea to expand the comment to include an example configuration for a supported version.

Follow-up Questions from our discussion:

  • Can we have multiple tuning output files for multiple Spark versions?
    • I think it would make more sense to have version-specific files if we include more version-specific recommendations. Currently, only the shuffle manager recommendation is Spark version-specific.
  • Can we update the official documentation page with an example configuration?
    • I will file an internal issue to update the documentation with this information.

Proposed Actions:

  1. Update the comment as follows (using the latest supported version for the sample):

For Non-Databricks:

- Cannot recommend RAPIDS Shuffle Manager for unsupported Spark version: '3.1.3'.
  To enable RAPIDS Shuffle Manager, use a supported Spark version (e.g., '3.5.1')
  and set: '--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark351.RapidsShuffleManager'.
  See supported versions: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/rapids-shuffle.html#rapids-shuffle-manager.

For Databricks:

- Cannot recommend RAPIDS Shuffle Manager for unsupported Databricks runtime: '10.4.x-scala2.12'.
  To enable RAPIDS Shuffle Manager, use a supported Databricks runtime (e.g., '13.3')
  and set: '--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark341db.RapidsShuffleManager'.
  See supported versions: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/rapids-shuffle.html#rapids-shuffle-manager.
  2. File an internal issue to update the documentation with an example configuration.

@amahussein

Thanks @parthosa !

I also discussed this with Felix. He agreed that the primary goal should be to provide users with an initial configuration to get something running. He agreed with Matt’s idea to expand the comment to include an example configuration for a supported version.

The Qual's AutoTuner is supposed to combine the configurations and show them in a single file. That was a feature request from customers.
The current behavior is going to give us a hard time satisfying that request. Since the recommendation only appears in the comments (i.e., verbose/explanation), this implies that the final config file from the Qual's AutoTuner will be missing it, right?


parthosa commented Jan 6, 2025

this implies that the final config file from the Qual's AutoTuner will be missing it, right?

Yes, it will be missing from the final config file with the combined configurations.

I observed that there are some variables in the combined configurations file:

--conf spark.executorEnv.PYTHONPATH={{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.9-src.zip
--conf spark.master=yarn
--conf spark.metrics.namespace=app_name:${spark.app.name}.app_id:${spark.app.id}

Can we include our shuffle manager configuration in that file using a similar approach?

--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark{{SUPPORTED_SHUFFLE_MANAGER_VER}}.RapidsShuffleManager
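
For illustration, such a placeholder could be filled in by the user or a wrapper script once the target version is known (fillShuffleManagerVersion is a hypothetical helper, just a sketch of the substitution idea):

// Hypothetical post-processing: substitute the placeholder once the
// target Spark version suffix (e.g., "351") is known.
def fillShuffleManagerVersion(conf: String, versionSuffix: String): String =
  conf.replace("{{SUPPORTED_SHUFFLE_MANAGER_VER}}", versionSuffix)

// fillShuffleManagerVersion(line, "351") yields:
// --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark351.RapidsShuffleManager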

I am doubtful about recommending an exact version, as upgrading Spark involves several factors, and I think there are concerns in both cases:

  • Recommending the next supported version: For example, if customers are on Spark 3.1.1, they might not want to upgrade to the next supported version, Spark 3.2.0 (which is old now), but instead prefer a more recent stable version (say, Spark 3.5.0).
  • Recommending the latest supported version: If customers are using Spark 3.4.4 and we introduce support for Spark 4.0.0 in the coming months, they might avoid upgrading to the latest Spark 4.0.0, as it is relatively new and unstable, and instead prefer the most recent stable supported version (say, 3.5.0).

Also, I think users might not be able to use the combined file directly for other reasons as well:

  • In some cases the combined config file will contain redacted configs. Example:
--conf spark.databricks.cloudfetch.requestDownloadUrlsWithHeaders=*********(redacted)
--conf spark.databricks.cloudfetch.requesterClassName=*********(redacted)
  • In some cases, one of the AutoTuner's recommended comments is
- RAPIDS Accelerator for Apache Spark jar is missing in "spark.plugins". Please refer to https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/overview.html

and in the combined configuration file, we are not showing the expected plugin JAR:

--conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin

This one seems to be a major issue since none of the configurations would work if the JAR is not provided. Should we file an issue to fix this?

@amahussein

Should we file an issue to fix this?

Good point. The combined configuration was released as a POC, and we had plans to iterate on it to improve it.
I will document some of my thoughts here, and we can always discuss offline.

  • Redacted values: Spark eventlogs have redactions. In addition, the AutoTuner also redacts some values in case users are concerned that this file is stored in the Tools output. Without redaction, there would be security concerns that sensitive data is stored on disk, and the tools output folder could not be shared with NVIDIA engineers.
  • CSP configurations: some configs are set by the CSP environments without the user's involvement. The auto-combiner needs to exclude those from the final output.
  • Missing key configurations:
    • My concern with omitting recommendations like the one in this PR is that users have no way to tell that their combined config is missing something.
    • Spotting those configs requires someone with experience in the RAPIDS plugin and the Spark environment, who knows what to look for.
    • To remedy that, we might need to find a way to divide the combined config file into sections while still allowing the user to use it programmatically if necessary.
  • Adding variables in the AutoTuner output:
    • That's a good suggestion. For it to work, we need to make the variables standard. Otherwise, there will be no way to grab those variables and set them programmatically.
  • Spark versions:
    • Moving forward, I can see that the shuffleManagerClass won't be the last key recommendation to depend on the Spark version. The RAPIDS plugin is going to keep dropping older Spark releases, especially in the DB environment with its frequent releases. That's why I believe we need to decide how to handle this so that our next iterations will be smoother.
    • Suggesting a different Spark version is problematic, and not only because of the GPU configs. For CPU configs, there are some key differences between configurations across versions. Therefore, it is very challenging.
    • There are some ideas we can apply to avoid being aggressive in recommending Spark versions. For example:
      • we can set a "target-spark" version that can be overridden by the user;
      • not suggest upgrading across a major release (i.e., Spark 3.x does not get Spark 4.0.0 configurations). Spark 2.x users are an exception, but those users have many other API changes to deal with anyway;
      • map Spark to the closest supported version, e.g., Spark 3.1.x can be mapped to Spark 3.2.x, which is less invasive (see the sketch below).
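
A minimal Scala sketch of that last idea, assuming versions compare as (major, minor) pairs (closestSupported is a hypothetical helper, not existing code):

// Map an unsupported version to the closest supported one within the
// same major release (e.g., 3.1.3 -> 3.2.0).
def closestSupported(actual: String, supported: Seq[String]): Option[String] = {
  def key(v: String): (Int, Int) = {
    val parts = v.split("\\.")
    (parts(0).toInt, if (parts.length > 1) parts(1).toInt else 0)
  }
  val (maj, min) = key(actual)
  supported
    .map(s => (key(s), s))
    .filter { case ((sMaj, sMin), _) => sMaj == maj && sMin >= min }
    .sortBy(_._1)   // pick the smallest qualifying version
    .map(_._2)
    .headOption
}
// closestSupported("3.1.3", Seq("3.2.0", "3.4.1", "3.5.1")) == Some("3.2.0")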


parthosa commented Jan 7, 2025

Thanks, @amahussein, for sharing your thoughts. The discussion above introduced new ideas to improve the AutoTuner output. We should iterate on the combined configs output to improve it.

For this PR, we could recommend configurations for the closest supported Spark version. However, we should explicitly highlight that the generated AutoTuner configurations are invalid until Spark is upgraded. One idea could be to use a label (e.g., [UNSUPPORTED SPARK VERSION]). We could then sort the comments so that labeled comments are shown first (see the sketch after the example output). Let me know your thoughts on this.

Proposed AutoTuner's Output

### Recommended SPARK Configuration on GPU Cluster for App: application_1717059791409_0015 ###

Spark Properties:
..<other configs>..
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark320.RapidsShuffleManager
..<other configs>..

Comments:
- [UNSUPPORTED SPARK VERSION] Cannot recommend RAPIDS Shuffle Manager for unsupported Spark version: '3.1.3'.
  To use this feature, please upgrade to a supported Spark version.  
  Recommended configurations for the closest supported Spark version '3.2.0' have been generated above.
  See supported versions: https://docs.nvidia.com/spark-rapids/user-guide/latest/additional-functionality/rapids-shuffle.html#rapids-shuffle-manager.
..<other comments>..

@amahussein


Thanks @parthosa !
It is fine to keep the PR as it is. Then we can do a follow-up that targets the points we have discussed.
The point is that we don't want to improve the AutoTuner while navigating away from the original behavior or the existing features we already have in place (i.e., the auto-combiner).

@amahussein amahussein left a comment

Thanks @parthosa
We can proceed with this PR for now and leave the handling of missing features as a separate follow-up task.
The objective is to make the output of the AutoTuner almost ready for initial migrated jobs. At the very least, it should be easy to pull all of the missing information from the combined config. Otherwise, users have to look at two different files and figure out whether any feature was omitted and how that impacts their job.


parthosa commented Jan 7, 2025

Yes, @amahussein, that makes sense. I have created an issue #1489 to improve the AutoTuner to generate functional configuration files that are ready for direct use.

@parthosa parthosa merged commit dccf8b8 into NVIDIA:dev Jan 7, 2025
13 checks passed
@parthosa parthosa deleted the spark-rapids-tools-1402 branch January 7, 2025 21:37