Add Arguments for Distributed Mode in Qualification Tool CLI #1429

Conversation

Collaborator

@parthosa parthosa commented Nov 18, 2024

Fixes #1430.

This PR adds the initial changes needed in CLI to support distributed execution in the Qualification Tool CLI. It adds arguments to enable distributed mode and sets the stage for future implementation PRs.

Note:

  • An environment setup document will be shared internally.

Changes Overview

  • Extended RapidsJob: Introduced two subclasses, RapidsDistributedJob and RapidsLocalJob, plus a concrete class for the OnPrem platform (a rough sketch follows after this list).
  • Created a JarCmdArgs class to encapsulate all arguments needed to construct the JAR command.
  • Implemented the DistributedToolsConfig class, allowing configurations for distributed tools (like Spark properties) to be specified via the existing --tools_config_file option.
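
As a rough sketch of the split described above: the names RapidsJob, RapidsLocalJob, RapidsDistributedJob, and JarCmdArgs come from this PR, but the fields and method names below are only illustrative, not the actual implementation.

from dataclasses import dataclass, field
from typing import List


@dataclass
class JarCmdArgs:
    """Bundles everything needed to construct the tools JAR command (fields are illustrative)."""
    jvm_args: List[str] = field(default_factory=list)
    classpath: str = ''
    jar_main_class: str = ''
    extra_rapids_args: List[str] = field(default_factory=list)


class RapidsJob:
    """Base class; subclasses decide how the JAR command is executed."""
    def _build_jar_cmd_args(self) -> JarCmdArgs:
        raise NotImplementedError

    def run_job(self) -> None:
        raise NotImplementedError


class RapidsLocalJob(RapidsJob):
    """Runs the tools JAR as a local subprocess (the existing behavior)."""
    def run_job(self) -> None:
        cmd_args = self._build_jar_cmd_args()
        # ... spawn a local java process built from cmd_args ...


class RapidsDistributedJob(RapidsJob):
    """Submits the tools JAR as a Spark application when distributed mode is requested."""
    def run_job(self) -> None:
        cmd_args = self._build_jar_cmd_args()
        # ... build and submit a Spark application from cmd_args ...

The concrete OnPrem job class presumably picks one of these two paths based on the --submission_mode argument.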

CMD:

spark_rapids qualification --platform onprem --eventlogs /path/to/eventlogs  --verbose --filter_apps all \
 --submission_mode distributed --tools_config_file /path/to/custom_conf_file.yaml

Sample Config File:

api_version: '1.1'
runtime:
  dependencies:
    - name: my-spark350
      uri: https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
      dependency_type:
        dep_type: archive
        # for tgz files, it is required to give the subfolder where the jars are located
        relative_path: jars/*
submission:
  remote_cache_dir: 'hdfs:///tmp/spark_rapids_distributed_tools_cache'
  spark_properties:
    - name: 'spark.executor.memory'
      value: '20g'
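
For reference, the submission section above maps onto a small config model roughly like the sketch below. This assumes Pydantic v2 style models (as used elsewhere in the tools config); SubmissionConfig is an illustrative name rather than the exact class in the PR.

from typing import List
from pydantic import BaseModel, Field


class SparkProperty(BaseModel):
    """A single Spark property forwarded to the distributed submission."""
    name: str
    value: str


class SubmissionConfig(BaseModel):
    """Sketch of a model for the 'submission' section of the tools config file."""
    remote_cache_dir: str = Field(
        description='Intermediate HDFS directory where the distributed tasks cache their output.',
        examples=['hdfs:///tmp/spark_rapids_distributed_tools_cache'])
    spark_properties: List[SparkProperty] = Field(
        default_factory=list,
        description='Spark properties applied to the distributed submission.')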

Details:

Platform class updates:

  • user_tools/src/spark_rapids_pytools/cloud_api/databricks_aws.py, databricks_azure.py, dataproc.py, dataproc_gke.py, emr.py: Disabled pylint warnings for abstract methods.

@parthosa parthosa added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Nov 18, 2024
@parthosa parthosa self-assigned this Nov 18, 2024
@parthosa parthosa marked this pull request as ready for review November 18, 2024 21:47
Collaborator

@cindyyuanjiang cindyyuanjiang left a comment


thanks @parthosa! LGTM, just a few quick questions.

Collaborator

@amahussein amahussein left a comment


Thanks @parthosa!
Good job!
I think we can improve the config if we make "runtime" common to both submission modes.

description='Configuration related to the runtime environment of the tools.')

distributed_tools: Optional[DistributedToolsConfig] = Field(
Collaborator


Having runtime at the same level as distributed_tools is confusing. One would expect runtime to be a generic property that applies to all submission modes.
I suggest we either add a DistributedToolsRuntimeConfig that extends ToolsRuntimeConfig, or define an abstract RuntimeConfig that gets extended by the local and distributed implementations.
That way the format of the file will be consistent across both modes.
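
A minimal sketch of the suggested layout, assuming Pydantic v2 models: the class names ToolsRuntimeConfig and DistributedToolsRuntimeConfig come from the comment above, while the fields are only illustrative.

from typing import List, Optional
from pydantic import BaseModel, Field


class SparkProperty(BaseModel):
    name: str
    value: str


class ToolsRuntimeConfig(BaseModel):
    """Runtime settings shared by all submission modes (e.g. the dependencies list)."""


class DistributedToolsRuntimeConfig(ToolsRuntimeConfig):
    """Extends the common runtime section with distributed-only settings."""
    hdfs_output_dir: Optional[str] = Field(
        default=None,
        description='Intermediate HDFS directory used by the distributed tasks.')
    spark_properties: List[SparkProperty] = Field(default_factory=list)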

Collaborator Author


In this PR, we are adding more properties to the config file passed via --tools_config_file, so a user can now provide both runtime and distributed_tools together.

Example,

api_version: '1.0'
runtime:
  dependencies:
    - name: my-spark350
      uri: https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
      dependency_type:
        dep_type: archive
        # for tgz files, it is required to give the subfolder where the jars are located
        relative_path: jars/*
distributed_tools:
  hdfs_output_dir: 'hdfs:///tmp/spark_rapids_distributed_tools_cache'
  spark_properties:
  - name: 'spark.executor.memory'
    value: '20g'

I wanted to understand: why do we need a separate runtime config for distributed tools?

Collaborator Author


From offline discussion, updated the config file format to be as follows:

api_version: '1.1'
runtime:
  dependencies:
    - name: my-spark350
      uri: https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
      dependency_type:
        dep_type: archive
        # for tgz files, it is required to give the subfolder where the jars are located
        relative_path: jars/*
submission:
  remote_cache_dir: 'hdfs:///tmp/spark_rapids_distributed_tools_cache'
  spark_properties:
    - name: 'spark.executor.memory'
      value: '20g'

examples=['hdfs:///path/to/output/dir']
)

spark_properties: List[SparkProperty] = Field(
Collaborator


Technically, the spark_property configuration is not limited to distributed mode.
For local mode, there is a potential use case where a customer sets Spark properties to be loaded by the tools, for example FileSystem-related arguments to access the eventlogs.

Collaborator Author

@parthosa parthosa Dec 11, 2024


That is an interesting point. I think we should be specific about where each configuration property will be applied (in this case, it will be used for distributed tools mode).

If we intend to use Spark properties for additional purposes in the future, we could leverage the SparkProperty class to define a separate configuration property for that specific use case.

@@ -31,6 +31,7 @@
from spark_rapids_pytools.pricing.price_provider import SavingsEstimator


# pylint: disable=abstract-method
Collaborator


QQ: What prompted this change in all the cloud_api classes?

Collaborator Author

@parthosa parthosa Dec 11, 2024


The addition of the create_distributed_submission_job() method in PlatformBase required all CSPs to implement it. Since the CSPs do not support distributed mode yet, we would have to implement this method in every CSP module with the body as pass.

Currently, we use the above approach for methods such as:

def set_offline_cluster(self, cluster_args: dict = None):
   pass
        
def validate_job_submission_args(self, submission_args: dict) -> dict:
   pass

However, I think there are pros and cons to this approach.

Pros: In each CSP class, it is clear what is implemented and what is not.
Cons: It adds redundant code in all CSP classes.

By adding the pylint exception, it would not be mandatory for each CSP to define methods with body as pass. Let me know your thoughts on this.

Collaborator Author


From offline discussion, removed the disable rule for pylint and added create_distributed_submission_job() in each CSP.
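
A minimal sketch of what such a per-CSP override might look like; the class and parameter names are stand-ins, and the real method signature in the PR may differ.

class PlatformBase:
    """Stand-in for the real base class, only to keep this sketch self-contained."""
    def create_distributed_submission_job(self, job_prop=None, ctxt=None):
        raise NotImplementedError


class EMRPlatform(PlatformBase):
    """Illustrative CSP class: distributed mode is not supported here yet."""
    def create_distributed_submission_job(self, job_prop=None, ctxt=None):
        # Mirrors the existing pattern of per-CSP no-op overrides (body as pass).
        pass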

"""Configuration class for distributed tools"""
hdfs_output_dir: str = Field(
description='HDFS output directory where the output data from the distributed '
'tools will be stored.',
Collaborator


Along the same point of being generic:

description='Output directory where the output data from the distributed '
                    'tools will be stored. Currently, it supports only HDFS.'

Collaborator Author


I should clarify this in the description. This is the intermediate output directory where each map task writes its output. This directory will always be in HDFS (even in the case of CSPs).

Collaborator Author


Renamed to remote_cache_dir and updated the description


class DistributedToolsConfig(BaseModel):
"""Configuration class for distributed tools"""
hdfs_output_dir: str = Field(
Collaborator


If we name it remote_output_dir, or output_dir, then it will be better for us when we enable other CSPs. We won't have to change the basic fields in the config to do so.

Collaborator Author


Similar to the other comment: I should clarify this in the description. This is the intermediate output directory where each map task writes its output. This directory will always be in HDFS (even in the case of CSPs).

@parthosa parthosa marked this pull request as draft December 13, 2024 19:28
@parthosa
Collaborator Author

Converted to draft to address review comments

@@ -28,8 +28,7 @@
"value": "a65839fbf1869f81a1632e09f415e586922e4f80"
},
"size": 962685
},
Collaborator Author


The "type": "jar" property did not conform to the RuntimeDependency type specification. Previously, we allowed extra keys to be included, which resulted in properties like type passing validation incorrectly.

@parthosa parthosa marked this pull request as ready for review December 17, 2024 05:02
Collaborator

@amahussein amahussein left a comment


Thanks @parthosa!
LGTM

Collaborator

@cindyyuanjiang cindyyuanjiang left a comment


thanks @parthosa! a few minor questions and nits

Process the value provided by `--submission_mode` argument.
"""
submission_mode_arg = self.wrapper_options.get('submissionMode')
if submission_mode_arg is None or not submission_mode_arg:
Collaborator


QQ: what does the condition not submission_mode_arg catch? Like an empty string?

Collaborator Author


Yes. It could be anything that Python considers falsy (e.g. an empty string).
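
For illustration, since falsy values are already rejected, the explicit is None check is redundant and the guard could be collapsed to a single truthiness test; this is a sketch with an assumed default mode, not the PR's actual code.

def resolve_submission_mode(submission_mode_arg):
    """Sketch: None, '' and other falsy inputs all fall back to the default mode."""
    if not submission_mode_arg:              # also covers the `is None` case
        return 'local'                       # assumed default, for illustration only
    return str(submission_mode_arg).lower()


assert resolve_submission_mode(None) == 'local'
assert resolve_submission_mode('') == 'local'
assert resolve_submission_mode('DISTRIBUTED') == 'distributed'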

@@ -185,7 +192,7 @@ def _process_custom_args(self) -> None:
self._process_estimation_model_args()
self._process_offline_cluster_args()
self._process_eventlogs_args()
self._process_distributed_tools_args()
self._process_submission_mode_arg()
# This is noise to dump everything
# self.logger.debug('%s custom arguments = %s', self.pretty_name(), self.ctxt.props['wrapperCtx'])
Collaborator


Not related to this PR, but interesting to see we have unused code here.

@@ -49,6 +49,7 @@ def gen_cpu_cluster_props():
autotuner_prop_path = 'worker_info.yaml'
# valid tools config files
valid_tools_conf_files = ['tools_config_00.yaml']
valid_distributed_mode_tools_conf_files = ['tools_config_01.yaml', 'tools_config_02.yaml']
Collaborator


nit: are we planning to give these yaml files more meaningful names later?

Collaborator Author


Yes @cindyyuanjiang. Going forward, we can rename these config files to be more meaningful.

@parthosa parthosa merged commit 6c61e52 into NVIDIA:spark-rapids-tools-distributed-base Dec 20, 2024
14 checks passed
@parthosa parthosa deleted the spark-rapids-tools-distributed-args-v2 branch December 20, 2024 01:06
@parthosa parthosa linked an issue Dec 21, 2024 that may be closed by this pull request