Add diagnostic capabilities for Databricks (AWS/Azure) environments (#533)

* Add support for Diagnostic tool in Databricks AWS

Signed-off-by: Partho Sarthi <[email protected]>

* Fix bugs in `collect.sh` for EMR

Signed-off-by: Partho Sarthi <[email protected]>

* Add changes for Databricks-Azure and fix linting

Signed-off-by: Partho Sarthi <[email protected]>

* Add ssh methods for DB-Azure and fix DB-AWS changes

Signed-off-by: Partho Sarthi <[email protected]>

* Add mock cluster info for unit testing

Signed-off-by: Partho Sarthi <[email protected]>

* Add mock methods for Databricks Azure tests

Signed-off-by: Partho Sarthi <[email protected]>

* Update docs

Signed-off-by: Partho Sarthi <[email protected]>

* Add new line at end of json

Signed-off-by: Partho Sarthi <[email protected]>

* Fix databricks jars directory

Signed-off-by: Partho Sarthi <[email protected]>

* Handle sudo and refactor ssh method

Signed-off-by: Partho Sarthi <[email protected]>

* Fix errors due to dev merge

Signed-off-by: Partho Sarthi <[email protected]>

---------

Signed-off-by: Partho Sarthi <[email protected]>
parthosa authored Sep 21, 2023
1 parent c479f27 commit fd0541e
Showing 16 changed files with 502 additions and 59 deletions.
42 changes: 42 additions & 0 deletions user_tools/docs/user-tools-databricks-aws.md
@@ -307,3 +307,45 @@ The CLI is triggered by providing the location where the yaml file is stored `--
--worker_info $WORKER_INFO_PATH \
--remote_folder $REMOTE_FOLDER
```
## Diagnostic command
```
spark_rapids_user_tools databricks-aws diagnostic [options]
spark_rapids_user_tools databricks-aws diagnostic --help
```
Run the diagnostic command to collect information from a Databricks cluster, such as the OS version, number of
worker nodes, YARN configuration, Spark version, and error logs. The cluster has to be running and the
user must have SSH access.
### Diagnostic options
| Option            | Description                                                                                                | Default                                                                                      | Required |
|-------------------|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|:--------:|
| **cluster**       | Name of the Databricks cluster running an accelerated computing instance                                  | N/A                                                                                          | Y |
| **profile**       | A named Databricks profile that you can specify to get the settings/credentials of the Databricks account | "DEFAULT"                                                                                    | N |
| **aws_profile**   | A named AWS profile that you can specify to get the settings/credentials of the AWS account               | "default" if the env-variable `AWS_PROFILE` is not set                                       | N |
| **output_folder** | Path to local directory where the final output archive is stored                                           | env variable `RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY` if any; or the current working directory   | N |
| **port**          | Port number to be used for the SSH connections                                                             | 2200                                                                                         | N |
| **key_file**      | Path to the private key file to be used for the SSH connections                                            | Default SSH key based on the OS                                                              | N |
| **thread_num**    | Number of threads to access remote cluster nodes in parallel                                               | 3                                                                                            | N |
| **yes**           | Auto-confirm interactive questions                                                                          | False                                                                                        | N |
| **verbose**       | True or False to enable verbosity of the wrapper script                                                     | False if `RAPIDS_USER_TOOLS_LOG_DEBUG` is not set                                            | N |
### Info collection
By default, information is collected from each cluster node over SSH, and an archive is created in the
output folder at the end.
The steps to run the command:
1. The user creates a cluster
2. The user runs the following command:
```bash
spark_rapids_user_tools databricks-aws diagnostic \
--cluster my-cluster-name
```
If an SSH connection to the Databricks instances cannot be established, the command will raise an error.
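As a further illustration, a run that overrides some of the optional settings from the table above might look like the sketch below; the profile names and key path are placeholder values, not defaults produced by the tool:
```bash
spark_rapids_user_tools databricks-aws diagnostic \
  --cluster my-cluster-name \
  --profile my-databricks-profile \
  --aws_profile my-aws-profile \
  --key_file ~/.ssh/my_cluster_key \
  --thread_num 4
```
At the end of the run, look for the generated archive in the output folder described above.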
42 changes: 42 additions & 0 deletions user_tools/docs/user-tools-databricks-azure.md
@@ -308,3 +308,45 @@ The CLI is triggered by providing the location where the yaml file is stored `--
--worker_info $WORKER_INFO_PATH \
--remote_folder $REMOTE_FOLDER
```
## Diagnostic command
```
spark_rapids_user_tools databricks-azure diagnostic [options]
spark_rapids_user_tools databricks-azure diagnostic --help
```
Run the diagnostic command to collect information from a Databricks cluster, such as the OS version, number of
worker nodes, YARN configuration, Spark version, and error logs. The cluster has to be running and the
user must have SSH access.
### Diagnostic options
| Option            | Description                                                                                                | Default                                                                                      | Required |
|-------------------|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|:--------:|
| **cluster**       | Name of the Databricks cluster running an accelerated computing instance                                  | N/A                                                                                          | Y |
| **profile**       | A named Databricks profile that you can specify to get the settings/credentials of the Databricks account | "DEFAULT"                                                                                    | N |
| **output_folder** | Path to local directory where the final output archive is stored                                           | env variable `RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY` if any; or the current working directory   | N |
| **port**          | Port number to be used for the SSH connections                                                             | 2200                                                                                         | N |
| **key_file**      | Path to the private key file to be used for the SSH connections                                            | Default SSH key based on the OS                                                              | N |
| **thread_num**    | Number of threads to access remote cluster nodes in parallel                                               | 3                                                                                            | N |
| **yes**           | Auto-confirm interactive questions                                                                          | False                                                                                        | N |
| **verbose**       | True or False to enable verbosity of the wrapper script                                                     | False if `RAPIDS_USER_TOOLS_LOG_DEBUG` is not set                                            | N |
### Info collection
By default, information is collected from each cluster node over SSH, and an archive is created in the
output folder at the end.
The steps to run the command:
1. The user creates a cluster
2. The user runs the following command:
```bash
spark_rapids_user_tools databricks-azure diagnostic \
--cluster my-cluster-name
```
If an SSH connection to the Databricks instances cannot be established, the command will raise an error.
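As a further illustration, a run that adjusts the SSH and output settings could look roughly like this; the key path and output path are placeholder values:
```bash
spark_rapids_user_tools databricks-azure diagnostic \
  --cluster my-cluster-name \
  --key_file ~/.ssh/my_cluster_key \
  --port 2200 \
  --output_folder ./diag_output
```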
33 changes: 33 additions & 0 deletions user_tools/src/spark_rapids_pytools/cloud_api/databricks_aws.py
@@ -26,6 +26,7 @@
    ClusterGetAccessor
from spark_rapids_pytools.cloud_api.sp_types import ClusterState, SparkNodeType
from spark_rapids_pytools.common.prop_manager import JSONPropertiesContainer
from spark_rapids_pytools.common.utilities import Utils
from spark_rapids_pytools.pricing.databricks_pricing import DatabricksPriceProvider
from spark_rapids_pytools.pricing.price_provider import SavingsEstimator

@@ -121,6 +122,38 @@ def pull_cluster_props_by_args(self, args: dict) -> str:
            return json.dumps(raw_prop_container.props)
        return cluster_described

    def _build_cmd_ssh_prefix_for_node(self, node: ClusterNode) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['ssh',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-p {port}',
                       f'ubuntu@{node.name}']
        return Utils.gen_joined_str(' ', prefix_args)

    def _build_cmd_scp_to_node(self, node: ClusterNode, src: str, dest: str) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['scp',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-P {port}',
                       src,
                       f'ubuntu@{node.name}:{dest}']
        return Utils.gen_joined_str(' ', prefix_args)

    def _build_cmd_scp_from_node(self, node: ClusterNode, src: str, dest: str) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['scp',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-P {port}',
                       f'ubuntu@{node.name}:{src}',
                       dest]
        return Utils.gen_joined_str(' ', prefix_args)

    def _build_platform_describe_node_instance(self, node: ClusterNode) -> list:
        cmd_params = ['aws ec2 describe-instance-types',
                      '--region', f'{self.get_region()}',
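For illustration only: with the documented default port and a hypothetical key file, the SSH/SCP helpers added above would generate command prefixes roughly like the ones below; the driver address and local script path are placeholders, not values produced by this code.
```bash
# SSH prefix (a remote command gets appended to it)
ssh -o StrictHostKeyChecking=no -i ~/.ssh/db_cluster_key -p 2200 ubuntu@<driver-address>
# SCP to a node, e.g. uploading the collection script
scp -o StrictHostKeyChecking=no -i ~/.ssh/db_cluster_key -P 2200 ./collect.sh ubuntu@<driver-address>:/tmp/collect.sh
```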
32 changes: 32 additions & 0 deletions user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py
@@ -125,6 +125,38 @@ def pull_cluster_props_by_args(self, args: dict) -> str:
            self.logger.error('Invalid arguments to pull the cluster properties')
        return self.run_sys_cmd(get_cluster_cmd)

    def _build_cmd_ssh_prefix_for_node(self, node: ClusterNode) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['ssh',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-p {port}',
                       f'ubuntu@{node.name}']
        return Utils.gen_joined_str(' ', prefix_args)

    def _build_cmd_scp_to_node(self, node: ClusterNode, src: str, dest: str) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['scp',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-P {port}',
                       src,
                       f'ubuntu@{node.name}:{dest}']
        return Utils.gen_joined_str(' ', prefix_args)

    def _build_cmd_scp_from_node(self, node: ClusterNode, src: str, dest: str) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['scp',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-P {port}',
                       f'ubuntu@{node.name}:{src}',
                       dest]
        return Utils.gen_joined_str(' ', prefix_args)

    def process_instances_description(self, raw_instances_description: str) -> dict:
        processed_instances_description = {}
        instances_description = JSONPropertiesContainer(prop_arg=raw_instances_description, file_load=False)
2 changes: 1 addition & 1 deletion user_tools/src/spark_rapids_pytools/cloud_api/dataproc.py
@@ -213,7 +213,7 @@ def exec_platform_describe_accelerator(self,
                    self.get_env_var('zone')]
        return self.run_sys_cmd(cmd_args)

    def _build_ssh_cmd_prefix_for_node(self, node: ClusterNode) -> str:
    def _build_cmd_ssh_prefix_for_node(self, node: ClusterNode) -> str:
        pref_args = ['gcloud',
                     'compute', 'ssh',
                     node.name,
2 changes: 1 addition & 1 deletion user_tools/src/spark_rapids_pytools/cloud_api/emr.py
@@ -158,7 +158,7 @@ def pull_cluster_props_by_args(self, args: dict) -> str:
            return EMRPlatform.process_raw_cluster_prop(raw_prop_container)
        return cluster_described

    def _build_ssh_cmd_prefix_for_node(self, node: ClusterNode) -> str:
    def _build_cmd_ssh_prefix_for_node(self, node: ClusterNode) -> str:
        # get the pem file
        pem_file_path = self.env_vars.get('keyPairPath')
        prefix_args = ['ssh',
4 changes: 2 additions & 2 deletions user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py
@@ -463,7 +463,7 @@ def append_to_cmd(original_cmd, extra_args: list) -> Any:
        sys_cmd = SysCmd().build(cmd_args)
        return sys_cmd.exec()

    def _build_ssh_cmd_prefix_for_node(self, node: ClusterNode) -> str:
    def _build_cmd_ssh_prefix_for_node(self, node: ClusterNode) -> str:
        del node  # Unused by super method.
        return ''

@@ -479,7 +479,7 @@ def _construct_ssh_cmd_with_prefix(self, prefix: str, remote_cmd: str) -> str:
        return f'{prefix} {remote_cmd}'

    def ssh_cmd_node(self, node: ClusterNode, ssh_cmd: str, cmd_input: str = None) -> str:
        prefix_cmd = self._build_ssh_cmd_prefix_for_node(node=node)
        prefix_cmd = self._build_cmd_ssh_prefix_for_node(node=node)
        full_ssh_cmd = self._construct_ssh_cmd_with_prefix(prefix=prefix_cmd, remote_cmd=ssh_cmd)
        return self.run_sys_cmd(full_ssh_cmd, cmd_input=cmd_input)

2 changes: 1 addition & 1 deletion user_tools/src/spark_rapids_pytools/rapids/diagnostic.py
Original file line number Diff line number Diff line change
@@ -91,7 +91,7 @@ def _collect_info(self, node):
        self._upload_scripts(node)

        remote_output_folder = self.ctxt.get_remote('outputFolder')
        ssh_cmd = f'"PREFIX={remote_output_folder} /tmp/collect.sh"'
        ssh_cmd = f'"PREFIX={remote_output_folder} PLATFORM_TYPE={self.platform_type} /tmp/collect.sh"'

        try:
            self.logger.info('Collecting info on node: %s', node.get_name())
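For context, the remote invocation assembled here expands on a cluster node to something like the sketch below; the folder path is a placeholder and the exact `PLATFORM_TYPE` string is an assumption (the `collect.sh` changes below only check whether it contains "databricks"):
```bash
"PREFIX=/tmp/<remote-output-folder> PLATFORM_TYPE=databricks_aws /tmp/collect.sh"
```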
72 changes: 58 additions & 14 deletions user_tools/src/spark_rapids_pytools/resources/collect.sh
@@ -30,6 +30,13 @@ mkdir -p $TEMP_PATH
NODE_ID=`hostname`
OUTPUT_NODE_INFO="$TEMP_PATH/$HOSTNAME.info"

if [[ "$PLATFORM_TYPE" == *"databricks"* ]]; then
DATABRICKS_HOME='/databricks'
SPARK_HOME="$DATABRICKS_HOME/spark"
else
SPARK_HOME='/usr/lib/spark'
fi

echo "[OS version]" >> $OUTPUT_NODE_INFO
cat /etc/os-release >> $OUTPUT_NODE_INFO

@@ -48,11 +55,27 @@ free -h >> $OUTPUT_NODE_INFO

echo "" >> $OUTPUT_NODE_INFO
echo "[Network adapter]" >> $OUTPUT_NODE_INFO

if command -v lshw ; then
    lshw -C network >> $OUTPUT_NODE_INFO
else
    # Downgrade to 'lspci' on EMR as it's not installed by default
    /usr/sbin/lspci | grep 'Ethernet controller' >> $OUTPUT_NODE_INFO
    # Downgrade to 'lspci'
    if [[ "$PLATFORM_TYPE" == *"databricks"* ]]; then
        if command -v sudo ; then
            sudo apt install -y pciutils
        fi
        if command -v lspci ; then
            lspci | { grep 'Ethernet controller' || true; } >> $OUTPUT_NODE_INFO
        else
            echo 'not found' >> $OUTPUT_NODE_INFO
        fi
    elif [ "$PLATFORM_TYPE" == "emr" ]; then
        if command -v /usr/sbin/lspci ; then
            /usr/sbin/lspci | { grep 'Ethernet controller' || true; } >> $OUTPUT_NODE_INFO
        else
            echo 'not found' >> $OUTPUT_NODE_INFO
        fi
    fi
fi

echo "" >> $OUTPUT_NODE_INFO
@@ -64,8 +87,12 @@ echo "[GPU adapter]" >> $OUTPUT_NODE_INFO
if command -v lshw ; then
    lshw -C display >> $OUTPUT_NODE_INFO
else
    # Downgrade to 'lspci' on EMR as it's not installed by default
    /usr/sbin/lspci | { grep '3D controller' || true; } >> $OUTPUT_NODE_INFO
    # Downgrade to 'lspci'
    if [[ "$PLATFORM_TYPE" == *"databricks"* ]]; then
        lspci | { grep '3D controller' || true; } >> $OUTPUT_NODE_INFO
    elif [ "$PLATFORM_TYPE" == "emr" ]; then
        /usr/sbin/lspci | { grep '3D controller' || true; } >> $OUTPUT_NODE_INFO
    fi
fi

echo "" >> $OUTPUT_NODE_INFO
@@ -82,19 +109,31 @@ java -version 2>> $OUTPUT_NODE_INFO

echo "" >> $OUTPUT_NODE_INFO
echo "[Spark version]" >> $OUTPUT_NODE_INFO
spark-submit --version 2>> $OUTPUT_NODE_INFO

if [[ "$PLATFORM_TYPE" == *"databricks"* ]]; then
if [ -f $SPARK_HOME/VERSION ]; then
echo "$(cat "$SPARK_HOME/VERSION")" >> $OUTPUT_NODE_INFO
else
echo 'not found' >> $OUTPUT_NODE_INFO
fi
else
if command -v $SPARK_HOME/bin/pyspark ; then
$SPARK_HOME/bin/pyspark --version 2>&1|grep -v Scala|awk '/version\ [0-9.]+/{print $NF}' >> $OUTPUT_NODE_INFO
else
echo 'not found' >> $OUTPUT_NODE_INFO
fi
fi

echo "" >> $OUTPUT_NODE_INFO
echo "[Spark rapids plugin]" >> $OUTPUT_NODE_INFO

if [ -z "$SPARK_HOME" ]; then
# Source spark env variables if not set $SPARK_HOME, e.g. on EMR node
source /etc/spark/conf/spark-env.sh
fi

if [ -f $SPARK_HOME/jars/rapids-4-spark*.jar ]; then
ls -l $SPARK_HOME/jars/rapids-4-spark*.jar >> $OUTPUT_NODE_INFO
elif [ -f /usr/lib/spark/jars/rapids-4-spark_n-0.jar ]; then
if [[ "$PLATFORM_TYPE" == *"databricks"* ]]; then
if [ -f $DATABRICKS_HOME/jars/rapids-4-spark*.jar ]; then
ls -l $DATABRICKS_HOME/jars/rapids-4-spark*.jar >> $OUTPUT_NODE_INFO
else
echo 'not found' >> $OUTPUT_NODE_INFO
fi
elif [ -f $SPARK_HOME/jars/rapids-4-spark*.jar ]; then
ls -l $SPARK_HOME/jars/rapids-4-spark*.jar >> $OUTPUT_NODE_INFO
else
echo 'not found' >> $OUTPUT_NODE_INFO
@@ -105,7 +144,12 @@ echo "[CUDA version]" >> $OUTPUT_NODE_INFO
if [ -f /usr/local/cuda/version.json ]; then
    cat /usr/local/cuda/version.json | grep '\<cuda\>' -A 2 | grep version >> $OUTPUT_NODE_INFO
else
    echo 'not found' >> $OUTPUT_NODE_INFO
    # Fetch CUDA version from nvidia-smi
    if command -v nvidia-smi ; then
        nvidia-smi --query | grep "CUDA Version" | awk '{print $4}' >> $OUTPUT_NODE_INFO
    else
        echo 'not found' >> $OUTPUT_NODE_INFO
    fi
fi

# Copy config files
@@ -31,7 +31,7 @@
  },
  "environment": {
    "//description": "Define the metadata related to the system, prerequisites, and configurations",
    "envParams": ["profile", "awsProfile", "deployMode"],
    "envParams": ["profile", "awsProfile", "deployMode", "sshPort", "sshKeyFile"],
    "//initialConfigList": "represents the list of the configurations that need to be loaded first",
    "initialConfigList": ["profile", "awsProfile", "cliConfigFile", "awsCliConfigFile", "awsCredentialFile"],
    "//loadedConfigProps": "list of properties read by the configParser",
@@ -23,7 +23,7 @@
  },
  "environment": {
    "//description": "Define the metadata related to the system, prerequisites, and configurations",
    "envParams": ["profile", "deployMode", "azureRegionSection", "azureStorageSection"],
    "envParams": ["profile", "deployMode", "azureRegionSection", "azureStorageSection", "sshPort", "sshKeyFile"],
    "//initialConfigList": "represents the list of the configurations that need to be loaded first",
    "initialConfigList": ["profile", "cliConfigFile", "azureCliConfigFile", "region", "azureRegionSection", "azureStorageSection"],
    "//loadedConfigProps": "list of properties read by the configParser",
Expand Down
