Add diagnostic capabilities for Databricks (AWS/Azure) environments (#533)

* Add support for Diagnostic tool in Databricks AWS

Signed-off-by: Partho Sarthi <[email protected]>

* Fix bugs in `collect.sh` for EMR

Signed-off-by: Partho Sarthi <[email protected]>

* Add changes for Databricks-Azure and fix linting

Signed-off-by: Partho Sarthi <[email protected]>

* Add ssh methods for DB-Azure and fix DB-AWS changes

Signed-off-by: Partho Sarthi <[email protected]>

* Add mock cluster info for unit testing

Signed-off-by: Partho Sarthi <[email protected]>

* Add mock methods for Databricks Azure tests

Signed-off-by: Partho Sarthi <[email protected]>

* Update docs

Signed-off-by: Partho Sarthi <[email protected]>

* Add new line at end of json

Signed-off-by: Partho Sarthi <[email protected]>

* Fix databricks jars directory

Signed-off-by: Partho Sarthi <[email protected]>

* Handle sudo and refactor ssh method

Signed-off-by: Partho Sarthi <[email protected]>

* Fix errors due to dev merge

Signed-off-by: Partho Sarthi <[email protected]>

---------

Signed-off-by: Partho Sarthi <[email protected]>
parthosa authored Sep 21, 2023
1 parent c479f27 commit fd0541e
Showing 16 changed files with 502 additions and 59 deletions.
42 changes: 42 additions & 0 deletions user_tools/docs/user-tools-databricks-aws.md
@@ -307,3 +307,45 @@ The CLI is triggered by providing the location where the yaml file is stored `--
--worker_info $WORKER_INFO_PATH \
--remote_folder $REMOTE_FOLDER
```
## Diagnostic command
```
spark_rapids_user_tools databricks-aws diagnostic [options]
spark_rapids_user_tools databricks-aws diagnostic --help
```
Run the diagnostic command to collect information from a Databricks cluster, such as the OS version, number of
worker nodes, YARN configuration, Spark version, and error logs. The cluster has to be running and the
user must have SSH access.
### Diagnostic options
| Option            | Description                                                                                                | Default                                                                                      | Required |
|-------------------|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|:--------:|
| **cluster**       | Name of the Databricks cluster running an accelerated computing instance                                  | N/A                                                                                          | Y |
| **profile**       | A named Databricks profile that you can specify to get the settings/credentials of the Databricks account | "DEFAULT"                                                                                    | N |
| **aws_profile**   | A named AWS profile that you can specify to get the settings/credentials of the AWS account               | "default" if the env-variable `AWS_PROFILE` is not set                                       | N |
| **output_folder** | Path to local directory where the final output archive is stored                                           | env variable `RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY` if any; or the current working directory   | N |
| **port**          | Port number to be used for the SSH connections                                                             | 2200                                                                                         | N |
| **key_file**      | Path to the private key file to be used for the SSH connections                                            | Default SSH key based on the OS                                                              | N |
| **thread_num**    | Number of threads to access remote cluster nodes in parallel                                               | 3                                                                                            | N |
| **yes**           | Auto-confirm interactive questions                                                                          | False                                                                                        | N |
| **verbose**       | True or False to enable verbosity of the wrapper script                                                     | False if `RAPIDS_USER_TOOLS_LOG_DEBUG` is not set                                            | N |
### Info collection
By default, information is collected from each cluster node over SSH, and an archive is created in the
output folder at the end.
The steps to run the command:
1. The user creates a cluster
2. The user runs the following command:
```bash
spark_rapids_user_tools databricks-aws diagnostic \
--cluster my-cluster-name
```
If an SSH connection to the Databricks instances cannot be established, the command will raise an error.
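As a further illustration, a run that overrides some of the optional settings from the table above might look like the sketch below; the profile names and key path are placeholder values, not defaults produced by the tool:
```bash
spark_rapids_user_tools databricks-aws diagnostic \
  --cluster my-cluster-name \
  --profile my-databricks-profile \
  --aws_profile my-aws-profile \
  --key_file ~/.ssh/my_cluster_key \
  --thread_num 4
```
At the end of the run, look for the generated archive in the output folder described above.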
42 changes: 42 additions & 0 deletions user_tools/docs/user-tools-databricks-azure.md
@@ -308,3 +308,45 @@ The CLI is triggered by providing the location where the yaml file is stored `--
--worker_info $WORKER_INFO_PATH \
--remote_folder $REMOTE_FOLDER
```
## Diagnostic command
```
spark_rapids_user_tools databricks-azure diagnostic [options]
spark_rapids_user_tools databricks-azure diagnostic --help
```
Run the diagnostic command to collect information from a Databricks cluster, such as the OS version, number of
worker nodes, YARN configuration, Spark version, and error logs. The cluster has to be running and the
user must have SSH access.
### Diagnostic options
| Option            | Description                                                                                                | Default                                                                                      | Required |
|-------------------|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|:--------:|
| **cluster**       | Name of the Databricks cluster running an accelerated computing instance                                  | N/A                                                                                          | Y |
| **profile**       | A named Databricks profile that you can specify to get the settings/credentials of the Databricks account | "DEFAULT"                                                                                    | N |
| **output_folder** | Path to local directory where the final output archive is stored                                           | env variable `RAPIDS_USER_TOOLS_OUTPUT_DIRECTORY` if any; or the current working directory   | N |
| **port**          | Port number to be used for the SSH connections                                                             | 2200                                                                                         | N |
| **key_file**      | Path to the private key file to be used for the SSH connections                                            | Default SSH key based on the OS                                                              | N |
| **thread_num**    | Number of threads to access remote cluster nodes in parallel                                               | 3                                                                                            | N |
| **yes**           | Auto-confirm interactive questions                                                                          | False                                                                                        | N |
| **verbose**       | True or False to enable verbosity of the wrapper script                                                     | False if `RAPIDS_USER_TOOLS_LOG_DEBUG` is not set                                            | N |
### Info collection
By default, information is collected from each cluster node over SSH, and an archive is created in the
output folder at the end.
The steps to run the command:
1. The user creates a cluster
2. The user runs the following command:
```bash
spark_rapids_user_tools databricks-azure diagnostic \
--cluster my-cluster-name
```
If an SSH connection to the Databricks instances cannot be established, the command will raise an error.
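As a further illustration, a run that adjusts the SSH and output settings could look roughly like this; the key path and output path are placeholder values:
```bash
spark_rapids_user_tools databricks-azure diagnostic \
  --cluster my-cluster-name \
  --key_file ~/.ssh/my_cluster_key \
  --port 2200 \
  --output_folder ./diag_output
```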
33 changes: 33 additions & 0 deletions user_tools/src/spark_rapids_pytools/cloud_api/databricks_aws.py
@@ -26,6 +26,7 @@
    ClusterGetAccessor
from spark_rapids_pytools.cloud_api.sp_types import ClusterState, SparkNodeType
from spark_rapids_pytools.common.prop_manager import JSONPropertiesContainer
from spark_rapids_pytools.common.utilities import Utils
from spark_rapids_pytools.pricing.databricks_pricing import DatabricksPriceProvider
from spark_rapids_pytools.pricing.price_provider import SavingsEstimator

@@ -121,6 +122,38 @@ def pull_cluster_props_by_args(self, args: dict) -> str:
            return json.dumps(raw_prop_container.props)
        return cluster_described

    def _build_cmd_ssh_prefix_for_node(self, node: ClusterNode) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['ssh',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-p {port}',
                       f'ubuntu@{node.name}']
        return Utils.gen_joined_str(' ', prefix_args)

    def _build_cmd_scp_to_node(self, node: ClusterNode, src: str, dest: str) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['scp',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-P {port}',
                       src,
                       f'ubuntu@{node.name}:{dest}']
        return Utils.gen_joined_str(' ', prefix_args)

    def _build_cmd_scp_from_node(self, node: ClusterNode, src: str, dest: str) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['scp',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-P {port}',
                       f'ubuntu@{node.name}:{src}',
                       dest]
        return Utils.gen_joined_str(' ', prefix_args)

    def _build_platform_describe_node_instance(self, node: ClusterNode) -> list:
        cmd_params = ['aws ec2 describe-instance-types',
                      '--region', f'{self.get_region()}',
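For illustration only: with the documented default port and a hypothetical key file, the SSH/SCP helpers added above would generate command prefixes roughly like the ones below; the driver address and local script path are placeholders, not values produced by this code.
```bash
# SSH prefix (a remote command gets appended to it)
ssh -o StrictHostKeyChecking=no -i ~/.ssh/db_cluster_key -p 2200 ubuntu@<driver-address>
# SCP to a node, e.g. uploading the collection script
scp -o StrictHostKeyChecking=no -i ~/.ssh/db_cluster_key -P 2200 ./collect.sh ubuntu@<driver-address>:/tmp/collect.sh
```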
32 changes: 32 additions & 0 deletions user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py
@@ -125,6 +125,38 @@ def pull_cluster_props_by_args(self, args: dict) -> str:
            self.logger.error('Invalid arguments to pull the cluster properties')
        return self.run_sys_cmd(get_cluster_cmd)

    def _build_cmd_ssh_prefix_for_node(self, node: ClusterNode) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['ssh',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-p {port}',
                       f'ubuntu@{node.name}']
        return Utils.gen_joined_str(' ', prefix_args)

    def _build_cmd_scp_to_node(self, node: ClusterNode, src: str, dest: str) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['scp',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-P {port}',
                       src,
                       f'ubuntu@{node.name}:{dest}']
        return Utils.gen_joined_str(' ', prefix_args)

    def _build_cmd_scp_from_node(self, node: ClusterNode, src: str, dest: str) -> str:
        port = self.env_vars.get('sshPort')
        key_file = self.env_vars.get('sshKeyFile')
        prefix_args = ['scp',
                       '-o StrictHostKeyChecking=no',
                       f'-i {key_file} ' if key_file else '',
                       f'-P {port}',
                       f'ubuntu@{node.name}:{src}',
                       dest]
        return Utils.gen_joined_str(' ', prefix_args)

    def process_instances_description(self, raw_instances_description: str) -> dict:
        processed_instances_description = {}
        instances_description = JSONPropertiesContainer(prop_arg=raw_instances_description, file_load=False)
2 changes: 1 addition & 1 deletion user_tools/src/spark_rapids_pytools/cloud_api/dataproc.py
@@ -213,7 +213,7 @@ def exec_platform_describe_accelerator(self,
                    self.get_env_var('zone')]
        return self.run_sys_cmd(cmd_args)

    def _build_ssh_cmd_prefix_for_node(self, node: ClusterNode) -> str:
    def _build_cmd_ssh_prefix_for_node(self, node: ClusterNode) -> str:
        pref_args = ['gcloud',
                     'compute', 'ssh',
                     node.name,
2 changes: 1 addition & 1 deletion user_tools/src/spark_rapids_pytools/cloud_api/emr.py
@@ -158,7 +158,7 @@ def pull_cluster_props_by_args(self, args: dict) -> str:
            return EMRPlatform.process_raw_cluster_prop(raw_prop_container)
        return cluster_described

    def _build_ssh_cmd_prefix_for_node(self, node: ClusterNode) -> str:
    def _build_cmd_ssh_prefix_for_node(self, node: ClusterNode) -> str:
        # get the pem file
        pem_file_path = self.env_vars.get('keyPairPath')
        prefix_args = ['ssh',
4 changes: 2 additions & 2 deletions user_tools/src/spark_rapids_pytools/cloud_api/sp_types.py
@@ -463,7 +463,7 @@ def append_to_cmd(original_cmd, extra_args: list) -> Any:
        sys_cmd = SysCmd().build(cmd_args)
        return sys_cmd.exec()

    def _build_ssh_cmd_prefix_for_node(self, node: ClusterNode) -> str:
    def _build_cmd_ssh_prefix_for_node(self, node: ClusterNode) -> str:
        del node  # Unused by super method.
        return ''

@@ -479,7 +479,7 @@ def _construct_ssh_cmd_with_prefix(self, prefix: str, remote_cmd: str) -> str:
        return f'{prefix} {remote_cmd}'

    def ssh_cmd_node(self, node: ClusterNode, ssh_cmd: str, cmd_input: str = None) -> str:
        prefix_cmd = self._build_ssh_cmd_prefix_for_node(node=node)
        prefix_cmd = self._build_cmd_ssh_prefix_for_node(node=node)
        full_ssh_cmd = self._construct_ssh_cmd_with_prefix(prefix=prefix_cmd, remote_cmd=ssh_cmd)
        return self.run_sys_cmd(full_ssh_cmd, cmd_input=cmd_input)

2 changes: 1 addition & 1 deletion user_tools/src/spark_rapids_pytools/rapids/diagnostic.py
Original file line number Diff line number Diff line change
@@ -91,7 +91,7 @@ def _collect_info(self, node):
        self._upload_scripts(node)

        remote_output_folder = self.ctxt.get_remote('outputFolder')
        ssh_cmd = f'"PREFIX={remote_output_folder} /tmp/collect.sh"'
        ssh_cmd = f'"PREFIX={remote_output_folder} PLATFORM_TYPE={self.platform_type} /tmp/collect.sh"'

        try:
            self.logger.info('Collecting info on node: %s', node.get_name())
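For context, the remote invocation assembled here expands on a cluster node to something like the sketch below; the folder path is a placeholder and the exact `PLATFORM_TYPE` string is an assumption (the `collect.sh` changes below only check whether it contains "databricks"):
```bash
"PREFIX=/tmp/<remote-output-folder> PLATFORM_TYPE=databricks_aws /tmp/collect.sh"
```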
72 changes: 58 additions & 14 deletions user_tools/src/spark_rapids_pytools/resources/collect.sh
@@ -30,6 +30,13 @@ mkdir -p $TEMP_PATH
NODE_ID=`hostname`
OUTPUT_NODE_INFO="$TEMP_PATH/$HOSTNAME.info"

if [[ "$PLATFORM_TYPE" == *"databricks"* ]]; then
DATABRICKS_HOME='/databricks'
SPARK_HOME="$DATABRICKS_HOME/spark"
else
SPARK_HOME='/usr/lib/spark'
fi

echo "[OS version]" >> $OUTPUT_NODE_INFO
cat /etc/os-release >> $OUTPUT_NODE_INFO

@@ -48,11 +55,27 @@ free -h >> $OUTPUT_NODE_INFO

echo "" >> $OUTPUT_NODE_INFO
echo "[Network adapter]" >> $OUTPUT_NODE_INFO

if command -v lshw ; then
    lshw -C network >> $OUTPUT_NODE_INFO
else
    # Downgrade to 'lspci' on EMR as it's not installed by default
    /usr/sbin/lspci | grep 'Ethernet controller' >> $OUTPUT_NODE_INFO
    # Downgrade to 'lspci'
    if [[ "$PLATFORM_TYPE" == *"databricks"* ]]; then
        if command -v sudo ; then
            sudo apt install -y pciutils
        fi
        if command -v lspci ; then
            lspci | { grep 'Ethernet controller' || true; } >> $OUTPUT_NODE_INFO
        else
            echo 'not found' >> $OUTPUT_NODE_INFO
        fi
    elif [ "$PLATFORM_TYPE" == "emr" ]; then
        if command -v /usr/sbin/lspci ; then
            /usr/sbin/lspci | { grep 'Ethernet controller' || true; } >> $OUTPUT_NODE_INFO
        else
            echo 'not found' >> $OUTPUT_NODE_INFO
        fi
    fi
fi

echo "" >> $OUTPUT_NODE_INFO
@@ -64,8 +87,12 @@ echo "[GPU adapter]" >> $OUTPUT_NODE_INFO
if command -v lshw ; then
    lshw -C display >> $OUTPUT_NODE_INFO
else
    # Downgrade to 'lspci' on EMR as it's not installed by default
    /usr/sbin/lspci | { grep '3D controller' || true; } >> $OUTPUT_NODE_INFO
    # Downgrade to 'lspci'
    if [[ "$PLATFORM_TYPE" == *"databricks"* ]]; then
        lspci | { grep '3D controller' || true; } >> $OUTPUT_NODE_INFO
    elif [ "$PLATFORM_TYPE" == "emr" ]; then
        /usr/sbin/lspci | { grep '3D controller' || true; } >> $OUTPUT_NODE_INFO
    fi
fi

echo "" >> $OUTPUT_NODE_INFO
@@ -82,19 +109,31 @@ java -version 2>> $OUTPUT_NODE_INFO

echo "" >> $OUTPUT_NODE_INFO
echo "[Spark version]" >> $OUTPUT_NODE_INFO
spark-submit --version 2>> $OUTPUT_NODE_INFO

if [[ "$PLATFORM_TYPE" == *"databricks"* ]]; then
if [ -f $SPARK_HOME/VERSION ]; then
echo "$(cat "$SPARK_HOME/VERSION")" >> $OUTPUT_NODE_INFO
else
echo 'not found' >> $OUTPUT_NODE_INFO
fi
else
if command -v $SPARK_HOME/bin/pyspark ; then
$SPARK_HOME/bin/pyspark --version 2>&1|grep -v Scala|awk '/version\ [0-9.]+/{print $NF}' >> $OUTPUT_NODE_INFO
else
echo 'not found' >> $OUTPUT_NODE_INFO
fi
fi

echo "" >> $OUTPUT_NODE_INFO
echo "[Spark rapids plugin]" >> $OUTPUT_NODE_INFO

if [ -z "$SPARK_HOME" ]; then
# Source spark env variables if not set $SPARK_HOME, e.g. on EMR node
source /etc/spark/conf/spark-env.sh
fi

if [ -f $SPARK_HOME/jars/rapids-4-spark*.jar ]; then
ls -l $SPARK_HOME/jars/rapids-4-spark*.jar >> $OUTPUT_NODE_INFO
elif [ -f /usr/lib/spark/jars/rapids-4-spark_n-0.jar ]; then
if [[ "$PLATFORM_TYPE" == *"databricks"* ]]; then
if [ -f $DATABRICKS_HOME/jars/rapids-4-spark*.jar ]; then
ls -l $DATABRICKS_HOME/jars/rapids-4-spark*.jar >> $OUTPUT_NODE_INFO
else
echo 'not found' >> $OUTPUT_NODE_INFO
fi
elif [ -f $SPARK_HOME/jars/rapids-4-spark*.jar ]; then
ls -l $SPARK_HOME/jars/rapids-4-spark*.jar >> $OUTPUT_NODE_INFO
else
echo 'not found' >> $OUTPUT_NODE_INFO
@@ -105,7 +144,12 @@ echo "[CUDA version]" >> $OUTPUT_NODE_INFO
if [ -f /usr/local/cuda/version.json ]; then
    cat /usr/local/cuda/version.json | grep '\<cuda\>' -A 2 | grep version >> $OUTPUT_NODE_INFO
else
    echo 'not found' >> $OUTPUT_NODE_INFO
    # Fetch CUDA version from nvidia-smi
    if command -v nvidia-smi ; then
        nvidia-smi --query | grep "CUDA Version" | awk '{print $4}' >> $OUTPUT_NODE_INFO
    else
        echo 'not found' >> $OUTPUT_NODE_INFO
    fi
fi

# Copy config files
@@ -31,7 +31,7 @@
  },
  "environment": {
    "//description": "Define the metadata related to the system, prerequisites, and configurations",
    "envParams": ["profile", "awsProfile", "deployMode"],
    "envParams": ["profile", "awsProfile", "deployMode", "sshPort", "sshKeyFile"],
    "//initialConfigList": "represents the list of the configurations that need to be loaded first",
    "initialConfigList": ["profile", "awsProfile", "cliConfigFile", "awsCliConfigFile", "awsCredentialFile"],
    "//loadedConfigProps": "list of properties read by the configParser",
@@ -23,7 +23,7 @@
  },
  "environment": {
    "//description": "Define the metadata related to the system, prerequisites, and configurations",
    "envParams": ["profile", "deployMode", "azureRegionSection", "azureStorageSection"],
    "envParams": ["profile", "deployMode", "azureRegionSection", "azureStorageSection", "sshPort", "sshKeyFile"],
    "//initialConfigList": "represents the list of the configurations that need to be loaded first",
    "initialConfigList": ["profile", "cliConfigFile", "azureCliConfigFile", "region", "azureRegionSection", "azureStorageSection"],
    "//loadedConfigProps": "list of properties read by the configParser",
Expand Down
