Infer cluster shape from single event log in Qualification tool #687
Conversation
Signed-off-by: Partho Sarthi <[email protected]>
Force-pushed ce3ccf0 to 12e7668.
There is only one driver. You can have extra processes, such as the Application Master, if running on YARN in client mode. We may care about those, but they can probably be inferred from the master and deploy mode.

Maybe this needs to be clarified as to when we care about the number of nodes. I assume here you are tracking the Host field in SparkListenerBlockManagerAdded; otherwise, by looking at the executor ID you are tracking executors, not hosts. If you are looking at unique host IPs, then you get the number of nodes. The number of executors will be unique host IP and port combinations, but you can use SparkListenerExecutorAdded below for that.

Note you also have to look for removed events, especially in the case of dynamic allocation. The point being: what exactly are we trying to figure out? If the user specified the number of executor instances, then it would be in the config; there is also a min and max number with dynamic allocation. Just because they specify it doesn't mean they actually got that number, so it's good to check. With dynamic allocation, if you are trying to determine the max at any point in time, then you have to go through the adds and removes and keep track of the max. Even for Databricks you may want to track the executor information, because even though they select a node type, they may run more than one executor on it if they configure things properly (this is not the usual case, but it needs to be accounted for).
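The host-vs-executor distinction discussed above can be sketched as follows. This is a hypothetical standalone snippet, not code from the PR; it assumes the standard Spark event-log JSON format, where `SparkListenerBlockManagerAdded` events carry a `Block Manager ID` with `Executor ID`, `Host`, and `Port` fields.

```python
import json

def count_nodes_and_executors(eventlog_path):
    """Scan a Spark event log (one JSON event per line) and return
    (unique executor hosts, unique executor IDs) from BlockManagerAdded
    events, excluding the driver's block manager."""
    hosts, executors = set(), set()
    with open(eventlog_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or partial lines
            if event.get("Event") == "SparkListenerBlockManagerAdded":
                bm_id = event.get("Block Manager ID", {})
                exec_id = bm_id.get("Executor ID")
                if exec_id and exec_id != "driver":
                    hosts.add(bm_id.get("Host"))    # nodes: unique host IPs
                    executors.add(exec_id)          # executors: unique IDs
    return hosts, executors
```

Counting unique hosts gives the node count, while counting executor IDs (or host/port pairs) gives the executor count; the two differ whenever more than one executor runs per node.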
Thanks @tgravescs for the feedback. We are trying to figure out the shape of the CPU cluster (the number of nodes used for the driver and executors, and their instance types) for calculating the cost of running the cluster.

Yes, we are looking at unique host IPs to extract the number of nodes.
So just going by the description, I'm not sure what our intended output is. You mention above it's just the number of specific host types in the cloud (on-prem not supported)? I.e., the final output I would expect would be something like: driver node type = hosttype_y, number of executor nodes = 4, executor node type = hosttype_x. How are you deciding it's hosttype_y or hosttype_x without some config being explicit? Say I have 2 executors with 8 cores each and 32 GB of memory. Do you care whether that is an m6gd.4xlarge vs. an i3.4xlarge, or something like that? Or do you need to know that? I'd have to look at each CSP's event logs to know if it's somewhere in there.
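The ambiguity raised here can be illustrated with a small sketch. The catalog below is a hypothetical example in the style of the `emr-config.json` entries mentioned later in the thread; matching on core count alone can return multiple (or zero) candidate instance types when memory is not considered.

```python
# Hypothetical instance catalog; real tools would load this from a
# CSP-specific config file (e.g. emr-config.json).
INSTANCE_CATALOG = [
    {"name": "m5d.2xlarge", "vCPUs": 8},
    {"name": "m5d.4xlarge", "vCPUs": 16},
]

def match_instance_by_cores(total_cores):
    """Return all catalog entries whose vCPU count equals the
    per-executor core count. Without memory information the match
    can be ambiguous across instance families."""
    return [e["name"] for e in INSTANCE_CATALOG if e["vCPUs"] == total_cores]
```

With a richer catalog containing, say, both `m6gd.4xlarge` and `i3.4xlarge` at 16 vCPUs, this lookup would return both names, which is exactly the ambiguity the reviewer is pointing out.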
I have added a log statement that shows the inferred cluster (see the output in the description for all CSPs).

Using EMR as an example, we add the following entries in emr-config.json:

Currently, we do not differentiate based on memory and only support a single instance series.

OnPrem cluster inference follows a slightly different path. To avoid making this PR larger, I can create a follow-up PR for OnPrem support.
Overall the approach seems fine based on the requirements. I think on-prem gets a bit more tricky, but that is a separate issue.
Thanks @tgravescs. I improved the logic to remove a host from the set during executor removal. For example:

Now, in this case, we will correctly count the number of hosts.
I had a quick look at the changes, but I need to spend more time going through the discussions.
I have some concerns related to the design:
- The tools-core module is in charge of loading and processing the eventlogs.
- Ideally, we don't want to parse an eventlog twice (first using Python, then using Spark APIs). This cuts the performance in half.
- We won't be able to handle large files unless we use wrappers that can handle credentials and buffer large eventlogs. The changes in the PR do not seem to get around that.
- If we move the extraction of the cluster shape to Scala, then we can also enable the same feature for the autotuner when needed.
I strongly recommend that tools-core handle any parsing and extraction related to the eventlog. This means that tools-core generates a new file that shows the CPU cluster shape for each appID.
The user-tools module then loads that file (if any) for cost-savings calculations.
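On the user-tools side, the proposed hand-off could look something like the sketch below. The file naming pattern (`*_cluster_shape.json`) and the record schema are hypothetical placeholders, since the actual format was left as an open design question in this thread.

```python
import json
from pathlib import Path

def load_cluster_shapes(output_dir):
    """Load per-application cluster-shape records emitted by tools-core.
    File name pattern and schema are hypothetical placeholders."""
    shapes = {}
    for f in Path(output_dir).glob("*_cluster_shape.json"):
        record = json.loads(f.read_text())
        shapes[record["appId"]] = record  # keyed by appID, as proposed above
    return shapes
```

The design advantage is that user-tools never touches the raw eventlog: it only reads a small, already-extracted summary per appID.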
I will take some time to think about it.
@parthosa Let's hold off on this until we have some discussions about the design.
Thanks @amahussein for the feedback.
Changing the PR to draft for now.
Closing the PR, as we need to do the implementation on the Scala side.
Contributes to #581. This PR infers the CPU cluster shape by parsing the event log. This will be used for savings estimation.
Inference Logic
- `num_drivers_nodes`
- `driver_instance`: `Spark Properties` -> `spark.databricks.driverNodeTypeId`
- `num_executor_nodes`: `SparkListenerBlockManagerAdded` -> `Block Manager ID` -> `Executor ID` != `driver`
- `executor_instance`: `Spark Properties` -> `spark.databricks.executorNodeTypeId`
- `num_cores`: `SparkListenerExecutorAdded` -> `Executor Info` -> `Total Cores`, matched against catalog entries such as:

`{"name": "m5d.2xlarge", "vCPUs": 8}`
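The `Spark Properties` lookups above can be sketched as follows; this is a hypothetical illustration, assuming the properties are carried in the `SparkListenerEnvironmentUpdate` event of a standard Spark event log (the Databricks keys named here are those listed in the mapping above).

```python
import json

def extract_node_types(eventlog_path):
    """Pull the Databricks driver/executor node type IDs from the
    SparkListenerEnvironmentUpdate event's Spark Properties, if present."""
    with open(eventlog_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            if event.get("Event") == "SparkListenerEnvironmentUpdate":
                props = event.get("Spark Properties", {})
                return (props.get("spark.databricks.driverNodeTypeId"),
                        props.get("spark.databricks.executorNodeTypeId"))
    return (None, None)
```

On non-Databricks CSPs these keys are absent, which is why `num_cores` plus the instance catalog is used as the fallback signal for the executor instance type.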
Changes
- `cluster_inference.py`: implements the above logic.
- `qualification.py`: modified to use the above inference (`_construct_cluster_config` to read the default cluster config and update the fields related to shape: `num_executors`, `executor_instance`; `describe`).

Input/Output
EMR - Inference Supported
CMD: `spark_rapids_user_tools emr qualification --eventlogs ~/eventlog_file -v`

Dataproc - Inference Supported
CMD: `spark_rapids_user_tools dataproc qualification --eventlogs ~/eventlog_file -v`

Databricks AWS - Inference Supported
CMD: `spark_rapids_user_tools databricks-aws qualification --eventlogs ~/eventlog_file -v`

Databricks Azure - Inference Supported
CMD: `spark_rapids_user_tools databricks-azure qualification --eventlogs ~/eventlog_file -v`

OnPrem - Inference Not Supported
CMD: `spark_rapids_user_tools onprem qualification --eventlogs ~/eventlog_file -v`

Dataproc-GKE - Inference Not Supported
CMD: `spark_rapids_user_tools dataproc-gke qualification --eventlogs ~/eventlog_file -v`

Eventlog is a directory - Inference Not Supported
CMD: `spark_rapids_user_tools emr qualification --eventlogs ~/eventlog_dir -v`