[FEA] Qualification tool can infer the CPU jobs' cluster shape and then provide the suggestion based on that #581
Comments
@viadea is the main problem we want to solve with this issue that a customer isn't able to provide a CPU cluster shape for cost estimation purposes? Or is it something else?
The main problem is that the event logs from the customer are not based on a single cluster shape, so they need to remove the --cpu-cluster option or use the jar version directly.
To be clear, the CPU cluster shape is not used in the speedup estimation, but only in the cost estimation. So we could try to infer the instance type from the executor information in the event log, but that would only impact the cost estimation for the projected CPU cluster shape (and subsequent GPU cluster shape).
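A minimal sketch of what "infer from the executor information" could look like, assuming a standard Spark event log (one JSON object per line); the `SparkListenerExecutorAdded` / `SparkListenerEnvironmentUpdate` field names follow the stock Spark listener-event format, and the helper name is hypothetical:

```python
import json

def infer_executor_shape(eventlog_lines):
    """Sketch: scan Spark event log lines and collect the executor count,
    cores per executor, and the configured executor memory.
    Real event logs may be compressed or rolled; this ignores that."""
    num_executors = 0
    cores_per_executor = None
    executor_memory = None
    for line in eventlog_lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines
        kind = event.get("Event")
        if kind == "SparkListenerExecutorAdded":
            num_executors += 1
            cores_per_executor = event.get("Executor Info", {}).get("Total Cores")
        elif kind == "SparkListenerEnvironmentUpdate":
            props = event.get("Spark Properties", {})
            executor_memory = props.get("spark.executor.memory")
    return {
        "numExecutors": num_executors,
        "coresPerExecutor": cores_per_executor,
        "executorMemory": executor_memory,
    }
```

The resulting shape could then be matched against a per-CSP instance catalog to pick a candidate instance type for costing.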
Draft of scope and requirements:
For other platforms such as Databricks, the instance type may be represented in the event log and could be used in place of a default for (2). This path of execution can be off by default but triggered by an opt-in flag.
Divided into two parts:
Here is a design overview for this feature:
Implementation details: construction of the CPU cluster object in user tools:
For example,
Method: I plan to divide this into two tasks:
This looks great. One consideration -- how could we also support different instance type families for a given CSP? Is it possible to use executor memory to find out if an instance is highmem or normal, or even high-disk or normal?
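One way to answer the family question is to bucket the memory-per-core ratio seen in the event log. A sketch of that heuristic, where the 4/8 GB-per-core cut points and family names are illustrative assumptions, not actual CSP definitions:

```python
def classify_family(cores, memory_gb):
    """Heuristic sketch: bucket an executor's memory-per-core ratio into a
    hypothetical instance family. Thresholds here are illustrative only."""
    ratio = memory_gb / cores
    if ratio >= 8:
        return "highmem"   # e.g. 8 cores / 64 GB
    if ratio >= 4:
        return "standard"  # e.g. 8 cores / 32 GB
    return "highcpu"       # e.g. 16 cores / 32 GB
```

High-disk variants would need a different signal (e.g. configured shuffle/local storage), since executor memory alone cannot distinguish them.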
Also -- can we use a JSON or YAML format instead of CSV to pass between core tools and user tools for this? Seems like that will be easier to maintain.
We can use
I selected CSV for two reasons: (1) customers can view the cluster inference file and verify it, and (2) the existing output was based on CSV format. However, based on discussion with @amahussein, we decided to store the cluster information in JSON format, as it will be simpler to parse. Sample JSON output:
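The actual sample output was not captured in this thread; a hypothetical shape for such a cluster-information file, where every field name and value is an illustrative assumption, might look like:

```json
{
  "sourceCluster": {
    "platform": "databricks-aws",
    "driverNodeType": "m5.2xlarge",
    "workerNodeType": "m5.4xlarge",
    "numWorkerNodes": 4
  }
}
```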
Note: some new cluster node recommendations are happening in #1160, so this should wait for and use those node recommendations.
This is completed, as #1160 is merged and cost savings are turned off.
I wish the Qualification tool could detect the CPU jobs' cluster shape and then provide the suggestion based on that.
Currently the qualification tool as designed uses a single cluster shape as input for the set of logs it is analyzing. The user would have to run the qual tool separately on the batch of logs for each unique cluster shape.
A common scenario is:
The user who runs the Qualification tool may not be the jobs' owner; as a result, it is difficult for them to first split the jobs into different batches based on cluster shape.
They just want to run the Qualification tool on all of the jobs at once.
If the Qualification tool can detect the worker node information from each individual event log, then we do not need the cluster shape information as input.
For example, Databricks event logs, at least, contain the worker type information.
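A sketch of pulling that worker type out of a Databricks event log, assuming the `spark.databricks.clusterUsageTags.clusterNodeType` property appears in the environment-update event (typical of Databricks event logs, but the property name may vary by release; the helper name is hypothetical):

```python
import json

def databricks_worker_type(eventlog_lines):
    """Sketch: return the Databricks worker node type recorded in the
    SparkListenerEnvironmentUpdate event, or None if not present."""
    for line in eventlog_lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines
        if event.get("Event") == "SparkListenerEnvironmentUpdate":
            props = event.get("Spark Properties", {})
            return props.get("spark.databricks.clusterUsageTags.clusterNodeType")
    return None
```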