- 
                Notifications
    You must be signed in to change notification settings 
- Fork 17
Grove proposal/topology #224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ator Signed-off-by: Ron Kahn <[email protected]>
Signed-off-by: Ron Kahn <[email protected]>
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | **Motivation**: Topology-aware scheduling is critical for Grove's multinode inference workloads because these | ||
| applications require: | ||
|  | ||
| - **Network Locality**: High-bandwidth communication between prefill and decode workers benefits from proximity | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the impression we mainly aim to pack different roles of the same inference pipeline phase: workers and leader. At the moment, we were not requested to specifically pack prefill and decode workers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its was left over from the support of group affinities, before we decided to postpone this required
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | PodClique) | ||
|  | ||
| **Key Feature**: Grove provides automatic out-of-box topology optimization by generating preferred packing constraints | ||
| at all levels, even without user configuration. Users can optionally specify required constraints for strict placement | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it say something like "at the global/workload level"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are adding preferred to each level, so it will as close it can be, and not only at workload level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the out-of-box topology optimization that is generated by Grove especially when no pack configuration is defined by the user? Ideally if no pack configuration is specified then should it not mean that there is no strict requirement or in other words there is no packing constraints that is observed by the scheduler while taking scheduling decisions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If no configuration is provided, and the user has no real constraints that they need to follow, how would they be able to opt out if Grove always tries for the optimal by default?
If by default Grove tries for the optimal placement, and the scheduler is unable to schedule pods with these constraints, then the workload will never fully be up even though the user has no requirement for it.
Perhaps we can provide a field in the API through which the user indicates that they want out-of-box topology optimization. Or, we can provide a way for user to opt out of these out-of-box optimizations if they do not really care?
typo fix Co-authored-by: Roman Baron <[email protected]> Signed-off-by: Ron Kahn <[email protected]>
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | **Out-of-Box Optimization:** | ||
|  | ||
| - Operator automatically generates **preferred** constraints using strictest topology level (e.g., "host") | ||
| - Applied at all three levels (PodGang, NetworkPackGroup, PodGroup) during translation to scheduler API | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not only at the PodGang level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention is to make the scheduler prioritize keeping pods with a shared rule or relationship (like a 'leader' and its 'workers') together.
To illustrate: Imagine a host has capacity for one more pod. The scheduler has two choices:
- 
An unrelated pod from the same workload. 
- 
A 'worker' pod that belongs to a service already running on that host. 
With this change, the scheduler should strongly prefer option 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically prefill/decode worker/leader will be realized as a PCSG. Each replica of PCSG will be translated to a PodGang, so what you want @romanbaron will already happen for the case that @Ronkahn21 is describing above.
However this cannot be a generic rule, you would need flexibility to define pack constraints at every level.
Signed-off-by: Ron Kahn <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Added some comments to simplify the implementation.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | **Design Approach**: This design introduces a flexible topology system with three main components: | ||
|  | ||
| 1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names to node labels | ||
| 2. **Operator Configuration**: Selects active topology via `--topology-domain-name` argument | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we fix a default topology CR name for the controller, e.g. grove-topology?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could but what will the value of this? the admin create the CR, He could choose any name he want,
if we were the one to create the crd I will agree with you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if the user creates multiple instances of this CustomResource? How will Grove decide which to pick?
During runtime, there is only one CR that Grove must use for specifying topologies.
If we allow any name, we will then have to implement a validating webhook that ensures that only one instance of this CustomResource exists in the entire cluster. This seems more complicated the enforcing the name of the CustomResource that is to be created, which Grove will always look for.
If the consumer chooses to change the topology, then they will edit this configuration CR, which Grove will make use of by fetching it and making use of the newly specified mappings before creating any resources.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | ### Component Interactions | ||
|  | ||
| ``` | ||
| TopologyDomain CRD ──┐ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CRD -> CR
|  | ||
| The TopologyDomain controller manages: | ||
|  | ||
| - **Kueue Topology Generation**: Auto-creates Kueue Topology CRD for KAI scheduler integration | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- CRD -> CR
- I don't think this should be Grove operator responsibility. Instead, KAI PodGrouper's Grove plugin can do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to say I am a bit perplexed why Grove and KAI wouldnt use same CRD
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | │ Admin Layer: │ | ||
| │ ┌──────────────────┐ ┌────────────────────┐ │ | ||
| │ │ TopologyDomain │─────────────▶│ TopologyDomain │ │ | ||
| │ │ CRD │ │ Controller │ │ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CRD -> CR
| **Example TopologyDomain:** | ||
|  | ||
| ```yaml | ||
| apiVersion: grove.run.ai/v1alpha1 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grove.run.ai/v1alpha1 -> grove.io/v1alpha1
| // TopologyDomain defines the topology hierarchy for the cluster | ||
| // This resource is immutable after creation | ||
| // Multiple TopologyDomain resources can exist; Grove operator references one via --topology-domain-name argument | ||
| type TopologyDomain struct { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this to just Topology ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we need to use both our topology and kueue topology and kueue already taken the name Topology,
I was thinking it would make confusion and will annoying to add the full name (include api group and version) of each topology.
what do you think?
| Spec TopologyDomainSpec `json:"spec,omitempty"` | ||
| } | ||
|  | ||
| type TopologyDomainSpec struct { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this to TopologySpec
|  | ||
| The TopologyDomain controller manages the TopologyDomain resource lifecycle with two primary responsibilities: | ||
|  | ||
| **1. Kueue Topology Generation** | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This responsibility should be with KAI PodGrouper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its depend who you ask, some the kai team didnt agree about this
(we could do it for kubecon, but overall I believe we need to integrate with other and not expect other to do it for us), some day vlolcano might support this api, They will never create the topology they using for us
| // TopologyRef references the Kueue Topology resource | ||
| // Points to Kueue Topology CRD auto-generated by TopologyDomain controller | ||
| // +optional | ||
| TopologyRef *string `json:"topologyRef,omitempty"` | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We won't need this if we move the Kueue topology generation to KAI PodGrouper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we still need it, if the name of the topology isn't fixed, as it will need to know the name the user topology and create kueue topology out of it
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      |  | ||
| - **Network Locality**: Proximity improves high-bandwidth communication between leaders and their respective workers ( | ||
| prefill and decode, etc) | ||
| - **Coordinated Placement**: Related components (e.g., model shards) perform better when co-located within the same | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the indentations in the file should be fixed, they're not formatted properly. Both bullet and numbered ones.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | - Enable packing constraints for network locality across all Grove scalable resources | ||
| - Support multiple topology configurations for different environments | ||
| - Automatic Kueue Topology generation for KAI scheduler integration | ||
| - Immutable topology configuration ensuring scheduling consistency | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means that in order to fix or edit an existing configuration, the admin will have to create a new one, switch to it and delete the previous one.
Can we consider allowing edits if no PCS is currently using the topology? Similarly to how deletions are handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are correct, it a bit wired, I will change this section and clearfy that deletion fo topology is allowed only if no one ref them, it will be easier and more logical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a PCS does end up using this CR, then we are stuck with it forever/until that PCS is deleted.
An easier way to tackle this is to specify a reference to an instance of the configuration CR? Newly created PCS-s can rely on a different instance of the CR, which has the correct configuration?
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | 2. **Operator Configuration**: References TopologyDomain by name | ||
| - Operator argument `--topology-domain-name=default` selects which TopologyDomain to use | ||
| - All workload validation performed against configured TopologyDomain | ||
| - Enables switching between topologies without changing workloads | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean here? Maybe "without changing the workload specs"? That's also not always true, it depends on the nature of the change in topology configuration (e.g. whether level names were changed.)
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      |  | ||
| - Cluster administrators are responsible for ensuring that node labels specified in `topologyKey` fields exist on | ||
| cluster nodes | ||
| - TopologyDomain creation succeeds even if labels don't exist yet (allows pre-configuration) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the requirement for pre-configuration, but I believe we should at least offer the admin to provide validation for the TopologyDomain resource. Leaving misconfiguration errors to the user, upon workload creation, is (a) too late and (b) not the right persona to handle.
I suggest to allow some annotation for example (either opt-in or out) that instructs the operator to validate the topology resource against the nodes in the cluster.
Sample basic validation we can apply there: each of the node labels must exist on at least one node in the cluster. Perhaps we can think of a stronger validation, but that's a start and will catch simple typos.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | Configuration: | ||
|  | ||
| - `--topology-domain-name`: Specifies TopologyDomain resource name for validation | ||
| - Operator loads referenced TopologyDomain at startup | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please mention here that this is an optional argument, the operator can run without it.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      |  | ||
| **Immutability:** | ||
|  | ||
| - All TopologyConstraint fields immutable after resource creation | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is that? Do we plan to lift this limitation in the future, and describing here the contents of the first phase?
I mean, we can allow changing the topology constraint of a component in two cases:
- It doesn't conflict with the overall topology spec of the PCS as a whole.
- The rolling update is for the whole PCS (rather than individual components), a case in which a complete PCS replica is going to be created, and it can use the new topology constraints.
But again, if this note is mentioned here as "out of scope for now" I definitely agree we better not deal with this complication at this stage.
|  | ||
| **Preferred Constraints (Auto-Generated):** | ||
|  | ||
| - Operator ALWAYS generates preferred constraint at all three levels | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick, but this is true unless the user provided a "Required" constraint for the same (strictest) level. Or we don't care and set preferred in addition, as it's harmless?
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | - Multiple TopologyDomains supported for different environments | ||
|  | ||
| 2. **Operator Configuration**: References TopologyDomain by name | ||
| - Operator argument `--topology-domain-name=default` selects which TopologyDomain to use | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to have a annotation in the TopologyDomain CR "is-default: true" that only one topology has, and grove knows which one is the default.
A benefit of that, is that you can change the used topology without restarting the grove controller.
The bad side is that you need to handle what happen if there are 2 instances or more with that annotation.
Thats how default StorageClass is implemented in K8S: https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | 2. **Operator Configuration**: References TopologyDomain by name | ||
| - Operator argument `--topology-domain-name=default` selects which TopologyDomain to use | ||
| - All workload validation performed against configured TopologyDomain | ||
| - Enables switching between topologies without changing workloads | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the user has the ability to override that? Or will have in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have only partly done the review. Posting comments in batches.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      |  | ||
| **Design Approach**: This design introduces a flexible topology system with three main components: | ||
|  | ||
| 1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names to node labels | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the TopologyDomain CRD required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm proposing a CRD for cluster topology to support complex scenarios, like a single cluster with multiple network layouts (e.g., for hybrid on-prem/cloud environments).
This lets admins dynamically define multi-level topologies with user-friendly names (rack, spine) and is far more extensible than a hard-coded list or ConfigMap, allowing for a future status field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can friendly names be expanded on here? Perhaps examples like rack, spine, and host can be added in brackets so the sentence is clearer in a first pass?
Or we can instead write cluster topology hierarchy mapping Grove defined cluster topologies to node labels?
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | **Design Approach**: This design introduces a flexible topology system with three main components: | ||
|  | ||
| 1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names to node labels | ||
| 2. **Operator Configuration**: Selects active topology via `--topology-domain-name` argument | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is active topology?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the default topology configured by admin, and applied to all the pcs submitted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not provide configuration flags like --topology-domain-name in Grove. The operator can only be configured through OperatorConfiguration, where each configuration is a field/subfield in the manfiest file provided. So this could be OperatorConfiguration.TopologyDomainName.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | PodClique) | ||
|  | ||
| **Key Feature**: Grove provides automatic out-of-box topology optimization by generating preferred packing constraints | ||
| at all levels, even without user configuration. Users can optionally specify required constraints for strict placement | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the out-of-box topology optimization that is generated by Grove especially when no pack configuration is defined by the user? Ideally if no pack configuration is specified then should it not mean that there is no strict requirement or in other words there is no packing constraints that is observed by the scheduler while taking scheduling decisions?
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      |  | ||
| - Provide flexible, cluster-agnostic topology hierarchy definition via TopologyDomain CRD | ||
| - Enable packing constraints for network locality across all Grove scalable resources | ||
| - Support multiple topology configurations for different environments | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does environment mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cluster with hybrid on-prem/cloud environments.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | - Provide flexible, cluster-agnostic topology hierarchy definition via TopologyDomain CRD | ||
| - Enable packing constraints for network locality across all Grove scalable resources | ||
| - Support multiple topology configurations for different environments | ||
| - Automatic Kueue Topology generation for KAI scheduler integration | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
who generates the keueu topology CR. It should not be done in Grove IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of KAI maintains prefer that grove will integrate with KAI and not the other way around, I belive we are the one that support to integrate with other schedulers today it's KAI but tomorrow it might be Volcano, so if KAI work in a cerain way we are the one that should create it IMHO, (also KAI need to agree for this)
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      |  | ||
| - Operator automatically generates **preferred** constraints using strictest topology level (e.g., "host") | ||
| - Applied at all three levels (PodGang, NetworkPackGroup, PodGroup) during translation to scheduler API | ||
| - Users get optimal packing without configuration | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this is the correct as described above. Grove does not understand the workload semantics and therefore is in no position to determine the optimal packing configuration for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I describe it to optimal, it better to say it provides an opportunistic performance boost
|  | ||
| **User Control:** | ||
|  | ||
| - Users can specify **required** constraints via `packDomain` for strict placement requirements | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can remove the word strict here as users can also define relaxed pack constraints as well.
| // Description provides human-readable information about this level | ||
| // +kubebuilder:validation:MaxLength=1024 | ||
| // +optional | ||
| Description string `json:"description,omitempty"` | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this required? who will use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought i will helpful for users who want to understand better the levels but it only suggestion
| ├─────────────────────────────────────────────────────────────────────────┤ | ||
| │ │ | ||
| │ Admin Layer: │ | ||
| │ ┌──────────────────┐ ┌────────────────────┐ │ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you recommend a different controller for it when it is part of the creation flow of a PCS?
|  | ||
| ### Controller Responsibilities | ||
|  | ||
| The TopologyDomain controller manages: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should Grove manage any Kueue CRs? Today KAI scheduler decides to have dependency on Kueue but that is the choice that KAI makes. So while Grove today uses KAI scheduler its responsibility should not be tied to a single scheduler (in this case KAI scheduler). Therefore creation of CRs that are required by KAI should be done in KAI using the enhancements to the PodGang scheduler API that Grove defines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree with you about this,KAI are doing us a favor here (they would not agree it on them), if you want to integrate with them not the other way around, in the feature we would want to integrate Volcano/unicron/ or even the default scheduler you need to work with the API they agree upon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the proposal!
1/n as I have yet to go through most of the document.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | 1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names to node labels | ||
| 2. **Operator Configuration**: Selects active topology via `--topology-domain-name` argument | ||
| 3. **TopologyConstraint**: User-specified packing requirements in workloads (PodCliqueSet, PodCliqueScalingGroup, | ||
| PodClique) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each numbered point is not being rendered in its own line because of the indentation. Can this be corrected? There are also multiple other places in the file where there are unnecessary newlines/indentation which causes the rendered markdown to look weird.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      |  | ||
| ## Overview | ||
|  | ||
| This document defines the design for implementing topology-aware scheduling in the Grove operator. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| This document defines the design for implementing topology-aware scheduling in the Grove operator. | |
| This document defines the design for supporting topology-aware scheduling in the Grove operator. | 
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | applications require: | ||
|  | ||
| - **Network Locality**: Proximity improves high-bandwidth communication between leaders and their respective workers ( | ||
| prefill and decode, etc) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
A period is typically added after the abbreviation etc.. Also, unnecessary new line.
| prefill and decode, etc) | |
| prefill and decode, etc.) | 
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      |  | ||
| **Design Approach**: This design introduces a flexible topology system with three main components: | ||
|  | ||
| 1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names to node labels | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can friendly names be expanded on here? Perhaps examples like rack, spine, and host can be added in brackets so the sentence is clearer in a first pass?
Or we can instead write cluster topology hierarchy mapping Grove defined cluster topologies to node labels?
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | **Design Approach**: This design introduces a flexible topology system with three main components: | ||
|  | ||
| 1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names to node labels | ||
| 2. **Operator Configuration**: Selects active topology via `--topology-domain-name` argument | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if the user creates multiple instances of this CustomResource? How will Grove decide which to pick?
During runtime, there is only one CR that Grove must use for specifying topologies.
If we allow any name, we will then have to implement a validating webhook that ensures that only one instance of this CustomResource exists in the entire cluster. This seems more complicated the enforcing the name of the CustomResource that is to be created, which Grove will always look for.
If the consumer chooses to change the topology, then they will edit this configuration CR, which Grove will make use of by fetching it and making use of the newly specified mappings before creating any resources.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | **Design Approach**: This design introduces a flexible topology system with three main components: | ||
|  | ||
| 1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names to node labels | ||
| 2. **Operator Configuration**: Selects active topology via `--topology-domain-name` argument | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not provide configuration flags like --topology-domain-name in Grove. The operator can only be configured through OperatorConfiguration, where each configuration is a field/subfield in the manfiest file provided. So this could be OperatorConfiguration.TopologyDomainName.
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | PodClique) | ||
|  | ||
| **Key Feature**: Grove provides automatic out-of-box topology optimization by generating preferred packing constraints | ||
| at all levels, even without user configuration. Users can optionally specify required constraints for strict placement | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If no configuration is provided, and the user has no real constraints that they need to follow, how would they be able to opt out if Grove always tries for the optimal by default?
If by default Grove tries for the optimal placement, and the scheduler is unable to schedule pods with these constraints, then the workload will never fully be up even though the user has no requirement for it.
Perhaps we can provide a field in the API through which the user indicates that they want out-of-box topology optimization. Or, we can provide a way for user to opt out of these out-of-box optimizations if they do not really care?
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | - Enable packing constraints for network locality across all Grove scalable resources | ||
| - Support multiple topology configurations for different environments | ||
| - Automatic Kueue Topology generation for KAI scheduler integration | ||
| - Immutable topology configuration ensuring scheduling consistency | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the configuration is changed, and the PCS is mutated, will Grove then go ahead and recreate all the children PCSG and PCLQs because the previous configurations were less strict, and now Grove enforces the more strict configuration?
        
          
                docs/designs/topology.md
              
                Outdated
          
        
      | - Enable packing constraints for network locality across all Grove scalable resources | ||
| - Support multiple topology configurations for different environments | ||
| - Automatic Kueue Topology generation for KAI scheduler integration | ||
| - Immutable topology configuration ensuring scheduling consistency | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a PCS does end up using this CR, then we are stuck with it forever/until that PCS is deleted.
An easier way to tackle this is to specify a reference to an instance of the configuration CR? Newly created PCS-s can rely on a different instance of the CR, which has the correct configuration?
Signed-off-by: Ron Kahn <[email protected]>
Signed-off-by: Ron Kahn <[email protected]>
Signed-off-by: Ron Kahn <[email protected]>
Co-authored-by: Saketh Kalaga <[email protected]> Signed-off-by: Ron Kahn <[email protected]>
| - Provide flexible, cluster-agnostic topology hierarchy definition via TopologyDomain CRD | ||
| - Enable packing constraints for network locality across all Grove scalable resources | ||
| - Support multiple topology configurations for different environments | ||
| - Automatic Kueue Topology generation for KAI scheduler integration | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU this is not a goal anymore, so please remove it. Also appears in other parts of the document, search for "kueue" to find them.
| - Enable packing constraints for network locality across all Grove scalable resources | ||
| - Support multiple topology configurations for different environments | ||
| - Automatic Kueue Topology generation for KAI scheduler integration | ||
| - Immutable topology configuration ensuring scheduling consistency | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please be more explicit here and clarify that the immutable topology is of the global configuration, not the require topology of a certain workload (as there it's now mutable.)
|  | ||
| **Out-of-Box Optimization:** | ||
|  | ||
| - Operator automatically generates **preferred** constraints using strictest topology level (e.g., "host") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this agreed or still under discussion? @unmarshall commented about it in a different place.
What type of PR is this?
/kind documentation
/kind feature
What this PR does / why we need it:
This PR adds a comprehensive design document for topology-aware scheduling in the Grove operator. The design introduces a flexible topology system that enables optimal placement of multinode inference
workloads based on cluster network topology.
Key components:
--topology-domain-nameargumentThe design addresses critical requirements for multinode inference workloads including network locality, coordinated placement, and latency optimization.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a API change?
suggest change