From ad9b2dba51eb63299700c706fb96bfea2d88eeda Mon Sep 17 00:00:00 2001 From: Ron Kahn Date: Tue, 21 Oct 2025 12:58:38 +0300 Subject: [PATCH 01/15] feat: add design document for topology-aware scheduling in Grove operator Signed-off-by: Ron Kahn --- docs/designs/topology.md | 908 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 908 insertions(+) create mode 100644 docs/designs/topology.md diff --git a/docs/designs/topology.md b/docs/designs/topology.md new file mode 100644 index 00000000..4b4b0070 --- /dev/null +++ b/docs/designs/topology.md @@ -0,0 +1,908 @@ +# Topology-Aware Scheduling - Grove Operator Design + +## Overview + +This document defines the design for implementing topology-aware scheduling in the Grove operator.. + +**Motivation**: Topology-aware scheduling is critical for Grove's multinode inference workloads because these +applications require: + +- **Network Locality**: High-bandwidth communication between prefill and decode workers benefits from proximity + - **Coordinated Placement**: Related components (e.g., model shards) perform better when co-located within the same + topology domain + - **Latency Optimization**: Minimizing network hops between interdependent inference components improves end-to-end + performance + +**Design Approach**: This design introduces a flexible topology system with three main components: + +1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names to node labels + 2. **Operator Configuration**: Selects active topology via `--topology-domain-name` argument + 3. **TopologyConstraint**: User-specified packing requirements in workloads (PodCliqueSet, PodCliqueScalingGroup, + PodClique) + +**Key Feature**: Grove provides automatic out-of-box topology optimization by generating preferred packing constraints +at all levels, even without user configuration. Users can optionally specify required constraints for strict placement +requirements. + +## Goals + +- Provide flexible, cluster-agnostic topology hierarchy definition via TopologyDomain CRD + - Enable packing constraints for network locality across all Grove scalable resources + - Support multiple topology configurations for different environments + - Automatic Kueue Topology generation for KAI scheduler integration + - Immutable topology configuration ensuring scheduling consistency + - Hierarchical constraint validation (child stricter than parent) + +## Non-Goals + +- Spread constraints across topology domains (ReplicaSpreadDomain) + - Root domain constraints for entire resource (RootDomain) + - Ratio-based affinity groups between scaling groups (AffinityGroups with PackRatio) + - Dynamic topology reconfiguration after creation + - Per-workload topology domain selection + - Automatic topology inference from workload characteristics + +## Proposal + +### High-Level Approach + +Grove implements topology-aware scheduling through three key components: + +1. **TopologyDomain CRD**: Cluster-scoped resource defining topology hierarchy + - Admin creates TopologyDomain with ordered list of topology levels + - Each level maps friendly name (e.g., "rack") to node label key (e.g., "topology.kubernetes.io/rack") + - Multiple TopologyDomains supported for different environments + + 2. 
**Operator Configuration**: References TopologyDomain by name + - Operator argument `--topology-domain-name=default` selects which TopologyDomain to use + - All workload validation performed against configured TopologyDomain + - Enables switching between topologies without changing workloads + + 3. **Workload API (TopologyConstraint)**: Users specify packing requirements + - PodCliqueSet, PodCliqueScalingGroup, and PodClique each have TopologyConstraint field + - Users reference level names from TopologyDomain (e.g., `packDomain: "rack"`) + - No direct TopologyDomain reference needed in workloads + +### Component Interactions + +``` +TopologyDomain CRD ──┐ + (admin creates) │ + │ +Operator Config ─────┼──> Operator validates PackDomain + (--topology- │ against TopologyDomain.Spec.Levels + domain-name) │ + │ +PodCliqueSet ────────┘ + (packDomain: "rack") +``` + +### Automatic Optimization + +**Out-of-Box Optimization:** + +- Operator automatically generates **preferred** constraints using strictest topology level (e.g., "host") + - Applied at all three levels (PodGang, NetworkPackGroup, PodGroup) during translation to scheduler API + - Users get optimal packing without configuration + +**User Control:** + +- Users can specify **required** constraints via `packDomain` for strict placement requirements + - Required constraints validated and must be satisfied + - Preferred constraints enable best-effort optimization with graceful fallback + +### Controller Responsibilities + +The TopologyDomain controller manages: + +- **Kueue Topology Generation**: Auto-creates Kueue Topology CRD for KAI scheduler integration + - **Deletion Protection**: Prevents deletion while PodCliqueSet resources reference it + +## Out of Scope + +The following features are explicitly out of scope for this design: + +- **Spread Constraints**: ReplicaSpreadDomain for distributing replicas across domains for fault tolerance is not + supported + - **Advanced Topology Constraints Per Replica**: RootDomain for constraining entire resource (all replicas) within a + topology domain is not supported + - **Ratio Grouping Between Groups**: AffinityGroups with PackRatio for complex workload patterns (e.g., 2 Prefill + + 1 + Decode ratios) is not supported + - **Workload-Based Auto Constraints**: Automatic constraint generation based on workload characteristics, patterns, + and + inference requirements + +## Design Details + +### Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────────────────┐ +│ Topology Architecture │ +├─────────────────────────────────────────────────────────────────────────┤ +│ │ +│ Admin Layer: │ +│ ┌──────────────────┐ ┌────────────────────┐ │ +│ │ TopologyDomain │─────────────▶│ TopologyDomain │ │ +│ │ CRD │ │ Controller │ │ +│ │ (levels list) │ └─────────┬──────────┘ │ +│ └──────────────────┘ │ │ +│ │ │ │ +│ │ ▼ │ +│ │ ┌────────────────────┐ │ +│ │ │ Kueue Topology │ │ +│ │ │ (auto-generated) │ │ +│ │ └────────────────────┘ │ +│ │ │ +│ Operator Config: --topology-domain-name=default │ +│ │ │ +│ │ (validates against) │ +├─────────┼───────────────────────────────────────────────────────────────┤ +│ │ │ +│ User Layer: │ +│ ▼ │ +│ ┌──────────────────┐ ┌────────────────────┐ │ +│ │ PodCliqueSet │─────────────▶│ Grove Operator │ │ +│ │ (packDomain) │ │ (reconciles) │ │ +│ └──────────────────┘ └─────────┬──────────┘ │ +│ │ │ +│ │ (translates) │ +│ ▼ │ +│ ┌────────────────────┐ │ +│ │ PodGang │───────▶ KAI │ +│ │ • TopologyRef │ Scheduler │ +│ │ • 3-level topology │ │ +│ │ 
(required+ │ │ +│ │ preferred) │ │ +│ └────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────┘ +``` + +### 1. TopologyDomain Infrastructure + +#### TopologyDomain CRD + +TopologyDomain is a cluster-scoped CRD that defines the topology hierarchy for scheduling. It maps friendly level names +to Kubernetes node labels and establishes ordering from broadest to narrowest scope. + +**Characteristics:** + +- **Cluster-scoped**: Multiple TopologyDomains can exist + - **Operator-selected**: Operator references one by name via `--topology-domain-name` argument + - **Immutable**: Once created, cannot be modified + - **List-ordered hierarchy**: Index 0 = broadest (e.g., region), last = narrowest (e.g., host) + +**API Structure:** + +```go +// TopologyDomain defines the topology hierarchy for the cluster +// This resource is immutable after creation +// Multiple TopologyDomain resources can exist; Grove operator references one via --topology-domain-name argument +type TopologyDomain struct { +metav1.TypeMeta `json:",inline"` +metav1.ObjectMeta `json:"metadata,omitempty"` + +Spec TopologyDomainSpec `json:"spec,omitempty"` +} + +type TopologyDomainSpec struct { +// Levels is an ordered list of topology levels from broadest to narrowest scope +// The order in this list defines the hierarchy (index 0 = highest level) +// This field is immutable +// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="levels list is immutable" +// +kubebuilder:validation:MinItems=1 +// +kubebuilder:validation:MaxItems=10 +Levels []TopologyLevel `json:"levels"` +} + +type TopologyLevel struct { +// Name is the level identifier used in TopologyConstraint references +// Must be a valid DNS label (lowercase alphanumeric with hyphens) +// Examples: "zone", "rack", "host" +// +kubebuilder:validation:Required +// +kubebuilder:validation:MinLength=1 +// +kubebuilder:validation:MaxLength=63 +// +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` +Name string `json:"name"` + +// TopologyKey is the node label key that identifies this topology domain +// Must be a valid Kubernetes label key (qualified name) +// Examples: "topology.kubernetes.io/zone", "kubernetes.io/hostname" +// +kubebuilder:validation:Required +// +kubebuilder:validation:MinLength=1 +// +kubebuilder:validation:MaxLength=316 +// +kubebuilder:validation:Pattern=`^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$` +TopologyKey string `json:"topologyKey"` + +// Description provides human-readable information about this level +// +kubebuilder:validation:MaxLength=1024 +// +optional +Description string `json:"description,omitempty"` +} +``` + +**Example TopologyDomain:** + +```yaml +apiVersion: grove.run.ai/v1alpha1 +kind: TopologyDomain +metadata: + name: default +spec: + levels: + - name: region + topologyKey: "topology.kubernetes.io/region" + description: "Cloud provider region" + - name: zone + topologyKey: "topology.kubernetes.io/zone" + description: "Availability zone within region" + - name: datacenter + topologyKey: "topology.kubernetes.io/datacenter" + description: "Data center within zone" + - name: block + topologyKey: "topology.kubernetes.io/block" + description: "Switching block within datacenter" + - name: rack + topologyKey: "topology.kubernetes.io/rack" + description: "Network rack grouping" + - name: host + topologyKey: "kubernetes.io/hostname" + description: "Individual compute host" + - name: numa + topologyKey: 
"topology.kubernetes.io/numa" + description: "NUMA node within host" +``` + +**Creating TopologyDomain:** + +Steps: + +1. Install Grove: `helm install grove` + 2. Customize example above with your cluster's actual `topologyKey` values + 3. Create resource: `kubectl apply -f topologydomain.yaml` + 4. Configure operator with `--topology-domain-name` matching the resource name + 5. Create workloads with topology constraints + +Notes: + +- TopologyDomain becomes immutable after creation + - Multiple TopologyDomains can exist; operator uses the one specified in its argument + - Ensure node labels exist on cluster nodes before creating workloads + - List order defines hierarchy: index 0 = broadest, last = narrowest + - Example hierarchy: `region` (0) > `zone` (1) > `datacenter` (2) > `block` (3) > `rack` (4) > `host` (5) > `numa` ( + 6) + +**Validation:** + +CRD-Level: + +- At least one level required (minimum 1, maximum 10) + - Level `name` required (max 63 chars) + - Level `topologyKey` required (max 316 chars) + - Level `description` optional (max 1024 chars) + - Entire levels list immutable after creation + +Webhook: + +- Each level `name` must be unique within the `levels` array of a single TopologyDomain + - Each `topologyKey` must be unique within the `levels` array of a single TopologyDomain + - Cannot modify any field after creation + - Deletion protection via controller finalizer + +**Node Label Responsibility:** + +- Cluster administrators are responsible for ensuring that node labels specified in `topologyKey` fields exist on + cluster nodes + - TopologyDomain creation succeeds even if labels don't exist yet (allows pre-configuration) + - Workloads may fail to schedule if referenced topology labels are missing from nodes + - Administrators should verify node labels match TopologyDomain configuration before creating workloads + +#### TopologyDomain Controller + +The TopologyDomain controller manages the TopologyDomain resource lifecycle with two primary responsibilities: + +**1. Kueue Topology Generation** + +Automatically generates Kueue Topology CRD from the TopologyDomain referenced by operator's `--topology-domain-name` +argument. + +**Why Kueue Topology is Required:** + +Grove uses its own TopologyDomain CRD for user-friendly admin/user API, but KAI scheduler specifically requires Kueue's +Topology CRD format for actual scheduling operations. The TopologyDomain controller bridges this gap by: + +- Reading Grove's TopologyDomain (user-friendly with level names like "rack", "zone") + - Automatically generating Kueue Topology (KAI scheduler's required format with node labels only) + - Maintaining consistency between both representations + - Eliminating manual coordination for admins + +This separation allows Grove to provide better UX while maintaining compatibility with KAI scheduler requirements. + +Generation Process: + +1. Controller watches TopologyDomain specified in operator argument + 2. When TopologyDomain created, controller creates matching Kueue Topology + 3. Kueue Topology name matches TopologyDomain name + 4. Levels extracted from TopologyDomain.Spec.Levels using topologyKey field + 5. 
Order preserved from TopologyDomain list + +Example: + +From TopologyDomain `default` with levels zone/rack/host, controller generates: + +```yaml +apiVersion: kueue.x-k8s.io/v1alpha1 +kind: Topology +metadata: + name: default + ownerReferences: + - apiVersion: grove.run.ai/v1alpha1 + kind: TopologyDomain + name: default + controller: true +spec: + levels: + - nodeLabel: "topology.kubernetes.io/zone" + - nodeLabel: "topology.kubernetes.io/rack" + - nodeLabel: "kubernetes.io/hostname" +``` + +Key Points: + +- Admin only creates TopologyDomain; Kueue Topology auto-generated + - Owner reference ensures Kueue Topology deleted with TopologyDomain + - Same name for both resources + - No manual coordination required + +**Implementation Note:** + +To avoid importing the entire Kueue package with all its dependencies, the operator will use Kubernetes unstructured API +to create and manage Kueue Topology CRDs. This approach is acceptable since the Kueue Topology CRD structure is simple ( +just a list of node label keys). + +**2. Deletion Protection** + +Prevents TopologyDomain deletion while PodCliqueSet resources reference it using Kubernetes finalizer. + +Deletion Workflow: + +1. Admin runs `kubectl delete topologydomain default` + 2. Kubernetes blocks deletion (finalizer `grove.run.ai/topology-protection` present) + 3. Controller reconciles: + - Detects deletion request (deletion timestamp set) + - Scans cluster for any PodCliqueSet resources + - If PodCliqueSet exists: Keeps finalizer, deletion blocked + - If no PodCliqueSet exists: Removes finalizer, deletion proceeds + 4. Once finalizer removed, Kubernetes deletes TopologyDomain + +Why Only Check PodCliqueSet: + +- Grove ownership hierarchy: PodCliqueSet owns PodCliqueScalingGroup and PodClique + - If no PodCliqueSet exists, other resources cannot exist + +Key Points: + +- Admin must delete all PodCliqueSet before deleting TopologyDomain + - Controller continuously reconciles + - Prevents orphaned workloads with invalid topology references + +#### Operator Configuration + +Operator references TopologyDomain by name via command-line argument: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: grove-operator +spec: + template: + spec: + containers: + - name: operator + args: + - --topology-domain-name=default # References TopologyDomain by name +``` + +Configuration: + +- `--topology-domain-name`: Specifies TopologyDomain resource name for validation + - Operator loads referenced TopologyDomain at startup + - All PodCliqueSet topology constraints validated against this TopologyDomain + +**Runtime Behavior:** + +When TopologyDomain is Missing or Deleted: + +- **Startup**: If `--topology-domain-name` is configured but TopologyDomain doesn't exist at startup, operator fails to + start + - Operator requires TopologyDomain to exist for auto-optimization (preferred constraints generation) + - This explicit failure prevents silent degradation of topology features + - Admin must create TopologyDomain or remove `--topology-domain-name` argument before operator starts + +- **During Runtime**: If TopologyDomain is deleted while operator is running: + - Finalizer prevents deletion while any PodCliqueSet resources exist + - If all PodCliqueSet resources are removed and TopologyDomain is deleted: + - Operator blocks creation of ALL new workloads (topology and non-topology) + - Admin must either create new TopologyDomain OR remove `--topology-domain-name` operator argument and restart + - This explicit behavior prevents implicit edge cases 
and ensures topology configuration consistency + +**Multiple Topologies:** + +- Multiple TopologyDomain resources can exist (e.g., "aws-topology", "on-prem-topology") + - Operator argument selects which one to use + - Enables different topology configurations per environment + +### 2. Operator API Changes (Grove CRDs) + +#### TopologyConstraint Model + +```go +type TopologyConstraint struct { +// PackDomain references a level name from TopologyDomain.Spec.Levels +// Defines required topology packing constraint for replicas +// Replicas packed together within specified topology level for network locality +PackDomain *string `json:"packDomain,omitempty"` +} +``` + +#### Fields Removed from Current API + +**From PodCliqueSetSpec:** + +- `ReplicaSpreadConstraints []corev1.TopologySpreadConstraint` - Removed (spread not supported) + +**From PodCliqueSetTemplateSpec:** + +- `SchedulingPolicyConfig *SchedulingPolicyConfig` - Removed (replaced by TopologyConstraint) + +**Types Removed:** + +- `SchedulingPolicyConfig` struct - Removed entirely + - `NetworkPackGroupConfig` struct - Removed entirely + +#### PodCliqueSet CRD Extensions + +```go +type PodCliqueSetTemplateSpec struct { +// ... existing fields ... + +// TopologyConstraint defines topology placement requirements for PodCliqueSet +// Immutable after resource creation +// +kubebuilder:validation:XValidation:rule="self==oldSelf",message="topology constraints are immutable" +// +optional +TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` +} +``` + +#### PodCliqueScalingGroup CRD Extensions + +```go +type PodCliqueScalingGroupConfig struct { +// ... existing fields ... + +// TopologyConstraint defines topology placement requirements for PodCliqueScalingGroup +// Must be equal to or stricter than parent PodCliqueSet constraints +// Immutable after resource creation +// +kubebuilder:validation:XValidation:rule="self==oldSelf",message="topology constraints are immutable" +// +optional +TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` +} +``` + +#### PodClique CRD Extensions + +```go +type PodCliqueTemplateSpec struct { +// ... existing fields ... + +// TopologyConstraint defines topology placement requirements for PodClique +// Must be equal to or stricter than parent resource constraints +// Immutable after resource creation +// +kubebuilder:validation:XValidation:rule="self==oldSelf",message="topology constraints are immutable" +// +optional +TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` +} +``` + +#### Validation Webhook + +The validation webhook ensures topology configuration consistency: + +**TopologyDomain Reference:** + +- TopologyDomain specified in operator's `--topology-domain-name` must exist + - Referenced PackDomain name must exist in TopologyDomain.Spec.Levels + - All validation performed against operator-configured TopologyDomain + +**Hierarchy Constraints:** + +- Child resource PackDomain must be equal to or stricter than parent + - PodCliqueSet → PodCliqueScalingGroup → PodClique hierarchy + - Stricter = higher index (narrower scope) in TopologyDomain.Spec.Levels + - Example: If parent uses "zone" (index 1), child can use "zone", "rack" (index 4), or "host" (index 5) + +**Immutability:** + +- All TopologyConstraint fields immutable after resource creation + - Domain hierarchy relationships cannot change after creation + +### 3. 
Scheduler API Changes (Contract with KAI) + +#### PodGang CRD Extensions + +The Grove Operator translates topology configuration into Grove Scheduler API format, which serves as the contract with +KAI scheduler. + +**PodGangSpec:** + +```go +type PodGangSpec struct { +// PodGroups is a list of member pod groups in the PodGang +PodGroups []PodGroup `json:"podgroups"` + +// TopologyRef references the Kueue Topology resource +// Points to Kueue Topology CRD auto-generated by TopologyDomain controller +// +optional +TopologyRef *NamespacedName `json:"topologyRef,omitempty"` + +// TopologyConstraint defines topology packing constraints for entire pod gang +// Translated from PodCliqueSet.TopologyConstraint +// +optional +TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` + +// NetworkPackGroupConfigs defines groups of PodGroups for network optimization +// Enhanced with topology constraints for PCSG-level packing +// +optional +NetworkPackGroupConfigs []NetworkPackGroupConfig `json:"networkPackGroupConfigs,omitempty"` + +// PriorityClassName is the name of the PriorityClass for the PodGang +PriorityClassName string `json:"priorityClassName,omitempty"` +} +``` + +**NetworkPackGroupConfig:** + +```go +// NetworkPackGroupConfig indicates PodGroups should be optimally placed w.r.t cluster's network topology +type NetworkPackGroupConfig struct { +// PodGroupNames is the list of PodGroup names in the network pack group +PodGroupNames []string `json:"podGroupNames"` + +// TopologyConstraint defines topology packing constraints for this group +// Enables PCSG-level topology constraints +// +optional +TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` +} +``` + +**PodGroup:** + +```go +type PodGroup struct { +// Name is the name of the PodGroup +Name string `json:"name"` + +// PodReferences is a list of references to the Pods in this group +PodReferences []NamespacedName `json:"podReferences"` + +// MinReplicas is the number of replicas that needs to be gang scheduled +MinReplicas int32 `json:"minReplicas"` + +// TopologyConstraint defines topology packing constraints for this PodGroup +// Enables PodClique-level topology constraints +// +optional +TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` +} +``` + +**Supporting Types:** + +```go +type NamespacedName struct { +Namespace string `json:"namespace,omitempty"` +Name string `json:"name"` +} + +type TopologyConstraint struct { +// Required defines topology constraint that must be satisfied +// Populated from user's packDomain specification in operator API +// +optional +Required *PackConstraint `json:"required,omitempty"` + +// Preferred defines best-effort topology constraint +// Auto-generated by operator using strictest level for optimization +// Scheduler can fallback to less strict levels if preferred cannot be satisfied +// +optional +Preferred *PackConstraint `json:"preferred,omitempty"` +} + +type PackConstraint struct { +// PackDomain references a level name from TopologyDomain.Spec.Levels +PackDomain string `json:"packDomain"` +} +``` + +**Changes Summary:** + +Fields Added: + +- `PodGangSpec.TopologyRef *NamespacedName` - References Kueue Topology CRD (optional pointer) + - `PodGangSpec.TopologyConstraint *TopologyConstraint` - PodGang-level packing from PodCliqueSet (optional pointer) + - `NetworkPackGroupConfig.TopologyConstraint *TopologyConstraint` - PCSG-level packing from PodCliqueScalingGroup ( + optional pointer) + - `PodGroup.TopologyConstraint 
*TopologyConstraint` - PodClique-level packing from PodClique (optional pointer) + +Fields Removed: + +- `PodGangSpec.SpreadConstraints` - Not implemented; spread will be part of TopologyConstraint in future + +**Note:** All TopologyConstraint fields are pointers with omitempty, allowing workloads without topology constraints. + +#### Translation Logic + +The operator translates Grove operator API to Grove Scheduler API with three-level topology constraint hierarchy: + +**TopologyRef Population:** + +- Set to Kueue Topology resource name (matches TopologyDomain name from operator config) + - Example: operator config `--topology-domain-name=default` → `TopologyRef.Name="default"` + - KAI scheduler uses this to locate the Kueue Topology CRD + +**Constraint Translation (Required and Preferred):** + +The operator translates user's simple PackDomain into rich required/preferred structure in scheduler API: + +**Required Constraints:** + +- If user specifies `packDomain: "rack"` → becomes `TopologyConstraint.Required.PackDomain = "rack"` + - If user doesn't specify packDomain → `Required` is nil + - Applied at the appropriate level (PodGang, NetworkPackGroup, or PodGroup) + +**Preferred Constraints (Auto-Generated):** + +- Operator ALWAYS generates preferred constraint at all three levels + - Uses strictest/lowest level from TopologyDomain.Spec.Levels (e.g., "host") + - Enables out-of-box optimization even without user configuration + - Scheduler can fallback to less strict levels if preferred cannot be satisfied + +**Three-Level Translation:** + +1. **PodGang Level** (from PodCliqueSet): + - `PodGangSpec.TopologyConstraint.Required` ← user's `PodCliqueSet.TopologyConstraint.PackDomain` (if set) + - `PodGangSpec.TopologyConstraint.Preferred` ← auto-generated strictest level (e.g., "host") + + 2. **NetworkPackGroup Level** (from PodCliqueScalingGroup): + - For each PCSG with TopologyConstraint, create NetworkPackGroupConfig + - `NetworkPackGroupConfig.TopologyConstraint.Required` ← user's `PCSG.TopologyConstraint.PackDomain` (if set) + - `NetworkPackGroupConfig.TopologyConstraint.Preferred` ← auto-generated strictest level + + 3. **PodGroup Level** (from PodClique): + - `PodGroup.TopologyConstraint.Required` ← user's `PodClique.TopologyConstraint.PackDomain` (if set) + - `PodGroup.TopologyConstraint.Preferred` ← auto-generated strictest level + +**Example Translation:** + +User creates PodCliqueSet: + +```yaml +spec: + template: + topologyConstraint: + packDomain: "rack" # User specifies required constraint +``` + +Operator translates to PodGang: + +```yaml +spec: + topologyConstraint: + required: + packDomain: "rack" # From user + preferred: + packDomain: "host" # Auto-generated by operator +``` + +**Hierarchy Validation:** + +- Child required constraints must be equal or stricter than parent required constraints + - Preferred constraints always use strictest level at all levels + - PodGang > NetworkPackGroup > PodGroup hierarchy maintained + +## Component Architecture + +### Operator to Scheduler API Flow + +When a PodCliqueSet is created or updated, the Grove Operator translates it into Grove Scheduler API (PodGang CRD): + +**Step-by-Step Translation:** + +1. **PodCliqueSet Created/Updated**: + - User creates PodCliqueSet with optional `topologyConstraint.packDomain` + - Validation webhook validates against TopologyDomain + + 2. 
**Operator Reconciles PodCliqueSet**: + - Operator detects PodCliqueSet creation/update + - Loads TopologyDomain specified in operator config (`--topology-domain-name`) + - Prepares PodGang resource creation/update + + 3. **Build PodGang TopologyConstraint**: + - **Required**: From user's `PodCliqueSet.topologyConstraint.packDomain` (if specified) + - **Preferred**: Auto-generated using strictest/lowest level from TopologyDomain.Spec.Levels (e.g., "host") + - Populates `PodGangSpec.TopologyConstraint` + + 4. **Build NetworkPackGroupConfigs**: + - For each PodCliqueScalingGroup with TopologyConstraint in PodCliqueSet + - Create NetworkPackGroupConfig entry with PodGroupNames from that PCSG + - **Required**: From `PCSG.topologyConstraint.packDomain` (if specified) + - **Preferred**: Auto-generated strictest level + - Populates `PodGangSpec.NetworkPackGroupConfigs` + + 5. **Build PodGroups with TopologyConstraint**: + - For each PodClique in PodCliqueSet, create corresponding PodGroup + - **Required**: From `PodClique.topologyConstraint.packDomain` (if specified) + - **Preferred**: Auto-generated strictest level + - Populates `PodGroup.TopologyConstraint` for each PodGroup + + 6. **Set TopologyRef**: + - References Kueue Topology by name (matches TopologyDomain name from operator config) + - Example: `--topology-domain-name=default` → `TopologyRef.Name="default"` + - KAI scheduler uses this to locate the Kueue Topology CRD + + 7. **Create/Update PodGang in Scheduler API**: + - Operator calls Grove Scheduler API to create/update PodGang + - PodGang now has complete topology information at three levels + - KAI scheduler consumes PodGang and applies topology-aware scheduling + +**Key Points:** + +- Operator reconciliation performs translation + - Preferred constraints auto-generated at reconciliation time for out-of-box optimization + - Three-level hierarchy maintained: PodGang > NetworkPackGroup > PodGroup + - TopologyRef connects PodGang to KAI scheduler's required Kueue Topology + - All levels get both required (user-specified) and preferred (auto-generated) constraints + +### Topology-Aware Scheduling Flow + +High-level end-to-end flow: + +1. **Admin Setup**: Create TopologyDomain, configure operator + 2. **User Creates Workload**: PodCliqueSet with optional topology constraints + 3. **Validation**: Webhooks validate against TopologyDomain + 4. **Translation**: Operator builds PodGang with three-level constraints + 5. **Scheduling**: KAI scheduler applies topology constraints with fallback + +### Sequence Diagram + +``` +┌──────────────┐ ┌──────────────────┐ ┌─────────────────┐ ┌─────────────────┐ +│ PodCliqueSet │ │ Grove Operator │ │ Grove Scheduler │ │ Scheduler │ +│ │ │ │ │ API │ │ │ +└──────┬───────┘ └─────────┬────────┘ └────────┬────────┘ └────────┬────────┘ + │ │ │ │ + │ CREATE/UPDATE │ │ │ + ├─────────────────────▶│ │ │ + │ │ │ │ + │ │ 1. Mutation webhook │ │ + │ │ auto-populates │ │ + │ │ │ │ + │ │ 2. Validation webhook│ │ + │ │ validates against │ │ + │ │ TopologyDomain │ │ + │ │ │ │ + │ │ 3. 
Translate to │ │ + │ │ PodGang spec │ │ + │ │ │ │ + │ │ CREATE/UPDATE PodGang│ │ + │ ├─────────────────────▶│ │ + │ │ │ │ + │ │ │ SCHEDULE Pods │ + │ │ ├─────────────────────▶│ + │ │ │ │ + │ │ │ │ Apply topology + │ │ │ │ using Kueue + │ │ │ │ Topology CRD + │ │ │ │ +``` + +## Implementation Notes + +### Edge Cases + +**Case 1: PodCliqueSet Created Before TopologyDomain** + +- If TopologyConstraint not set: Creation allowed (non-topology workload) + - If TopologyConstraint.PackDomain set: Creation rejected (cannot validate without TopologyDomain) + +**Case 2: TopologyDomain Created After PodCliqueSet** + +- Existing workloads not affected (continue without topology) + - New workloads created after TopologyDomain can use topology constraints + - Topology constraints immutable at workload creation time + +**Design Principle:** Topology constraints established at creation remain immutable. Adding/removing TopologyDomain does +not retroactively affect existing workloads. + +### Resolved Design Questions + +This section documents key design decisions and their resolutions. + +**Q: How will cluster admins map Grove topology constants to physical topology labels?** + +**A: Resolved** - The `TopologyDomain` CRD provides the mapping mechanism. Admins create a TopologyDomain resource with +an ordered list of levels, where each level maps a friendly name (e.g., "rack") to a node label key (e.g., " +topology.kubernetes.io/rack"). This provides a clean, declarative API for topology configuration. + +**Q: Should we allow changes to cluster topology levels and mappings after creation?** + +**A: Resolved - No (Immutable)** - TopologyDomain and all TopologyConstraint fields are immutable after creation. This +prevents unpredictable behavior with in-flight workloads and maintains scheduling consistency. To change topology +configuration: + +1. Create a new TopologyDomain with updated configuration +2. Update operator's `--topology-domain-name` argument to reference new TopologyDomain +3. Drain or migrate existing workloads +4. Delete old TopologyDomain after all workloads are migrated + +**Q: If topology constraints cannot be satisfied, should workloads remain pending or schedule anyway?** + +**A: Resolved - Remain Pending** - For gang-scheduled workloads with topology constraints: + +- **Required Constraints** (user-specified `packDomain`): Must be satisfied; entire gang remains pending if unsatisfied + - **Preferred Constraints** (auto-generated): Best-effort optimization; scheduler can fall back to less strict + levels + - This behavior ensures workload integrity for tightly-coupled distributed inference workloads where partial + scheduling is ineffective + - Users relying on strict placement should use required constraints; users wanting flexibility should rely on + preferred constraints + +**Q: How will domain-level packing be realized in KAI scheduler?** + +**A: Contract Defined** - The `PodGang` CRD serves as the API contract between Grove operator and KAI scheduler. +Expected scheduler behavior: + +1. **Topology Resolution**: Scheduler reads `PodGang.spec.topologyRef` to locate Kueue Topology CRD +2. **Constraint Processing**: For each topology constraint (PodGang, NetworkPackGroup, PodGroup level): + - Process `required` constraints first (must satisfy) + - Apply `preferred` constraints as optimization hints (best-effort) +3. 
**Domain Filtering**: Filter cluster nodes to find topology domains (e.g., single rack, single host) that satisfy: + - Resource requests for all pods in the constraint scope + - Required topology level specified in constraint +4. **Placement**: Schedule all pods in the constraint scope within the chosen topology domain +5. **Fallback**: For preferred constraints, fall back to less strict topology levels if preferred level cannot be + satisfied +6. **Gang Semantics**: If required constraints cannot be satisfied, entire gang remains unscheduled (all-or-nothing) + +This contract ensures Grove workloads receive topology-aware placement while maintaining scheduler independence. + +## Security and RBAC + +The topology system requires careful RBAC configuration to ensure proper separation of concerns between cluster +administrators and the operator. + +### ClusterRole: Grove Operator + +The Grove operator requires read access to TopologyDomain and full management of Kueue Topology: + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: grove-operator-topology +rules: + - apiGroups: [ "grove.run.ai" ] + resources: [ "topologydomains" ] + verbs: [ "get", "list", "watch" ] + - apiGroups: [ "kueue.x-k8s.io" ] + resources: [ "topologies" ] + verbs: [ "create", "delete", "get", "list", "watch", "update", "patch" ] +``` + +**Key Points:** + +- Topology configuration is a highly privileged operation restricted to cluster administrators + - Operator has read-only access to TopologyDomain to validate user workloads + - Operator manages Kueue Topology lifecycle automatically + - Users create PodCliqueSet with standard namespace-scoped permissions From 2c4bdd31f100cf44e05a63bff896b01570f9298c Mon Sep 17 00:00:00 2001 From: Ron Kahn Date: Tue, 21 Oct 2025 14:28:14 +0300 Subject: [PATCH 02/15] fix some structs issues Signed-off-by: Ron Kahn --- docs/designs/topology.md | 117 +++++++++++++++++++++------------------ 1 file changed, 62 insertions(+), 55 deletions(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index 4b4b0070..24d73023 100644 --- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -546,7 +546,7 @@ PodGroups []PodGroup `json:"podgroups"` // TopologyRef references the Kueue Topology resource // Points to Kueue Topology CRD auto-generated by TopologyDomain controller // +optional -TopologyRef *NamespacedName `json:"topologyRef,omitempty"` +TopologyRef *string `json:"topologyRef,omitempty"` // TopologyConstraint defines topology packing constraints for entire pod gang // Translated from PodCliqueSet.TopologyConstraint @@ -601,11 +601,6 @@ TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` **Supporting Types:** ```go -type NamespacedName struct { -Namespace string `json:"namespace,omitempty"` -Name string `json:"name"` -} - type TopologyConstraint struct { // Required defines topology constraint that must be satisfied // Populated from user's packDomain specification in operator API @@ -629,7 +624,7 @@ PackDomain string `json:"packDomain"` Fields Added: -- `PodGangSpec.TopologyRef *NamespacedName` - References Kueue Topology CRD (optional pointer) +- `PodGangSpec.TopologyRef *string` - References Kueue Topology CRD (optional pointer) - `PodGangSpec.TopologyConstraint *TopologyConstraint` - PodGang-level packing from PodCliqueSet (optional pointer) - `NetworkPackGroupConfig.TopologyConstraint *TopologyConstraint` - PCSG-level packing from PodCliqueScalingGroup ( optional pointer) @@ -777,53 +772,72 @@ High-level 
end-to-end flow: ### Sequence Diagram ``` -┌──────────────┐ ┌──────────────────┐ ┌─────────────────┐ ┌─────────────────┐ -│ PodCliqueSet │ │ Grove Operator │ │ Grove Scheduler │ │ Scheduler │ -│ │ │ │ │ API │ │ │ -└──────┬───────┘ └─────────┬────────┘ └────────┬────────┘ └────────┬────────┘ - │ │ │ │ - │ CREATE/UPDATE │ │ │ - ├─────────────────────▶│ │ │ - │ │ │ │ - │ │ 1. Mutation webhook │ │ - │ │ auto-populates │ │ - │ │ │ │ - │ │ 2. Validation webhook│ │ - │ │ validates against │ │ - │ │ TopologyDomain │ │ - │ │ │ │ - │ │ 3. Translate to │ │ - │ │ PodGang spec │ │ - │ │ │ │ - │ │ CREATE/UPDATE PodGang│ │ - │ ├─────────────────────▶│ │ - │ │ │ │ - │ │ │ SCHEDULE Pods │ - │ │ ├─────────────────────▶│ - │ │ │ │ - │ │ │ │ Apply topology - │ │ │ │ using Kueue - │ │ │ │ Topology CRD - │ │ │ │ +┌──────────────┐ ┌──────────────────┐ ┌─────────────────┐ ┌─────────────────┐ +│ PodCliqueSet │ │ Grove Operator │ │ Grove Scheduler │ │ Scheduler │ +│ │ │ │ │ API │ │ │ +└──────┬───────┘ └─────────┬────────┘ └────────┬────────┘ └────────┬────────┘ + │ │ │ │ + │ CREATE/UPDATE │ │ │ + ├─────────────────────▶│ │ │ + │ │ │ │ + │ │ 1. Validation webhook │ │ + │ │ validates against │ │ + │ │ TopologyDomain │ │ + │ │ │ │ + │ │ 2. Translate to │ │ + │ │ PodGang(s) spec │ │ + │ │ │ │ + │ │ CREATE/UPDATE PodGangs│ │ + │ ├─────────────────────▶ │ │ + │ │ │ │ + │ │ │ SCHEDULE Pods │ + │ │ ├─────────────────────▶│ + │ │ │ │ + │ │ │ │ Apply topology + │ │ │ │ using Kueue + │ │ │ │ Topology CRD + │ │ │ │ ``` ## Implementation Notes ### Edge Cases -**Case 1: PodCliqueSet Created Before TopologyDomain** +**Case 1: TopologyDomain Not Configured** + +- If `--topology-domain-name` argument not provided to operator: topology features completely disabled +- PodCliqueSet workloads without `packDomain` function normally +- PodCliqueSet workloads with `packDomain` specified: validation webhook rejects creation (cannot validate without + TopologyDomain) +- No auto-optimization (preferred constraints) applied + +**Case 2: TopologyDomain Configured but Missing at Startup** -- If TopologyConstraint not set: Creation allowed (non-topology workload) - - If TopologyConstraint.PackDomain set: Creation rejected (cannot validate without TopologyDomain) +- If `--topology-domain-name` argument provided but TopologyDomain resource doesn't exist: operator fails to start +- Operator requires TopologyDomain to exist for auto-optimization +- Admin must either: + - Create the referenced TopologyDomain resource, OR + - Remove `--topology-domain-name` argument from operator configuration -**Case 2: TopologyDomain Created After PodCliqueSet** +**Case 3: TopologyDomain Deleted During Runtime** -- Existing workloads not affected (continue without topology) - - New workloads created after TopologyDomain can use topology constraints - - Topology constraints immutable at workload creation time +- Finalizer prevents deletion while any PodCliqueSet resources exist +- If TopologyDomain deleted after all PodCliqueSet resources removed: + - Operator blocks creation of ALL new workloads (topology and non-topology) + - Existing workloads continue to function (already scheduled) +- Admin must either: + - Create new TopologyDomain resource with same name, OR + - Remove `--topology-domain-name` argument and restart operator -**Design Principle:** Topology constraints established at creation remain immutable. Adding/removing TopologyDomain does -not retroactively affect existing workloads. 
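+To make the finalizer handling described in Case 3 concrete, here is a minimal sketch of the deletion-protection
+check, assuming a controller-runtime based reconciler. The `TopologyDomainReconciler` type, the Grove API import
+path, and the requeue interval are illustrative placeholders, not the actual implementation:
+
+```go
+package controller
+
+import (
+	"context"
+	"time"
+
+	grovev1alpha1 "github.com/NVIDIA/grove/operator/api/core/v1alpha1" // illustrative import path
+	ctrl "sigs.k8s.io/controller-runtime"
+	"sigs.k8s.io/controller-runtime/pkg/client"
+	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
+)
+
+const topologyProtectionFinalizer = "grove.run.ai/topology-protection"
+
+// TopologyDomainReconciler is a placeholder for the real reconciler type.
+type TopologyDomainReconciler struct {
+	client.Client
+}
+
+// reconcileDelete keeps the finalizer (blocking deletion) while any
+// PodCliqueSet exists, and releases it once the cluster has none.
+func (r *TopologyDomainReconciler) reconcileDelete(ctx context.Context, td *grovev1alpha1.TopologyDomain) (ctrl.Result, error) {
+	if !controllerutil.ContainsFinalizer(td, topologyProtectionFinalizer) {
+		return ctrl.Result{}, nil // finalizer already gone; deletion proceeds
+	}
+	// PodCliqueSet owns PodCliqueScalingGroup and PodClique, so listing
+	// PodCliqueSets alone is enough to detect remaining Grove workloads.
+	var sets grovev1alpha1.PodCliqueSetList
+	if err := r.List(ctx, &sets); err != nil {
+		return ctrl.Result{}, err
+	}
+	if len(sets.Items) > 0 {
+		// Workloads still exist: keep the finalizer and re-check later.
+		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
+	}
+	controllerutil.RemoveFinalizer(td, topologyProtectionFinalizer)
+	return ctrl.Result{}, r.Update(ctx, td)
+}
+```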
+**Case 4: Topology Features Enabled/Disabled**
+
+- **Enabled**: When `--topology-domain-name` provided and TopologyDomain exists
+  - Auto-optimization active for all workloads (preferred constraints generated)
+  - User-specified `packDomain` validated and enforced as required constraints
+- **Disabled**: When `--topology-domain-name` argument not provided
+  - Topology constraints in workload CRDs ignored during scheduling
+  - Workloads schedule without topology awareness
+- **Toggling**: Cannot enable/disable during runtime - requires operator restart with updated configuration
 
 ### Resolved Design Questions
 
@@ -831,13 +845,13 @@ This section documents key design decisions and their resolutions.
 
 **Q: How will cluster admins map Grove topology constants to physical topology labels?**
 
-**A: Resolved** - The `TopologyDomain` CRD provides the mapping mechanism. Admins create a TopologyDomain resource with
+**A:** The `TopologyDomain` CRD provides the mapping mechanism. Admins create a TopologyDomain resource with
 an ordered list of levels, where each level maps a friendly name (e.g., "rack") to a node label key (e.g.,
 "topology.kubernetes.io/rack"). This provides a clean, declarative API for topology configuration.
 
 **Q: Should we allow changes to cluster topology levels and mappings after creation?**
 
-**A: Resolved - No (Immutable)** - TopologyDomain and all TopologyConstraint fields are immutable after creation. This
+**A: No (Immutable)** - TopologyDomain and all TopologyConstraint fields are immutable after creation. This
 prevents unpredictable behavior with in-flight workloads and maintains scheduling consistency. To change topology
 configuration:
 
1. Create a new TopologyDomain with updated configuration
2. Update operator's `--topology-domain-name` argument to reference new TopologyDomain
3. Drain or migrate existing workloads
4. Delete old TopologyDomain after all workloads are migrated
 
 **Q: If topology constraints cannot be satisfied, should workloads remain pending or schedule anyway?**
 
-**A: Resolved - Remain Pending** - For gang-scheduled workloads with topology constraints:
+**A: Remain Pending** - For gang-scheduled workloads with topology constraints:
 
 - **Required Constraints** (user-specified `packDomain`): Must be satisfied; entire gang remains pending if unsatisfied
   - **Preferred Constraints** (auto-generated): Best-effort optimization; scheduler can fall back to less strict
     levels
   - This behavior ensures workload integrity for tightly-coupled distributed inference workloads where partial
     scheduling is ineffective
   - Users relying on strict placement should use required constraints; users wanting flexibility should rely on
     preferred constraints
 
 **Q: How will domain-level packing be realized in KAI scheduler?**
 
 **A: Contract Defined** - The `PodGang` CRD serves as the API contract between Grove operator and KAI scheduler.
 Expected scheduler behavior:
 
-1. **Topology Resolution**: Scheduler reads `PodGang.spec.topologyRef` to locate Kueue Topology CRD
+1. **Topology Resolution**: KAI Pod Grouper reads `PodGang.spec.topologyRef` to locate Kueue Topology CRD
2. 
**Constraint Processing**: For each topology constraint (PodGang, NetworkPackGroup, PodGroup level): - Process `required` constraints first (must satisfy) - Apply `preferred` constraints as optimization hints (best-effort) @@ -899,10 +913,3 @@ rules: resources: [ "topologies" ] verbs: [ "create", "delete", "get", "list", "watch", "update", "patch" ] ``` - -**Key Points:** - -- Topology configuration is a highly privileged operation restricted to cluster administrators - - Operator has read-only access to TopologyDomain to validate user workloads - - Operator manages Kueue Topology lifecycle automatically - - Users create PodCliqueSet with standard namespace-scoped permissions From d3c6f6b58cd9bc980c78e4f73c9fcf64afb558fc Mon Sep 17 00:00:00 2001 From: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com> Date: Tue, 21 Oct 2025 20:46:25 +0300 Subject: [PATCH 03/15] Update docs/designs/topology.md typo fix Co-authored-by: Roman Baron <91824211+romanbaron@users.noreply.github.com> Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com> --- docs/designs/topology.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index 24d73023..d2ef5ec2 100644 --- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -2,7 +2,7 @@ ## Overview -This document defines the design for implementing topology-aware scheduling in the Grove operator.. +This document defines the design for implementing topology-aware scheduling in the Grove operator. **Motivation**: Topology-aware scheduling is critical for Grove's multinode inference workloads because these applications require: From f7a0f2f67f5d77b8fdf4463476332ac0626baf7c Mon Sep 17 00:00:00 2001 From: Ron Kahn Date: Tue, 21 Oct 2025 20:56:21 +0300 Subject: [PATCH 04/15] docs: improve network locality description in topology design Signed-off-by: Ron Kahn --- docs/designs/topology.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index d2ef5ec2..084323dd 100644 --- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -7,10 +7,11 @@ This document defines the design for implementing topology-aware scheduling in t **Motivation**: Topology-aware scheduling is critical for Grove's multinode inference workloads because these applications require: -- **Network Locality**: High-bandwidth communication between prefill and decode workers benefits from proximity +- **Network Locality**: Proximity improves high-bandwidth communication between leaders and their respective workers ( + prefill and decode, etc) - **Coordinated Placement**: Related components (e.g., model shards) perform better when co-located within the same topology domain - - **Latency Optimization**: Minimizing network hops between interdependent inference components improves end-to-end + - **Latency Optimization**: Minimizing network hops between interdependent inference components improves end-to-ends performance **Design Approach**: This design introduces a flexible topology system with three main components: From 8b7055beeb320271a232ddf8a83de8b233af75ac Mon Sep 17 00:00:00 2001 From: Ron Kahn Date: Wed, 22 Oct 2025 18:04:21 +0300 Subject: [PATCH 05/15] CR Fixes Signed-off-by: Ron Kahn --- docs/designs/topology.md | 361 ++++++++++++++++++++------------------- 1 file changed, 182 insertions(+), 179 deletions(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index 084323dd..9b362b3b 100644 --- a/docs/designs/topology.md 
+++ b/docs/designs/topology.md @@ -2,46 +2,48 @@ ## Overview -This document defines the design for implementing topology-aware scheduling in the Grove operator. +This document defines the design for supporting topology-aware scheduling in the Grove operator. **Motivation**: Topology-aware scheduling is critical for Grove's multinode inference workloads because these applications require: - **Network Locality**: Proximity improves high-bandwidth communication between leaders and their respective workers ( - prefill and decode, etc) - - **Coordinated Placement**: Related components (e.g., model shards) perform better when co-located within the same - topology domain - - **Latency Optimization**: Minimizing network hops between interdependent inference components improves end-to-ends - performance + prefill and decode, etc.) +- **Coordinated Placement**: Related components (e.g., model shards) perform better when co-located within the same + topology domain +- **Latency Optimization**: Minimizing network hops between interdependent inference components improves end-to-end + performance **Design Approach**: This design introduces a flexible topology system with three main components: -1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names to node labels - 2. **Operator Configuration**: Selects active topology via `--topology-domain-name` argument - 3. **TopologyConstraint**: User-specified packing requirements in workloads (PodCliqueSet, PodCliqueScalingGroup, - PodClique) +1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names (e.g., rack, zone, host) + to node labels +2. **Operator Configuration**: Selects active topology via `--topology-domain-name` argument +3. **TopologyConstraint**: User-specified packing requirements in workloads (PodCliqueSet, PodCliqueScalingGroup, + PodClique) -**Key Feature**: Grove provides automatic out-of-box topology optimization by generating preferred packing constraints -at all levels, even without user configuration. Users can optionally specify required constraints for strict placement -requirements. +**Key Feature**: Grove attempts automatic out-of-box topology optimization by generating preferred (best-effort) packing +constraints at all levels, even without user configuration. This opportunistic packing may improve performance when +cluster resources allow, but users should specify required constraints when strict placement is critical for their +workload. 
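+
+As a sketch of this out-of-box behavior (names are illustrative; the PodGang constraint layout is defined in the
+Scheduler API section of this document), a PodCliqueSet created without any `topologyConstraint` still yields a
+PodGang carrying an auto-generated preferred constraint, assuming the active TopologyDomain's narrowest level is
+`host`:
+
+```yaml
+# User workload: no topologyConstraint specified anywhere
+apiVersion: grove.run.ai/v1alpha1
+kind: PodCliqueSet
+metadata:
+  name: inference-example        # illustrative name
+spec:
+  template:
+    # ... cliques and other fields elided ...
+    # topologyConstraint: not set
+---
+# PodGang emitted by the operator (relevant fragment):
+spec:
+  topologyConstraint:
+    # required is omitted because the user set no packDomain
+    preferred:
+      packDomain: "host"         # auto-generated strictest level
+```
+
+Because the constraint is only preferred, the scheduler may fall back to broader levels (e.g., rack or zone) when
+host-level packing cannot be satisfied, so the default optimization never blocks scheduling on its own.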
## Goals - Provide flexible, cluster-agnostic topology hierarchy definition via TopologyDomain CRD - - Enable packing constraints for network locality across all Grove scalable resources - - Support multiple topology configurations for different environments - - Automatic Kueue Topology generation for KAI scheduler integration - - Immutable topology configuration ensuring scheduling consistency - - Hierarchical constraint validation (child stricter than parent) +- Enable packing constraints for network locality across all Grove scalable resources +- Support multiple topology configurations for different environments +- Automatic Kueue Topology generation for KAI scheduler integration +- Immutable topology configuration ensuring scheduling consistency +- Hierarchical constraint validation (child stricter than parent) ## Non-Goals - Spread constraints across topology domains (ReplicaSpreadDomain) - - Root domain constraints for entire resource (RootDomain) - - Ratio-based affinity groups between scaling groups (AffinityGroups with PackRatio) - - Dynamic topology reconfiguration after creation - - Per-workload topology domain selection - - Automatic topology inference from workload characteristics +- Root domain constraints for entire resource (RootDomain) +- Ratio-based affinity groups between scaling groups (AffinityGroups with PackRatio) +- Dynamic topology reconfiguration after creation +- Per-workload topology domain selection +- Automatic topology inference from workload characteristics ## Proposal @@ -51,18 +53,19 @@ Grove implements topology-aware scheduling through three key components: 1. **TopologyDomain CRD**: Cluster-scoped resource defining topology hierarchy - Admin creates TopologyDomain with ordered list of topology levels - - Each level maps friendly name (e.g., "rack") to node label key (e.g., "topology.kubernetes.io/rack") + - Each level maps friendly name (e.g., "rack", "zone", "host") to node label key (e.g., " + topology.kubernetes.io/rack") - Multiple TopologyDomains supported for different environments - 2. **Operator Configuration**: References TopologyDomain by name - - Operator argument `--topology-domain-name=default` selects which TopologyDomain to use - - All workload validation performed against configured TopologyDomain - - Enables switching between topologies without changing workloads +2. **Operator Configuration**: References TopologyDomain by name + - Operator argument `--topology-domain-name=default` selects which TopologyDomain to use + - All workload validation performed against configured TopologyDomain + - Enables switching between topologies without changing workloads - 3. **Workload API (TopologyConstraint)**: Users specify packing requirements - - PodCliqueSet, PodCliqueScalingGroup, and PodClique each have TopologyConstraint field - - Users reference level names from TopologyDomain (e.g., `packDomain: "rack"`) - - No direct TopologyDomain reference needed in workloads +3. 
**Workload API (TopologyConstraint)**: Users specify packing requirements + - PodCliqueSet, PodCliqueScalingGroup, and PodClique each have TopologyConstraint field + - Users reference level names from TopologyDomain (e.g., `packDomain: "rack"`) + - No direct TopologyDomain reference needed in workloads ### Component Interactions @@ -83,21 +86,21 @@ PodCliqueSet ────────┘ **Out-of-Box Optimization:** - Operator automatically generates **preferred** constraints using strictest topology level (e.g., "host") - - Applied at all three levels (PodGang, NetworkPackGroup, PodGroup) during translation to scheduler API - - Users get optimal packing without configuration +- Applied at all three levels (PodGang, NetworkPackGroup, PodGroup) during translation to scheduler API +- Users get optimal packing without configuration **User Control:** - Users can specify **required** constraints via `packDomain` for strict placement requirements - - Required constraints validated and must be satisfied - - Preferred constraints enable best-effort optimization with graceful fallback +- Required constraints validated and must be satisfied +- Preferred constraints enable best-effort optimization with graceful fallback ### Controller Responsibilities The TopologyDomain controller manages: - **Kueue Topology Generation**: Auto-creates Kueue Topology CRD for KAI scheduler integration - - **Deletion Protection**: Prevents deletion while PodCliqueSet resources reference it +- **Deletion Protection**: Prevents deletion while PodCliqueSet resources reference it ## Out of Scope @@ -105,14 +108,12 @@ The following features are explicitly out of scope for this design: - **Spread Constraints**: ReplicaSpreadDomain for distributing replicas across domains for fault tolerance is not supported - - **Advanced Topology Constraints Per Replica**: RootDomain for constraining entire resource (all replicas) within a - topology domain is not supported - - **Ratio Grouping Between Groups**: AffinityGroups with PackRatio for complex workload patterns (e.g., 2 Prefill + - 1 - Decode ratios) is not supported - - **Workload-Based Auto Constraints**: Automatic constraint generation based on workload characteristics, patterns, - and - inference requirements +- **Advanced Topology Constraints Per Replica**: RootDomain for constraining entire resource (all replicas) within a + topology domain is not supported +- **Ratio Grouping Between Groups**: AffinityGroups with PackRatio for complex workload patterns (e.g., 2 Prefill + 1 + Decode ratios) is not supported +- **Workload-Based Auto Constraints**: Automatic constraint generation based on workload characteristics, patterns, and + inference requirements ## Design Details @@ -126,7 +127,7 @@ The following features are explicitly out of scope for this design: │ Admin Layer: │ │ ┌──────────────────┐ ┌────────────────────┐ │ │ │ TopologyDomain │─────────────▶│ TopologyDomain │ │ -│ │ CRD │ │ Controller │ │ +│ │ CR │ │ Controller │ │ │ │ (levels list) │ └─────────┬──────────┘ │ │ └──────────────────┘ │ │ │ │ │ │ @@ -163,17 +164,17 @@ The following features are explicitly out of scope for this design: ### 1. TopologyDomain Infrastructure -#### TopologyDomain CRD +#### TopologyDomain CR -TopologyDomain is a cluster-scoped CRD that defines the topology hierarchy for scheduling. It maps friendly level names +TopologyDomain is a cluster-scoped CR that defines the topology hierarchy for scheduling. 
It maps friendly level names to Kubernetes node labels and establishes ordering from broadest to narrowest scope. **Characteristics:** - **Cluster-scoped**: Multiple TopologyDomains can exist - - **Operator-selected**: Operator references one by name via `--topology-domain-name` argument - - **Immutable**: Once created, cannot be modified - - **List-ordered hierarchy**: Index 0 = broadest (e.g., region), last = narrowest (e.g., host) +- **Operator-selected**: Operator references one by name via `--topology-domain-name` argument +- **Immutable**: Once created, cannot be modified +- **List-ordered hierarchy**: Index 0 = broadest (e.g., region), last = narrowest (e.g., host) **API Structure:** @@ -261,44 +262,43 @@ spec: Steps: 1. Install Grove: `helm install grove` - 2. Customize example above with your cluster's actual `topologyKey` values - 3. Create resource: `kubectl apply -f topologydomain.yaml` - 4. Configure operator with `--topology-domain-name` matching the resource name - 5. Create workloads with topology constraints +2. Customize example above with your cluster's actual `topologyKey` values +3. Create resource: `kubectl apply -f topologydomain.yaml` +4. Configure operator with `--topology-domain-name` matching the resource name +5. Create workloads with topology constraints Notes: - TopologyDomain becomes immutable after creation - - Multiple TopologyDomains can exist; operator uses the one specified in its argument - - Ensure node labels exist on cluster nodes before creating workloads - - List order defines hierarchy: index 0 = broadest, last = narrowest - - Example hierarchy: `region` (0) > `zone` (1) > `datacenter` (2) > `block` (3) > `rack` (4) > `host` (5) > `numa` ( - 6) +- Multiple TopologyDomains can exist; operator uses the one specified in its argument +- Ensure node labels exist on cluster nodes before creating workloads +- List order defines hierarchy: index 0 = broadest, last = narrowest +- Example hierarchy: `region` (0) > `zone` (1) > `datacenter` (2) > `block` (3) > `rack` (4) > `host` (5) > `numa` (6) **Validation:** CRD-Level: - At least one level required (minimum 1, maximum 10) - - Level `name` required (max 63 chars) - - Level `topologyKey` required (max 316 chars) - - Level `description` optional (max 1024 chars) - - Entire levels list immutable after creation +- Level `name` required (max 63 chars) +- Level `topologyKey` required (max 316 chars) +- Level `description` optional (max 1024 chars) +- Entire levels list immutable after creation Webhook: - Each level `name` must be unique within the `levels` array of a single TopologyDomain - - Each `topologyKey` must be unique within the `levels` array of a single TopologyDomain - - Cannot modify any field after creation - - Deletion protection via controller finalizer +- Each `topologyKey` must be unique within the `levels` array of a single TopologyDomain +- Cannot modify any field after creation +- Deletion protection via controller finalizer **Node Label Responsibility:** - Cluster administrators are responsible for ensuring that node labels specified in `topologyKey` fields exist on cluster nodes - - TopologyDomain creation succeeds even if labels don't exist yet (allows pre-configuration) - - Workloads may fail to schedule if referenced topology labels are missing from nodes - - Administrators should verify node labels match TopologyDomain configuration before creating workloads +- TopologyDomain creation succeeds even if labels don't exist yet (allows pre-configuration) +- Workloads may fail to schedule 
if referenced topology labels are missing from nodes +- Administrators should verify node labels match TopologyDomain configuration before creating workloads #### TopologyDomain Controller @@ -315,19 +315,19 @@ Grove uses its own TopologyDomain CRD for user-friendly admin/user API, but KAI Topology CRD format for actual scheduling operations. The TopologyDomain controller bridges this gap by: - Reading Grove's TopologyDomain (user-friendly with level names like "rack", "zone") - - Automatically generating Kueue Topology (KAI scheduler's required format with node labels only) - - Maintaining consistency between both representations - - Eliminating manual coordination for admins +- Automatically generating Kueue Topology (KAI scheduler's required format with node labels only) +- Maintaining consistency between both representations +- Eliminating manual coordination for admins This separation allows Grove to provide better UX while maintaining compatibility with KAI scheduler requirements. Generation Process: 1. Controller watches TopologyDomain specified in operator argument - 2. When TopologyDomain created, controller creates matching Kueue Topology - 3. Kueue Topology name matches TopologyDomain name - 4. Levels extracted from TopologyDomain.Spec.Levels using topologyKey field - 5. Order preserved from TopologyDomain list +2. When TopologyDomain created, controller creates matching Kueue Topology +3. Kueue Topology name matches TopologyDomain name +4. Levels extracted from TopologyDomain.Spec.Levels using topologyKey field +5. Order preserved from TopologyDomain list Example: @@ -353,9 +353,9 @@ spec: Key Points: - Admin only creates TopologyDomain; Kueue Topology auto-generated - - Owner reference ensures Kueue Topology deleted with TopologyDomain - - Same name for both resources - - No manual coordination required +- Owner reference ensures Kueue Topology deleted with TopologyDomain +- Same name for both resources +- No manual coordination required **Implementation Note:** @@ -370,24 +370,24 @@ Prevents TopologyDomain deletion while PodCliqueSet resources reference it using Deletion Workflow: 1. Admin runs `kubectl delete topologydomain default` - 2. Kubernetes blocks deletion (finalizer `grove.run.ai/topology-protection` present) - 3. Controller reconciles: - - Detects deletion request (deletion timestamp set) - - Scans cluster for any PodCliqueSet resources - - If PodCliqueSet exists: Keeps finalizer, deletion blocked - - If no PodCliqueSet exists: Removes finalizer, deletion proceeds - 4. Once finalizer removed, Kubernetes deletes TopologyDomain +2. Kubernetes blocks deletion (finalizer `grove.run.ai/topology-protection` present) +3. Controller reconciles: + - Detects deletion request (deletion timestamp set) + - Scans cluster for any PodCliqueSet resources + - If PodCliqueSet exists: Keeps finalizer, deletion blocked + - If no PodCliqueSet exists: Removes finalizer, deletion proceeds +4. 
Once finalizer removed, Kubernetes deletes TopologyDomain Why Only Check PodCliqueSet: - Grove ownership hierarchy: PodCliqueSet owns PodCliqueScalingGroup and PodClique - - If no PodCliqueSet exists, other resources cannot exist +- If no PodCliqueSet exists, other resources cannot exist Key Points: - Admin must delete all PodCliqueSet before deleting TopologyDomain - - Controller continuously reconciles - - Prevents orphaned workloads with invalid topology references +- Controller continuously reconciles +- Prevents orphaned workloads with invalid topology references #### Operator Configuration @@ -410,8 +410,8 @@ spec: Configuration: - `--topology-domain-name`: Specifies TopologyDomain resource name for validation - - Operator loads referenced TopologyDomain at startup - - All PodCliqueSet topology constraints validated against this TopologyDomain +- Operator loads referenced TopologyDomain at startup +- All PodCliqueSet topology constraints validated against this TopologyDomain **Runtime Behavior:** @@ -433,8 +433,8 @@ When TopologyDomain is Missing or Deleted: **Multiple Topologies:** - Multiple TopologyDomain resources can exist (e.g., "aws-topology", "on-prem-topology") - - Operator argument selects which one to use - - Enables different topology configurations per environment +- Operator argument selects which one to use +- Enables different topology configurations per environment ### 2. Operator API Changes (Grove CRDs) @@ -462,7 +462,7 @@ PackDomain *string `json:"packDomain,omitempty"` **Types Removed:** - `SchedulingPolicyConfig` struct - Removed entirely - - `NetworkPackGroupConfig` struct - Removed entirely +- `NetworkPackGroupConfig` struct - Removed entirely #### PodCliqueSet CRD Extensions @@ -471,8 +471,6 @@ type PodCliqueSetTemplateSpec struct { // ... existing fields ... 
// TopologyConstraint defines topology placement requirements for PodCliqueSet -// Immutable after resource creation -// +kubebuilder:validation:XValidation:rule="self==oldSelf",message="topology constraints are immutable" // +optional TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` } @@ -486,8 +484,6 @@ type PodCliqueScalingGroupConfig struct { // TopologyConstraint defines topology placement requirements for PodCliqueScalingGroup // Must be equal to or stricter than parent PodCliqueSet constraints -// Immutable after resource creation -// +kubebuilder:validation:XValidation:rule="self==oldSelf",message="topology constraints are immutable" // +optional TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` } @@ -501,8 +497,6 @@ type PodCliqueTemplateSpec struct { // TopologyConstraint defines topology placement requirements for PodClique // Must be equal to or stricter than parent resource constraints -// Immutable after resource creation -// +kubebuilder:validation:XValidation:rule="self==oldSelf",message="topology constraints are immutable" // +optional TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` } @@ -515,20 +509,17 @@ The validation webhook ensures topology configuration consistency: **TopologyDomain Reference:** - TopologyDomain specified in operator's `--topology-domain-name` must exist - - Referenced PackDomain name must exist in TopologyDomain.Spec.Levels - - All validation performed against operator-configured TopologyDomain +- Referenced PackDomain name must exist in TopologyDomain.Spec.Levels +- All validation performed against operator-configured TopologyDomain **Hierarchy Constraints:** - Child resource PackDomain must be equal to or stricter than parent - - PodCliqueSet → PodCliqueScalingGroup → PodClique hierarchy - - Stricter = higher index (narrower scope) in TopologyDomain.Spec.Levels - - Example: If parent uses "zone" (index 1), child can use "zone", "rack" (index 4), or "host" (index 5) - -**Immutability:** - -- All TopologyConstraint fields immutable after resource creation - - Domain hierarchy relationships cannot change after creation +- PodCliqueSet → PodCliqueScalingGroup → PodClique hierarchy +- Stricter = higher index (narrower scope) in TopologyDomain.Spec.Levels +- Example: If parent uses "zone" (index 1), child can use "zone", "rack" (index 4), or "host" (index 5) +- Validation applies on both CREATE and UPDATE operations +- During updates, hierarchy constraints are re-validated to ensure child remains equal or stricter than parent ### 3. 
Scheduler API Changes (Contract with KAI) @@ -551,11 +542,13 @@ TopologyRef *string `json:"topologyRef,omitempty"` // TopologyConstraint defines topology packing constraints for entire pod gang // Translated from PodCliqueSet.TopologyConstraint +// Updated by operator on each reconciliation when PodCliqueSet topology constraints change // +optional TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` // NetworkPackGroupConfigs defines groups of PodGroups for network optimization // Enhanced with topology constraints for PCSG-level packing +// Updated by operator on each reconciliation when PCSG topology constraints change // +optional NetworkPackGroupConfigs []NetworkPackGroupConfig `json:"networkPackGroupConfigs,omitempty"` @@ -574,6 +567,7 @@ PodGroupNames []string `json:"podGroupNames"` // TopologyConstraint defines topology packing constraints for this group // Enables PCSG-level topology constraints +// Updated by operator when PodCliqueScalingGroup topology constraints change // +optional TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` } @@ -594,6 +588,7 @@ MinReplicas int32 `json:"minReplicas"` // TopologyConstraint defines topology packing constraints for this PodGroup // Enables PodClique-level topology constraints +// Updated by operator when PodClique topology constraints change // +optional TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` } @@ -626,10 +621,10 @@ PackDomain string `json:"packDomain"` Fields Added: - `PodGangSpec.TopologyRef *string` - References Kueue Topology CRD (optional pointer) - - `PodGangSpec.TopologyConstraint *TopologyConstraint` - PodGang-level packing from PodCliqueSet (optional pointer) - - `NetworkPackGroupConfig.TopologyConstraint *TopologyConstraint` - PCSG-level packing from PodCliqueScalingGroup ( - optional pointer) - - `PodGroup.TopologyConstraint *TopologyConstraint` - PodClique-level packing from PodClique (optional pointer) +- `PodGangSpec.TopologyConstraint *TopologyConstraint` - PodGang-level packing from PodCliqueSet (optional pointer) +- `NetworkPackGroupConfig.TopologyConstraint *TopologyConstraint` - PCSG-level packing from PodCliqueScalingGroup ( + optional pointer) +- `PodGroup.TopologyConstraint *TopologyConstraint` - PodClique-level packing from PodClique (optional pointer) Fields Removed: @@ -644,8 +639,8 @@ The operator translates Grove operator API to Grove Scheduler API with three-lev **TopologyRef Population:** - Set to Kueue Topology resource name (matches TopologyDomain name from operator config) - - Example: operator config `--topology-domain-name=default` → `TopologyRef.Name="default"` - - KAI scheduler uses this to locate the Kueue Topology CRD +- Example: operator config `--topology-domain-name=default` → `TopologyRef.Name="default"` +- KAI scheduler uses this to locate the Kueue Topology CRD **Constraint Translation (Required and Preferred):** @@ -654,15 +649,15 @@ The operator translates user's simple PackDomain into rich required/preferred st **Required Constraints:** - If user specifies `packDomain: "rack"` → becomes `TopologyConstraint.Required.PackDomain = "rack"` - - If user doesn't specify packDomain → `Required` is nil - - Applied at the appropriate level (PodGang, NetworkPackGroup, or PodGroup) +- If user doesn't specify packDomain → `Required` is nil +- Applied at the appropriate level (PodGang, NetworkPackGroup, or PodGroup) **Preferred Constraints (Auto-Generated):** - Operator ALWAYS generates preferred constraint at all three 
levels - - Uses strictest/lowest level from TopologyDomain.Spec.Levels (e.g., "host") - - Enables out-of-box optimization even without user configuration - - Scheduler can fallback to less strict levels if preferred cannot be satisfied +- Uses strictest/lowest level from TopologyDomain.Spec.Levels (e.g., "host") +- Enables out-of-box optimization even without user configuration +- Scheduler can fallback to less strict levels if preferred cannot be satisfied **Three-Level Translation:** @@ -670,14 +665,14 @@ The operator translates user's simple PackDomain into rich required/preferred st - `PodGangSpec.TopologyConstraint.Required` ← user's `PodCliqueSet.TopologyConstraint.PackDomain` (if set) - `PodGangSpec.TopologyConstraint.Preferred` ← auto-generated strictest level (e.g., "host") - 2. **NetworkPackGroup Level** (from PodCliqueScalingGroup): - - For each PCSG with TopologyConstraint, create NetworkPackGroupConfig - - `NetworkPackGroupConfig.TopologyConstraint.Required` ← user's `PCSG.TopologyConstraint.PackDomain` (if set) - - `NetworkPackGroupConfig.TopologyConstraint.Preferred` ← auto-generated strictest level +2. **NetworkPackGroup Level** (from PodCliqueScalingGroup): + - For each PCSG with TopologyConstraint, create NetworkPackGroupConfig + - `NetworkPackGroupConfig.TopologyConstraint.Required` ← user's `PCSG.TopologyConstraint.PackDomain` (if set) + - `NetworkPackGroupConfig.TopologyConstraint.Preferred` ← auto-generated strictest level - 3. **PodGroup Level** (from PodClique): - - `PodGroup.TopologyConstraint.Required` ← user's `PodClique.TopologyConstraint.PackDomain` (if set) - - `PodGroup.TopologyConstraint.Preferred` ← auto-generated strictest level +3. **PodGroup Level** (from PodClique): + - `PodGroup.TopologyConstraint.Required` ← user's `PodClique.TopologyConstraint.PackDomain` (if set) + - `PodGroup.TopologyConstraint.Preferred` ← auto-generated strictest level **Example Translation:** @@ -704,8 +699,16 @@ spec: **Hierarchy Validation:** - Child required constraints must be equal or stricter than parent required constraints - - Preferred constraints always use strictest level at all levels - - PodGang > NetworkPackGroup > PodGroup hierarchy maintained +- Preferred constraints always use strictest level at all levels +- PodGang > NetworkPackGroup > PodGroup hierarchy maintained + +**Mutable Topology Constraints:** + +- Users can update topology constraints at any time (PodCliqueSet, PodCliqueScalingGroup, PodClique levels) +- Constraint changes only affect new or unscheduled pods +- Already scheduled pods retain their current placement and are not rescheduled +- Operator re-translates constraints to PodGang on each reconciliation triggered by updates +- Useful for adjusting placement requirements when workloads fail to schedule due to resource constraints ## Component Architecture @@ -719,56 +722,56 @@ When a PodCliqueSet is created or updated, the Grove Operator translates it into - User creates PodCliqueSet with optional `topologyConstraint.packDomain` - Validation webhook validates against TopologyDomain - 2. **Operator Reconciles PodCliqueSet**: - - Operator detects PodCliqueSet creation/update - - Loads TopologyDomain specified in operator config (`--topology-domain-name`) - - Prepares PodGang resource creation/update - - 3. 
**Build PodGang TopologyConstraint**: - - **Required**: From user's `PodCliqueSet.topologyConstraint.packDomain` (if specified) - - **Preferred**: Auto-generated using strictest/lowest level from TopologyDomain.Spec.Levels (e.g., "host") - - Populates `PodGangSpec.TopologyConstraint` - - 4. **Build NetworkPackGroupConfigs**: - - For each PodCliqueScalingGroup with TopologyConstraint in PodCliqueSet - - Create NetworkPackGroupConfig entry with PodGroupNames from that PCSG - - **Required**: From `PCSG.topologyConstraint.packDomain` (if specified) - - **Preferred**: Auto-generated strictest level - - Populates `PodGangSpec.NetworkPackGroupConfigs` - - 5. **Build PodGroups with TopologyConstraint**: - - For each PodClique in PodCliqueSet, create corresponding PodGroup - - **Required**: From `PodClique.topologyConstraint.packDomain` (if specified) - - **Preferred**: Auto-generated strictest level - - Populates `PodGroup.TopologyConstraint` for each PodGroup - - 6. **Set TopologyRef**: - - References Kueue Topology by name (matches TopologyDomain name from operator config) - - Example: `--topology-domain-name=default` → `TopologyRef.Name="default"` - - KAI scheduler uses this to locate the Kueue Topology CRD - - 7. **Create/Update PodGang in Scheduler API**: - - Operator calls Grove Scheduler API to create/update PodGang - - PodGang now has complete topology information at three levels - - KAI scheduler consumes PodGang and applies topology-aware scheduling +2. **Operator Reconciles PodCliqueSet**: + - Operator detects PodCliqueSet creation/update + - Loads TopologyDomain specified in operator config (`--topology-domain-name`) + - Prepares PodGang resource creation/update + +3. **Build PodGang TopologyConstraint**: + - **Required**: From user's `PodCliqueSet.topologyConstraint.packDomain` (if specified) + - **Preferred**: Auto-generated using strictest/lowest level from TopologyDomain.Spec.Levels (e.g., "host") + - Populates `PodGangSpec.TopologyConstraint` + +4. **Build NetworkPackGroupConfigs**: + - For each PodCliqueScalingGroup with TopologyConstraint in PodCliqueSet + - Create NetworkPackGroupConfig entry with PodGroupNames from that PCSG + - **Required**: From `PCSG.topologyConstraint.packDomain` (if specified) + - **Preferred**: Auto-generated strictest level + - Populates `PodGangSpec.NetworkPackGroupConfigs` + +5. **Build PodGroups with TopologyConstraint**: + - For each PodClique in PodCliqueSet, create corresponding PodGroup + - **Required**: From `PodClique.topologyConstraint.packDomain` (if specified) + - **Preferred**: Auto-generated strictest level + - Populates `PodGroup.TopologyConstraint` for each PodGroup + +6. **Set TopologyRef**: + - References Kueue Topology by name (matches TopologyDomain name from operator config) + - Example: `--topology-domain-name=default` → `TopologyRef.Name="default"` + - KAI scheduler uses this to locate the Kueue Topology CRD + +7. 
**Create/Update PodGang in Scheduler API**: + - Operator calls Grove Scheduler API to create/update PodGang + - PodGang now has complete topology information at three levels + - KAI scheduler consumes PodGang and applies topology-aware scheduling **Key Points:** - Operator reconciliation performs translation - - Preferred constraints auto-generated at reconciliation time for out-of-box optimization - - Three-level hierarchy maintained: PodGang > NetworkPackGroup > PodGroup - - TopologyRef connects PodGang to KAI scheduler's required Kueue Topology - - All levels get both required (user-specified) and preferred (auto-generated) constraints +- Preferred constraints auto-generated at reconciliation time for out-of-box optimization +- Three-level hierarchy maintained: PodGang > NetworkPackGroup > PodGroup +- TopologyRef connects PodGang to KAI scheduler's required Kueue Topology +- All levels get both required (user-specified) and preferred (auto-generated) constraints ### Topology-Aware Scheduling Flow High-level end-to-end flow: 1. **Admin Setup**: Create TopologyDomain, configure operator - 2. **User Creates Workload**: PodCliqueSet with optional topology constraints - 3. **Validation**: Webhooks validate against TopologyDomain - 4. **Translation**: Operator builds PodGang with three-level constraints - 5. **Scheduling**: KAI scheduler applies topology constraints with fallback +2. **User Creates Workload**: PodCliqueSet with optional topology constraints +3. **Validation**: Webhooks validate against TopologyDomain +4. **Translation**: Operator builds PodGang with three-level constraints +5. **Scheduling**: KAI scheduler applies topology constraints with fallback ### Sequence Diagram @@ -836,7 +839,7 @@ High-level end-to-end flow: - Auto-optimization active for all workloads (preferred constraints generated) - User-specified `packDomain` validated and enforced as required constraints - **Disabled**: When `--topology-domain-name` argument not provided - - Topology constraints in workload CRDs ignored duri ng scheduling + - Topology constraints in workload CRDs ignored during scheduling - Workloads schedule without topology awareness - **Toggling**: Cannot enable/disable during runtime - requires operator restart with updated configuration @@ -847,7 +850,8 @@ This section documents key design decisions and their resolutions. **Q: How will cluster admins map Grove topology constants to physical topology labels?** **A: The `TopologyDomain` CRD provides the mapping mechanism. Admins create a TopologyDomain resource with -an ordered list of levels, where each level maps a friendly name (e.g., "rack") to a node label key (e.g., " +an ordered list of levels, where each level maps a friendly name (e.g., "rack", "zone", "host") to a node label key ( +e.g., " topology.kubernetes.io/rack"). This provides a clean, declarative API for topology configuration. 
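For illustration, a minimal sketch of such a mapping (level names, label keys, and the resource name here are example values, not prescribed ones):

```yaml
apiVersion: grove.run.ai/v1alpha1
kind: TopologyDomain
metadata:
  name: default
spec:
  levels:
    - name: zone                                  # friendly name referenced by packDomain
      topologyKey: "topology.kubernetes.io/zone"  # node label key carrying the zone identity
    - name: rack
      topologyKey: "topology.kubernetes.io/rack"
    - name: host
      topologyKey: "kubernetes.io/hostname"
```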
**Q: Should we allow changes to cluster topology levels and mappings after creation?**

@@ -866,12 +870,11 @@ configuration:

 **A: Remain Pending** - For gang-scheduled workloads with topology constraints:

 - **Required Constraints** (user-specified `packDomain`): Must be satisfied; entire gang remains pending if unsatisfied
-  - **Preferred Constraints** (auto-generated): Best-effort optimization; scheduler can fall back to less strict
-    levels
-  - This behavior ensures workload integrity for tightly-coupled distributed inference workloads where partial
-    scheduling is ineffective
-  - Users relying on strict placement should use required constraints; users wanting flexibility should rely on
-    preferred constraints
+- **Preferred Constraints** (auto-generated): Best-effort optimization; scheduler can fall back to less strict levels
+- This behavior ensures workload integrity for tightly-coupled distributed inference workloads where partial scheduling
+  is ineffective
+- Users relying on strict placement should use required constraints; users wanting flexibility should rely on preferred
+  constraints

 **Q: How will domain-level packing be realized in KAI scheduler?**

From 62362c3331df10226cd59bbd1595d55bcea2bf6c Mon Sep 17 00:00:00 2001
From: Ron Kahn
Date: Wed, 22 Oct 2025 18:32:14 +0300
Subject: [PATCH 06/15] CR Fixes

Signed-off-by: Ron Kahn
---
 docs/designs/topology.md | 146 +++++++++++----------------------------
 1 file changed, 40 insertions(+), 106 deletions(-)

diff --git a/docs/designs/topology.md b/docs/designs/topology.md
index 9b362b3b..a68b0de5 100644
--- a/docs/designs/topology.md
+++ b/docs/designs/topology.md
@@ -18,7 +18,7 @@ applications require:

 1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names (e.g., rack, zone, host)
    to node labels
-2. **Operator Configuration**: Selects active topology via `--topology-domain-name` argument
+2. **Operator Configuration**: Selects active topology via `OperatorConfiguration.TopologyDomainName` field
 3. **TopologyConstraint**: User-specified packing requirements in workloads (PodCliqueSet, PodCliqueScalingGroup,
    PodClique)

@@ -42,8 +42,7 @@ workload.

 - Root domain constraints for entire resource (RootDomain)
 - Ratio-based affinity groups between scaling groups (AffinityGroups with PackRatio)
 - Dynamic topology reconfiguration after creation
-- Per-workload topology domain selection
-- Automatic topology inference from workload characteristics
+- Automatic topology suggestions based on workload characteristics

 ## Proposal

@@ -58,7 +57,7 @@ Grove implements topology-aware scheduling through three key components:

 - Multiple TopologyDomains supported for different environments

 2.
**Operator Configuration**: References TopologyDomain by name - - Operator argument `--topology-domain-name=default` selects which TopologyDomain to use + - `OperatorConfiguration.TopologyDomainName: default` selects which TopologyDomain to use - All workload validation performed against configured TopologyDomain - Enables switching between topologies without changing workloads @@ -67,54 +66,6 @@ Grove implements topology-aware scheduling through three key components: - Users reference level names from TopologyDomain (e.g., `packDomain: "rack"`) - No direct TopologyDomain reference needed in workloads -### Component Interactions - -``` -TopologyDomain CRD ──┐ - (admin creates) │ - │ -Operator Config ─────┼──> Operator validates PackDomain - (--topology- │ against TopologyDomain.Spec.Levels - domain-name) │ - │ -PodCliqueSet ────────┘ - (packDomain: "rack") -``` - -### Automatic Optimization - -**Out-of-Box Optimization:** - -- Operator automatically generates **preferred** constraints using strictest topology level (e.g., "host") -- Applied at all three levels (PodGang, NetworkPackGroup, PodGroup) during translation to scheduler API -- Users get optimal packing without configuration - -**User Control:** - -- Users can specify **required** constraints via `packDomain` for strict placement requirements -- Required constraints validated and must be satisfied -- Preferred constraints enable best-effort optimization with graceful fallback - -### Controller Responsibilities - -The TopologyDomain controller manages: - -- **Kueue Topology Generation**: Auto-creates Kueue Topology CRD for KAI scheduler integration -- **Deletion Protection**: Prevents deletion while PodCliqueSet resources reference it - -## Out of Scope - -The following features are explicitly out of scope for this design: - -- **Spread Constraints**: ReplicaSpreadDomain for distributing replicas across domains for fault tolerance is not - supported -- **Advanced Topology Constraints Per Replica**: RootDomain for constraining entire resource (all replicas) within a - topology domain is not supported -- **Ratio Grouping Between Groups**: AffinityGroups with PackRatio for complex workload patterns (e.g., 2 Prefill + 1 - Decode ratios) is not supported -- **Workload-Based Auto Constraints**: Automatic constraint generation based on workload characteristics, patterns, and - inference requirements - ## Design Details ### Architecture Overview @@ -137,7 +88,7 @@ The following features are explicitly out of scope for this design: │ │ │ (auto-generated) │ │ │ │ └────────────────────┘ │ │ │ │ -│ Operator Config: --topology-domain-name=default │ +│ Operator Config: OperatorConfiguration.TopologyDomainName=default │ │ │ │ │ │ (validates against) │ ├─────────┼───────────────────────────────────────────────────────────────┤ @@ -172,7 +123,7 @@ to Kubernetes node labels and establishes ordering from broadest to narrowest sc **Characteristics:** - **Cluster-scoped**: Multiple TopologyDomains can exist -- **Operator-selected**: Operator references one by name via `--topology-domain-name` argument +- **Operator-selected**: Operator references one by name via `OperatorConfiguration.TopologyDomainName` field - **Immutable**: Once created, cannot be modified - **List-ordered hierarchy**: Index 0 = broadest (e.g., region), last = narrowest (e.g., host) @@ -181,7 +132,7 @@ to Kubernetes node labels and establishes ordering from broadest to narrowest sc ```go // TopologyDomain defines the topology hierarchy for the cluster // This resource is immutable 
after creation -// Multiple TopologyDomain resources can exist; Grove operator references one via --topology-domain-name argument +// Multiple TopologyDomain resources can exist; Grove operator references one via OperatorConfiguration.TopologyDomainName field type TopologyDomain struct { metav1.TypeMeta `json:",inline"` metav1.ObjectMeta `json:"metadata,omitempty"` @@ -264,7 +215,7 @@ Steps: 1. Install Grove: `helm install grove` 2. Customize example above with your cluster's actual `topologyKey` values 3. Create resource: `kubectl apply -f topologydomain.yaml` -4. Configure operator with `--topology-domain-name` matching the resource name +4. Configure operator with `OperatorConfiguration.TopologyDomainName` matching the resource name 5. Create workloads with topology constraints Notes: @@ -306,8 +257,7 @@ The TopologyDomain controller manages the TopologyDomain resource lifecycle with **1. Kueue Topology Generation** -Automatically generates Kueue Topology CRD from the TopologyDomain referenced by operator's `--topology-domain-name` -argument. +Automatically generates Kueue Topology CRD from the TopologyDomain. **Why Kueue Topology is Required:** @@ -350,12 +300,6 @@ spec: - nodeLabel: "kubernetes.io/hostname" ``` -Key Points: - -- Admin only creates TopologyDomain; Kueue Topology auto-generated -- Owner reference ensures Kueue Topology deleted with TopologyDomain -- Same name for both resources -- No manual coordination required **Implementation Note:** @@ -373,67 +317,56 @@ Deletion Workflow: 2. Kubernetes blocks deletion (finalizer `grove.run.ai/topology-protection` present) 3. Controller reconciles: - Detects deletion request (deletion timestamp set) - - Scans cluster for any PodCliqueSet resources - - If PodCliqueSet exists: Keeps finalizer, deletion blocked - - If no PodCliqueSet exists: Removes finalizer, deletion proceeds + - Scans cluster for any PodGang resources whose `Spec.TopologyRef` references this TopologyDomain by name + - If any PodGang references this TopologyDomain: Keeps finalizer, deletion blocked + - If no PodGang references this TopologyDomain: Removes finalizer, deletion proceeds 4. 
Once finalizer removed, Kubernetes deletes TopologyDomain -Why Only Check PodCliqueSet: - -- Grove ownership hierarchy: PodCliqueSet owns PodCliqueScalingGroup and PodClique -- If no PodCliqueSet exists, other resources cannot exist Key Points: -- Admin must delete all PodCliqueSet before deleting TopologyDomain -- Controller continuously reconciles +- Admin must delete all PodCliqueSet whose PodGang references this TopologyDomain before deletion +- Controller checks PodGang.Spec.TopologyRef field to determine references +- Controller continuously reconciles deletion requests - Prevents orphaned workloads with invalid topology references #### Operator Configuration -Operator references TopologyDomain by name via command-line argument: +Operator references TopologyDomain by name via OperatorConfiguration manifest: ```yaml -apiVersion: apps/v1 -kind: Deployment +apiVersion: grove.run.ai/v1alpha1 +kind: OperatorConfiguration metadata: - name: grove-operator + name: grove-operator-config spec: - template: - spec: - containers: - - name: operator - args: - - --topology-domain-name=default # References TopologyDomain by name + # Specifies which TopologyDomain resource to use for validation + topologyDomainName: default ``` -Configuration: - -- `--topology-domain-name`: Specifies TopologyDomain resource name for validation -- Operator loads referenced TopologyDomain at startup -- All PodCliqueSet topology constraints validated against this TopologyDomain - **Runtime Behavior:** When TopologyDomain is Missing or Deleted: -- **Startup**: If `--topology-domain-name` is configured but TopologyDomain doesn't exist at startup, operator fails to +- **Startup**: If `OperatorConfiguration.TopologyDomainName` is configured but TopologyDomain doesn't exist at startup, + operator fails to start - Operator requires TopologyDomain to exist for auto-optimization (preferred constraints generation) - This explicit failure prevents silent degradation of topology features - - Admin must create TopologyDomain or remove `--topology-domain-name` argument before operator starts + - Admin must create TopologyDomain or remove `TopologyDomainName` field before operator starts - **During Runtime**: If TopologyDomain is deleted while operator is running: - - Finalizer prevents deletion while any PodCliqueSet resources exist + - Finalizer prevents deletion while any PodCliqueSet that reference it (using PodGang) exist - If all PodCliqueSet resources are removed and TopologyDomain is deleted: - Operator blocks creation of ALL new workloads (topology and non-topology) - - Admin must either create new TopologyDomain OR remove `--topology-domain-name` operator argument and restart + - Admin must either create new TopologyDomain OR remove `TopologyDomainName` from OperatorConfiguration and + restart - This explicit behavior prevents implicit edge cases and ensures topology configuration consistency **Multiple Topologies:** - Multiple TopologyDomain resources can exist (e.g., "aws-topology", "on-prem-topology") -- Operator argument selects which one to use +- OperatorConfiguration field selects which one to use - Enables different topology configurations per environment ### 2. 
Operator API Changes (Grove CRDs) @@ -508,7 +441,7 @@ The validation webhook ensures topology configuration consistency: **TopologyDomain Reference:** -- TopologyDomain specified in operator's `--topology-domain-name` must exist +- TopologyDomain specified in `OperatorConfiguration.TopologyDomainName` must exist - Referenced PackDomain name must exist in TopologyDomain.Spec.Levels - All validation performed against operator-configured TopologyDomain @@ -639,7 +572,7 @@ The operator translates Grove operator API to Grove Scheduler API with three-lev **TopologyRef Population:** - Set to Kueue Topology resource name (matches TopologyDomain name from operator config) -- Example: operator config `--topology-domain-name=default` → `TopologyRef.Name="default"` +- Example: `OperatorConfiguration.TopologyDomainName: default` → `TopologyRef.Name="default"` - KAI scheduler uses this to locate the Kueue Topology CRD **Constraint Translation (Required and Preferred):** @@ -724,7 +657,7 @@ When a PodCliqueSet is created or updated, the Grove Operator translates it into 2. **Operator Reconciles PodCliqueSet**: - Operator detects PodCliqueSet creation/update - - Loads TopologyDomain specified in operator config (`--topology-domain-name`) + - Loads TopologyDomain specified in `OperatorConfiguration.TopologyDomainName` - Prepares PodGang resource creation/update 3. **Build PodGang TopologyConstraint**: @@ -747,7 +680,7 @@ When a PodCliqueSet is created or updated, the Grove Operator translates it into 6. **Set TopologyRef**: - References Kueue Topology by name (matches TopologyDomain name from operator config) - - Example: `--topology-domain-name=default` → `TopologyRef.Name="default"` + - Example: `OperatorConfiguration.TopologyDomainName: default` → `TopologyRef.Name="default"` - KAI scheduler uses this to locate the Kueue Topology CRD 7. 
**Create/Update PodGang in Scheduler API**: @@ -809,7 +742,7 @@ High-level end-to-end flow: **Case 1: TopologyDomain Not Configured** -- If `--topology-domain-name` argument not provided to operator: topology features completely disabled +- If `OperatorConfiguration.TopologyDomainName` field not provided: topology features completely disabled - PodCliqueSet workloads without `packDomain` function normally - PodCliqueSet workloads with `packDomain` specified: validation webhook rejects creation (cannot validate without TopologyDomain) @@ -817,11 +750,12 @@ High-level end-to-end flow: **Case 2: TopologyDomain Configured but Missing at Startup** -- If `--topology-domain-name` argument provided but TopologyDomain resource doesn't exist: operator fails to start +- If `OperatorConfiguration.TopologyDomainName` field provided but TopologyDomain resource doesn't exist: operator fails + to start - Operator requires TopologyDomain to exist for auto-optimization - Admin must either: - Create the referenced TopologyDomain resource, OR - - Remove `--topology-domain-name` argument from operator configuration + - Remove `TopologyDomainName` field from OperatorConfiguration **Case 3: TopologyDomain Deleted During Runtime** @@ -831,14 +765,14 @@ High-level end-to-end flow: - Existing workloads continue to function (already scheduled) - Admin must either: - Create new TopologyDomain resource with same name, OR - - Remove `--topology-domain-name` argument and restart operator + - Remove `TopologyDomainName` field from OperatorConfiguration and restart operator **Case 4: Topology Features Enabled/Disabled** -- **Enabled**: When `--topology-domain-name` provided and TopologyDomain exists +- **Enabled**: When `OperatorConfiguration.TopologyDomainName` provided and TopologyDomain exists - Auto-optimization active for all workloads (preferred constraints generated) - User-specified `packDomain` validated and enforced as required constraints -- **Disabled**: When `--topology-domain-name` argument not provided +- **Disabled**: When `TopologyDomainName` field not provided in OperatorConfiguration - Topology constraints in workload CRDs ignored during scheduling - Workloads schedule without topology awareness - **Toggling**: Cannot enable/disable during runtime - requires operator restart with updated configuration @@ -854,14 +788,14 @@ an ordered list of levels, where each level maps a friendly name (e.g., "rack", e.g., " topology.kubernetes.io/rack"). This provides a clean, declarative API for topology configuration. -**Q: Should we allow changes to cluster topology levels and mappings after creation?** + **Q: Should we allow changes to cluster topology levels and mappings after creation?** **A: No (Immutable)** - TopologyDomain and all TopologyConstraint fields are immutable after creation. This prevents unpredictable behavior with in-flight workloads and maintains scheduling consistency. To change topology configuration: 1. Create a new TopologyDomain with updated configuration -2. Update operator's `--topology-domain-name` argument to reference new TopologyDomain +2. Update `OperatorConfiguration.TopologyDomainName` to reference new TopologyDomain 3. Drain or migrate existing workloads 4. 
Delete old TopologyDomain after all workloads are migrated From 83597345d37c35719c75af3c518b121ac10e287d Mon Sep 17 00:00:00 2001 From: Ron Kahn Date: Wed, 22 Oct 2025 18:37:20 +0300 Subject: [PATCH 07/15] restore deleted part Signed-off-by: Ron Kahn --- docs/designs/topology.md | 34 ++++++++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index a68b0de5..7659a39c 100644 --- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -66,6 +66,40 @@ Grove implements topology-aware scheduling through three key components: - Users reference level names from TopologyDomain (e.g., `packDomain: "rack"`) - No direct TopologyDomain reference needed in workloads +### Automatic Optimization + +**Out-of-Box Optimization:** + +- Operator automatically generates **preferred** constraints using strictest topology level (e.g., "host") +- Applied at all three levels (PodGang, NetworkPackGroup, PodGroup) during translation to scheduler API +- Users get optimal packing without configuration + +**User Control:** + +- Users can specify **required** constraints via `packDomain` for strict placement requirements +- Required constraints validated and must be satisfied +- Preferred constraints enable best-effort optimization with graceful fallback + +### Controller Responsibilities + +The TopologyDomain controller manages: + +- **Kueue Topology Generation**: Auto-creates Kueue Topology CRD for KAI scheduler integration +- **Deletion Protection**: Prevents deletion while PodCliqueSet resources reference it + +## Out of Scope + +The following features are explicitly out of scope for this design: + +- **Spread Constraints**: ReplicaSpreadDomain for distributing replicas across domains for fault tolerance is not + supported +- **Advanced Topology Constraints Per Replica**: RootDomain for constraining entire resource (all replicas) within a + topology domain is not supported +- **Ratio Grouping Between Groups**: AffinityGroups with PackRatio for complex workload patterns (e.g., 2 Prefill + 1 + Decode ratios) is not supported +- **Workload-Based Auto Constraints**: Automatic constraint generation based on workload characteristics, patterns, and + inference requirements + ## Design Details ### Architecture Overview From 31683cf42dec97da28230051da01db789558d548 Mon Sep 17 00:00:00 2001 From: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com> Date: Wed, 22 Oct 2025 18:38:41 +0300 Subject: [PATCH 08/15] Update docs/designs/topology.md Co-authored-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com> Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com> --- docs/designs/topology.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index 7659a39c..6e74945d 100644 --- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -4,7 +4,7 @@ This document defines the design for supporting topology-aware scheduling in the Grove operator. 
-**Motivation**: Topology-aware scheduling is critical for Grove's multinode inference workloads because these +**Motivation**: Topology-aware scheduling is critical for Grove's multi-node inference workloads because these applications require: - **Network Locality**: Proximity improves high-bandwidth communication between leaders and their respective workers ( From 912ac6e5f3f4ba96049a0d911a8ac5436ae7c232 Mon Sep 17 00:00:00 2001 From: Ron Kahn Date: Sun, 26 Oct 2025 23:31:20 +0200 Subject: [PATCH 09/15] docs: refine topology design and clarify singleton constraints for TopologyDomain Signed-off-by: Ron Kahn --- docs/designs/topology.md | 569 ++++++++------------------------------- 1 file changed, 119 insertions(+), 450 deletions(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index 6e74945d..235e8261 100644 --- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -7,32 +7,17 @@ This document defines the design for supporting topology-aware scheduling in the **Motivation**: Topology-aware scheduling is critical for Grove's multi-node inference workloads because these applications require: -- **Network Locality**: Proximity improves high-bandwidth communication between leaders and their respective workers ( - prefill and decode, etc.) +- **Network Locality**: Proximity improves high-bandwidth communication between leaders and their respective workers - **Coordinated Placement**: Related components (e.g., model shards) perform better when co-located within the same topology domain - **Latency Optimization**: Minimizing network hops between interdependent inference components improves end-to-end performance -**Design Approach**: This design introduces a flexible topology system with three main components: - -1. **TopologyDomain CRD**: Admin-configured cluster topology hierarchy mapping friendly names (e.g., rack, zone, host) - to node labels -2. **Operator Configuration**: Selects active topology via `OperatorConfiguration.TopologyDomainName` field -3. **TopologyConstraint**: User-specified packing requirements in workloads (PodCliqueSet, PodCliqueScalingGroup, - PodClique) - -**Key Feature**: Grove attempts automatic out-of-box topology optimization by generating preferred (best-effort) packing -constraints at all levels, even without user configuration. This opportunistic packing may improve performance when -cluster resources allow, but users should specify required constraints when strict placement is critical for their -workload. - ## Goals - Provide flexible, cluster-agnostic topology hierarchy definition via TopologyDomain CRD - Enable packing constraints for network locality across all Grove scalable resources -- Support multiple topology configurations for different environments -- Automatic Kueue Topology generation for KAI scheduler integration +- Enforce singleton topology for cluster-wide consistency - Immutable topology configuration ensuring scheduling consistency - Hierarchical constraint validation (child stricter than parent) @@ -46,59 +31,10 @@ workload. ## Proposal -### High-Level Approach - -Grove implements topology-aware scheduling through three key components: - -1. **TopologyDomain CRD**: Cluster-scoped resource defining topology hierarchy - - Admin creates TopologyDomain with ordered list of topology levels - - Each level maps friendly name (e.g., "rack", "zone", "host") to node label key (e.g., " - topology.kubernetes.io/rack") - - Multiple TopologyDomains supported for different environments - -2. 
**Operator Configuration**: References TopologyDomain by name - - `OperatorConfiguration.TopologyDomainName: default` selects which TopologyDomain to use - - All workload validation performed against configured TopologyDomain - - Enables switching between topologies without changing workloads - -3. **Workload API (TopologyConstraint)**: Users specify packing requirements - - PodCliqueSet, PodCliqueScalingGroup, and PodClique each have TopologyConstraint field - - Users reference level names from TopologyDomain (e.g., `packDomain: "rack"`) - - No direct TopologyDomain reference needed in workloads - -### Automatic Optimization - -**Out-of-Box Optimization:** - -- Operator automatically generates **preferred** constraints using strictest topology level (e.g., "host") -- Applied at all three levels (PodGang, NetworkPackGroup, PodGroup) during translation to scheduler API -- Users get optimal packing without configuration - -**User Control:** - -- Users can specify **required** constraints via `packDomain` for strict placement requirements -- Required constraints validated and must be satisfied -- Preferred constraints enable best-effort optimization with graceful fallback - -### Controller Responsibilities - -The TopologyDomain controller manages: - -- **Kueue Topology Generation**: Auto-creates Kueue Topology CRD for KAI scheduler integration -- **Deletion Protection**: Prevents deletion while PodCliqueSet resources reference it - -## Out of Scope - -The following features are explicitly out of scope for this design: - -- **Spread Constraints**: ReplicaSpreadDomain for distributing replicas across domains for fault tolerance is not - supported -- **Advanced Topology Constraints Per Replica**: RootDomain for constraining entire resource (all replicas) within a - topology domain is not supported -- **Ratio Grouping Between Groups**: AffinityGroups with PackRatio for complex workload patterns (e.g., 2 Prefill + 1 - Decode ratios) is not supported -- **Workload-Based Auto Constraints**: Automatic constraint generation based on workload characteristics, patterns, and - inference requirements +Grove implements topology-aware scheduling through a singleton TopologyDomain CRD, +operator configuration to enable/disable features, and user-specified TopologyConstraints in workloads. +The operator automatically generates preferred constraints (lower bound) for optimization +while allowing users to specify required constraints for strict placement (upper bound). 
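As a rough sketch of this split (the PodCliqueSet nesting is abbreviated and the workload name is hypothetical; only `packDomain` and the required/preferred shape come from this design):

```yaml
# User side: request rack-level packing via the friendly level name
apiVersion: grove.run.ai/v1alpha1
kind: PodCliqueSet
metadata:
  name: inference-example
spec:
  template:
    topologyConstraint:
      packDomain: "rack"   # required constraint (upper bound), validated against TopologyDomain levels
---
# Conceptual operator output on the PodGang:
# topologyConstraint:
#   required:
#     packDomain: "topology.kubernetes.io/rack"  # level name resolved to its topologyKey
#   preferred:
#     packDomain: "kubernetes.io/hostname"       # auto-generated strictest level (lower bound)
```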
## Design Details @@ -110,25 +46,20 @@ The following features are explicitly out of scope for this design: ├─────────────────────────────────────────────────────────────────────────┤ │ │ │ Admin Layer: │ -│ ┌──────────────────┐ ┌────────────────────┐ │ -│ │ TopologyDomain │─────────────▶│ TopologyDomain │ │ -│ │ CR │ │ Controller │ │ -│ │ (levels list) │ └─────────┬──────────┘ │ -│ └──────────────────┘ │ │ -│ │ │ │ -│ │ ▼ │ -│ │ ┌────────────────────┐ │ -│ │ │ Kueue Topology │ │ -│ │ │ (auto-generated) │ │ -│ │ └────────────────────┘ │ -│ │ │ -│ Operator Config: OperatorConfiguration.TopologyDomainName=default │ -│ │ │ -│ │ (validates against) │ -├─────────┼───────────────────────────────────────────────────────────────┤ -│ │ │ -│ User Layer: │ -│ ▼ │ +│ ┌──────────────────────┐ ┌──────────────────────┐ │ +│ │ TopologyDomain │ │ Kueue Topology │ │ +│ │ "grove-topology" │ │ "grove-topology" │ │ +│ │ (singleton) │ │ (manual creation) │ │ +│ └──────────┬───────────┘ └───────────┬──────────┘ │ +│ │ │ │ +│ │ │ │ +│ Operator Config: OperatorConfiguration.EnableTopology=true │ +│ │ │ │ +│ │ (validates against) │ (referenced by) │ +├─────────────┼───────────────────────────────────┼───────────────────────┤ +│ │ │ │ +│ User Layer: │ │ +│ ▼ │ │ │ ┌──────────────────┐ ┌────────────────────┐ │ │ │ PodCliqueSet │─────────────▶│ Grove Operator │ │ │ │ (packDomain) │ │ (reconciles) │ │ @@ -138,7 +69,8 @@ The following features are explicitly out of scope for this design: │ ▼ │ │ ┌────────────────────┐ │ │ │ PodGang │───────▶ KAI │ -│ │ • TopologyRef │ Scheduler │ +│ │ • Annotation: │ Scheduler │ +│ │ topology-name │ │ │ │ • 3-level topology │ │ │ │ (required+ │ │ │ │ preferred) │ │ @@ -154,19 +86,21 @@ The following features are explicitly out of scope for this design: TopologyDomain is a cluster-scoped CR that defines the topology hierarchy for scheduling. It maps friendly level names to Kubernetes node labels and establishes ordering from broadest to narrowest scope. +*note: this CR is independent of Kueue Topology CRD, which must be manually created by admin to align with Grove's +TopologyDomain for KAI scheduler usage.* (also named "grove-topology") **Characteristics:** -- **Cluster-scoped**: Multiple TopologyDomains can exist -- **Operator-selected**: Operator references one by name via `OperatorConfiguration.TopologyDomainName` field +- **Cluster-scoped singleton**: Only one TopologyDomain allowed with enforced name "grove-topology" - **Immutable**: Once created, cannot be modified - **List-ordered hierarchy**: Index 0 = broadest (e.g., region), last = narrowest (e.g., host) +- **Webhook-validated**: Webhook enforces singleton constraint and name validation **API Structure:** ```go // TopologyDomain defines the topology hierarchy for the cluster // This resource is immutable after creation -// Multiple TopologyDomain resources can exist; Grove operator references one via OperatorConfiguration.TopologyDomainName field +// Only one TopologyDomain can exist cluster-wide with enforced name "grove-topology" type TopologyDomain struct { metav1.TypeMeta `json:",inline"` metav1.ObjectMeta `json:"metadata,omitempty"` @@ -216,7 +150,7 @@ Description string `json:"description,omitempty"` apiVersion: grove.run.ai/v1alpha1 kind: TopologyDomain metadata: - name: default + name: grove-topology spec: levels: - name: region @@ -244,129 +178,48 @@ spec: **Creating TopologyDomain:** -Steps: - -1. Install Grove: `helm install grove` -2. Customize example above with your cluster's actual `topologyKey` values -3. 
Create resource: `kubectl apply -f topologydomain.yaml` -4. Configure operator with `OperatorConfiguration.TopologyDomainName` matching the resource name -5. Create workloads with topology constraints - -Notes: - -- TopologyDomain becomes immutable after creation -- Multiple TopologyDomains can exist; operator uses the one specified in its argument -- Ensure node labels exist on cluster nodes before creating workloads -- List order defines hierarchy: index 0 = broadest, last = narrowest -- Example hierarchy: `region` (0) > `zone` (1) > `datacenter` (2) > `block` (3) > `rack` (4) > `host` (5) > `numa` (6) +1. Customize example above with your cluster's actual `topologyKey` values +2. Create resource: `kubectl apply -f topologydomain.yaml` (name MUST be "grove-topology") +3. Configure operator with `OperatorConfiguration.EnableTopology: true` +4. Manually create Kueue Topology with same name and aligned levels for KAI scheduler **Validation:** -CRD-Level: - -- At least one level required (minimum 1, maximum 10) -- Level `name` required (max 63 chars) -- Level `topologyKey` required (max 316 chars) -- Level `description` optional (max 1024 chars) -- Entire levels list immutable after creation - -Webhook: - -- Each level `name` must be unique within the `levels` array of a single TopologyDomain -- Each `topologyKey` must be unique within the `levels` array of a single TopologyDomain -- Cannot modify any field after creation -- Deletion protection via controller finalizer - -**Node Label Responsibility:** - -- Cluster administrators are responsible for ensuring that node labels specified in `topologyKey` fields exist on - cluster nodes -- TopologyDomain creation succeeds even if labels don't exist yet (allows pre-configuration) -- Workloads may fail to schedule if referenced topology labels are missing from nodes -- Administrators should verify node labels match TopologyDomain configuration before creating workloads +- Resource name MUST be "grove-topology" (webhook enforces singleton) +- Only one TopologyDomain allowed cluster-wide +- Each level `name` and `topologyKey` must be unique +- Immutable after creation (webhook blocks updates) #### TopologyDomain Controller -The TopologyDomain controller manages the TopologyDomain resource lifecycle with two primary responsibilities: +The TopologyDomain controller manages the TopologyDomain resource lifecycle: -**1. Kueue Topology Generation** +**Deletion Protection** -Automatically generates Kueue Topology CRD from the TopologyDomain. - -**Why Kueue Topology is Required:** - -Grove uses its own TopologyDomain CRD for user-friendly admin/user API, but KAI scheduler specifically requires Kueue's -Topology CRD format for actual scheduling operations. The TopologyDomain controller bridges this gap by: - -- Reading Grove's TopologyDomain (user-friendly with level names like "rack", "zone") -- Automatically generating Kueue Topology (KAI scheduler's required format with node labels only) -- Maintaining consistency between both representations -- Eliminating manual coordination for admins - -This separation allows Grove to provide better UX while maintaining compatibility with KAI scheduler requirements. - -Generation Process: - -1. Controller watches TopologyDomain specified in operator argument -2. When TopologyDomain created, controller creates matching Kueue Topology -3. Kueue Topology name matches TopologyDomain name -4. Levels extracted from TopologyDomain.Spec.Levels using topologyKey field -5. 
Order preserved from TopologyDomain list - -Example: - -From TopologyDomain `default` with levels zone/rack/host, controller generates: - -```yaml -apiVersion: kueue.x-k8s.io/v1alpha1 -kind: Topology -metadata: - name: default - ownerReferences: - - apiVersion: grove.run.ai/v1alpha1 - kind: TopologyDomain - name: default - controller: true -spec: - levels: - - nodeLabel: "topology.kubernetes.io/zone" - - nodeLabel: "topology.kubernetes.io/rack" - - nodeLabel: "kubernetes.io/hostname" -``` - - -**Implementation Note:** - -To avoid importing the entire Kueue package with all its dependencies, the operator will use Kubernetes unstructured API -to create and manage Kueue Topology CRDs. This approach is acceptable since the Kueue Topology CRD structure is simple ( -just a list of node label keys). - -**2. Deletion Protection** - -Prevents TopologyDomain deletion while PodCliqueSet resources reference it using Kubernetes finalizer. +Prevents TopologyDomain deletion while any PodCliqueSet resources exist using Kubernetes finalizer. Deletion Workflow: -1. Admin runs `kubectl delete topologydomain default` -2. Kubernetes blocks deletion (finalizer `grove.run.ai/topology-protection` present) +1. Admin runs `kubectl delete topologydomain grove-topology` +2. Kubernetes blocks deletion (finalizer `grove.io/topologydomain` present) 3. Controller reconciles: - Detects deletion request (deletion timestamp set) - - Scans cluster for any PodGang resources whose `Spec.TopologyRef` references this TopologyDomain by name - - If any PodGang references this TopologyDomain: Keeps finalizer, deletion blocked - - If no PodGang references this TopologyDomain: Removes finalizer, deletion proceeds + - Scans cluster for any PodCliqueSet resources + - If any PodCliqueSet exists: Keeps finalizer, deletion blocked + - If no PodCliqueSet exists: Removes finalizer, deletion proceeds 4. 
Once finalizer removed, Kubernetes deletes TopologyDomain - Key Points: -- Admin must delete all PodCliqueSet whose PodGang references this TopologyDomain before deletion -- Controller checks PodGang.Spec.TopologyRef field to determine references +- Admin must delete all PodCliqueSet resources before deleting TopologyDomain +- Controller checks if any PodCliqueSet exists (no need to check specific references) +- Since topology is singleton, any PodCliqueSet potentially uses it - Controller continuously reconciles deletion requests -- Prevents orphaned workloads with invalid topology references +- Prevents orphaned workloads with invalid topology configuration #### Operator Configuration -Operator references TopologyDomain by name via OperatorConfiguration manifest: +Operator enables/disables topology features via OperatorConfiguration manifest: ```yaml apiVersion: grove.run.ai/v1alpha1 @@ -374,34 +227,19 @@ kind: OperatorConfiguration metadata: name: grove-operator-config spec: - # Specifies which TopologyDomain resource to use for validation - topologyDomainName: default + # Enables topology-aware scheduling features + enableTopology: true ``` -**Runtime Behavior:** +**Startup Behavior:** -When TopologyDomain is Missing or Deleted: +- If `EnableTopology: true` but "grove-topology" doesn't exist: operator fails to start +- Admin must create TopologyDomain "grove-topology" OR disable topology -- **Startup**: If `OperatorConfiguration.TopologyDomainName` is configured but TopologyDomain doesn't exist at startup, - operator fails to - start - - Operator requires TopologyDomain to exist for auto-optimization (preferred constraints generation) - - This explicit failure prevents silent degradation of topology features - - Admin must create TopologyDomain or remove `TopologyDomainName` field before operator starts +**Admin Responsibilities:** -- **During Runtime**: If TopologyDomain is deleted while operator is running: - - Finalizer prevents deletion while any PodCliqueSet that reference it (using PodGang) exist - - If all PodCliqueSet resources are removed and TopologyDomain is deleted: - - Operator blocks creation of ALL new workloads (topology and non-topology) - - Admin must either create new TopologyDomain OR remove `TopologyDomainName` from OperatorConfiguration and - restart - - This explicit behavior prevents implicit edge cases and ensures topology configuration consistency - -**Multiple Topologies:** - -- Multiple TopologyDomain resources can exist (e.g., "aws-topology", "on-prem-topology") -- OperatorConfiguration field selects which one to use -- Enables different topology configurations per environment +- Manually create Kueue Topology with name "grove-topology" for KAI scheduler +- Ensure topology levels align between Grove TopologyDomain and Kueue Topology ### 2. 
Operator API Changes (Grove CRDs) @@ -471,22 +309,12 @@ TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` #### Validation Webhook -The validation webhook ensures topology configuration consistency: - -**TopologyDomain Reference:** - -- TopologyDomain specified in `OperatorConfiguration.TopologyDomainName` must exist -- Referenced PackDomain name must exist in TopologyDomain.Spec.Levels -- All validation performed against operator-configured TopologyDomain - **Hierarchy Constraints:** -- Child resource PackDomain must be equal to or stricter than parent +- Child PackDomain must be equal to or stricter than parent (stricter = higher index in levels list) - PodCliqueSet → PodCliqueScalingGroup → PodClique hierarchy -- Stricter = higher index (narrower scope) in TopologyDomain.Spec.Levels -- Example: If parent uses "zone" (index 1), child can use "zone", "rack" (index 4), or "host" (index 5) +- Referenced PackDomain name must exist in TopologyDomain.Spec.Levels - Validation applies on both CREATE and UPDATE operations -- During updates, hierarchy constraints are re-validated to ensure child remains equal or stricter than parent ### 3. Scheduler API Changes (Contract with KAI) @@ -502,11 +330,6 @@ type PodGangSpec struct { // PodGroups is a list of member pod groups in the PodGang PodGroups []PodGroup `json:"podgroups"` -// TopologyRef references the Kueue Topology resource -// Points to Kueue Topology CRD auto-generated by TopologyDomain controller -// +optional -TopologyRef *string `json:"topologyRef,omitempty"` - // TopologyConstraint defines topology packing constraints for entire pod gang // Translated from PodCliqueSet.TopologyConstraint // Updated by operator on each reconciliation when PodCliqueSet topology constraints change @@ -524,6 +347,20 @@ PriorityClassName string `json:"priorityClassName,omitempty"` } ``` +**PodGang Metadata:** + +The operator adds topology information to PodGang metadata via annotation: + +```go +// Annotation added to PodGang +metadata: +annotations: +grove.run.ai/topology-name: "grove-topology" +``` + +This annotation allows the scheduler to locate the Kueue Topology resource without requiring a spec field, providing +flexibility for future API changes. 
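+
+A sketch of the operator-side helper that could stamp this annotation while building the PodGang (the constant,
+helper name, and import path are illustrative, not part of the agreed contract):
+
+```go
+package podgang
+
+import (
+	schedulerv1alpha1 "github.com/NVIDIA/grove/scheduler/api/v1alpha1" // illustrative path
+)
+
+const topologyNameAnnotation = "grove.run.ai/topology-name"
+
+// annotatePodGang records which Kueue Topology the scheduler should resolve.
+// With the singleton TopologyDomain, the value is always "grove-topology".
+func annotatePodGang(podGang *schedulerv1alpha1.PodGang, topologyName string) {
+	if podGang.Annotations == nil {
+		podGang.Annotations = map[string]string{}
+	}
+	podGang.Annotations[topologyNameAnnotation] = topologyName
+}
+```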
+ **NetworkPackGroupConfig:** ```go @@ -578,7 +415,9 @@ Preferred *PackConstraint `json:"preferred,omitempty"` } type PackConstraint struct { -// PackDomain references a level name from TopologyDomain.Spec.Levels +// PackDomain holds the topologyKey (not level name) for the topology constraint +// Operator translates user's level name to the corresponding topologyKey from TopologyDomain +// Example: "topology.kubernetes.io/rack" or "kubernetes.io/hostname" PackDomain string `json:"packDomain"` } ``` @@ -587,12 +426,15 @@ PackDomain string `json:"packDomain"` Fields Added: -- `PodGangSpec.TopologyRef *string` - References Kueue Topology CRD (optional pointer) - `PodGangSpec.TopologyConstraint *TopologyConstraint` - PodGang-level packing from PodCliqueSet (optional pointer) - `NetworkPackGroupConfig.TopologyConstraint *TopologyConstraint` - PCSG-level packing from PodCliqueScalingGroup ( optional pointer) - `PodGroup.TopologyConstraint *TopologyConstraint` - PodClique-level packing from PodClique (optional pointer) +Annotations Added: + +- `grove.run.ai/topology-name: "grove-topology"` - Annotation on PodGang metadata referencing topology name + Fields Removed: - `PodGangSpec.SpreadConstraints` - Not implemented; spread will be part of TopologyConstraint in future @@ -603,43 +445,44 @@ Fields Removed: The operator translates Grove operator API to Grove Scheduler API with three-level topology constraint hierarchy: -**TopologyRef Population:** +**Topology Annotation:** -- Set to Kueue Topology resource name (matches TopologyDomain name from operator config) -- Example: `OperatorConfiguration.TopologyDomainName: default` → `TopologyRef.Name="default"` -- KAI scheduler uses this to locate the Kueue Topology CRD +- Operator adds annotation `grove.run.ai/topology-name: "grove-topology"` to PodGang metadata +- KAI scheduler uses this annotation to locate the Kueue Topology CRD with name "grove-topology" +- Annotation approach provides API flexibility for future changes without breaking spec **Constraint Translation (Required and Preferred):** -The operator translates user's simple PackDomain into rich required/preferred structure in scheduler API: +The operator translates user's level names to topologyKeys and builds required/preferred structure: **Required Constraints:** -- If user specifies `packDomain: "rack"` → becomes `TopologyConstraint.Required.PackDomain = "rack"` +- User specifies level name: `packDomain: "rack"` +- Operator looks up topologyKey from TopologyDomain: `"topology.kubernetes.io/rack"` +- Writes to PodGang: `TopologyConstraint.Required.PackDomain = "topology.kubernetes.io/rack"` - If user doesn't specify packDomain → `Required` is nil -- Applied at the appropriate level (PodGang, NetworkPackGroup, or PodGroup) **Preferred Constraints (Auto-Generated):** - Operator ALWAYS generates preferred constraint at all three levels -- Uses strictest/lowest level from TopologyDomain.Spec.Levels (e.g., "host") +- Uses topologyKey of strictest level (e.g., `"kubernetes.io/hostname"` for "host" level) - Enables out-of-box optimization even without user configuration - Scheduler can fallback to less strict levels if preferred cannot be satisfied **Three-Level Translation:** 1. 
**PodGang Level** (from PodCliqueSet): - - `PodGangSpec.TopologyConstraint.Required` ← user's `PodCliqueSet.TopologyConstraint.PackDomain` (if set) - - `PodGangSpec.TopologyConstraint.Preferred` ← auto-generated strictest level (e.g., "host") + - `PodGangSpec.TopologyConstraint.Required` ← topologyKey looked up from user's level name (if set) + - `PodGangSpec.TopologyConstraint.Preferred` ← topologyKey of strictest level (e.g., `"kubernetes.io/hostname"`) 2. **NetworkPackGroup Level** (from PodCliqueScalingGroup): - For each PCSG with TopologyConstraint, create NetworkPackGroupConfig - - `NetworkPackGroupConfig.TopologyConstraint.Required` ← user's `PCSG.TopologyConstraint.PackDomain` (if set) - - `NetworkPackGroupConfig.TopologyConstraint.Preferred` ← auto-generated strictest level + - `NetworkPackGroupConfig.TopologyConstraint.Required` ← topologyKey looked up from PCSG level name (if set) + - `NetworkPackGroupConfig.TopologyConstraint.Preferred` ← topologyKey of strictest level 3. **PodGroup Level** (from PodClique): - - `PodGroup.TopologyConstraint.Required` ← user's `PodClique.TopologyConstraint.PackDomain` (if set) - - `PodGroup.TopologyConstraint.Preferred` ← auto-generated strictest level + - `PodGroup.TopologyConstraint.Required` ← topologyKey looked up from PodClique level name (if set) + - `PodGroup.TopologyConstraint.Preferred` ← topologyKey of strictest level **Example Translation:** @@ -649,7 +492,7 @@ User creates PodCliqueSet: spec: template: topologyConstraint: - packDomain: "rack" # User specifies required constraint + packDomain: "rack" # User specifies level NAME ``` Operator translates to PodGang: @@ -658,9 +501,9 @@ Operator translates to PodGang: spec: topologyConstraint: required: - packDomain: "rack" # From user + packDomain: "topology.kubernetes.io/rack" # Operator looks up topologyKEY preferred: - packDomain: "host" # Auto-generated by operator + packDomain: "kubernetes.io/hostname" # Auto-generated topologyKEY of strictest level ``` **Hierarchy Validation:** @@ -671,11 +514,9 @@ spec: **Mutable Topology Constraints:** -- Users can update topology constraints at any time (PodCliqueSet, PodCliqueScalingGroup, PodClique levels) -- Constraint changes only affect new or unscheduled pods -- Already scheduled pods retain their current placement and are not rescheduled -- Operator re-translates constraints to PodGang on each reconciliation triggered by updates -- Useful for adjusting placement requirements when workloads fail to schedule due to resource constraints +- Users can update topology constraints at any time +- Changes only affect new or unscheduled pods (already scheduled pods retain placement) +- Operator re-translates constraints to PodGang on each reconciliation ## Component Architecture @@ -683,205 +524,33 @@ spec: When a PodCliqueSet is created or updated, the Grove Operator translates it into Grove Scheduler API (PodGang CRD): -**Step-by-Step Translation:** - -1. **PodCliqueSet Created/Updated**: - - User creates PodCliqueSet with optional `topologyConstraint.packDomain` - - Validation webhook validates against TopologyDomain +**Translation Steps:** -2. **Operator Reconciles PodCliqueSet**: - - Operator detects PodCliqueSet creation/update - - Loads TopologyDomain specified in `OperatorConfiguration.TopologyDomainName` - - Prepares PodGang resource creation/update +1. User creates PodCliqueSet with optional `topologyConstraint.packDomain` (level name, e.g., "rack") +2. 
Operator loads TopologyDomain "grove-topology" and builds PodGang: + - Looks up topologyKey for each user-specified level name (e.g., "rack" → "topology.kubernetes.io/rack") + - **PodGang level**: Required (topologyKey from PCS name) + Preferred (topologyKey of strictest level) + - **NetworkPackGroup level**: Required (topologyKey from PCSG name) + Preferred (topologyKey of strictest level) + - **PodGroup level**: Required (topologyKey from PodClique name) + Preferred (topologyKey of strictest level) + - Adds annotation `grove.run.ai/topology-name: "grove-topology"` to PodGang metadata +3. KAI scheduler reads annotation, uses topologyKeys to apply three-level topology constraints -3. **Build PodGang TopologyConstraint**: - - **Required**: From user's `PodCliqueSet.topologyConstraint.packDomain` (if specified) - - **Preferred**: Auto-generated using strictest/lowest level from TopologyDomain.Spec.Levels (e.g., "host") - - Populates `PodGangSpec.TopologyConstraint` +### End-to-End Flow -4. **Build NetworkPackGroupConfigs**: - - For each PodCliqueScalingGroup with TopologyConstraint in PodCliqueSet - - Create NetworkPackGroupConfig entry with PodGroupNames from that PCSG - - **Required**: From `PCSG.topologyConstraint.packDomain` (if specified) - - **Preferred**: Auto-generated strictest level - - Populates `PodGangSpec.NetworkPackGroupConfigs` - -5. **Build PodGroups with TopologyConstraint**: - - For each PodClique in PodCliqueSet, create corresponding PodGroup - - **Required**: From `PodClique.topologyConstraint.packDomain` (if specified) - - **Preferred**: Auto-generated strictest level - - Populates `PodGroup.TopologyConstraint` for each PodGroup - -6. **Set TopologyRef**: - - References Kueue Topology by name (matches TopologyDomain name from operator config) - - Example: `OperatorConfiguration.TopologyDomainName: default` → `TopologyRef.Name="default"` - - KAI scheduler uses this to locate the Kueue Topology CRD - -7. **Create/Update PodGang in Scheduler API**: - - Operator calls Grove Scheduler API to create/update PodGang - - PodGang now has complete topology information at three levels - - KAI scheduler consumes PodGang and applies topology-aware scheduling - -**Key Points:** - -- Operator reconciliation performs translation -- Preferred constraints auto-generated at reconciliation time for out-of-box optimization -- Three-level hierarchy maintained: PodGang > NetworkPackGroup > PodGroup -- TopologyRef connects PodGang to KAI scheduler's required Kueue Topology -- All levels get both required (user-specified) and preferred (auto-generated) constraints - -### Topology-Aware Scheduling Flow - -High-level end-to-end flow: - -1. **Admin Setup**: Create TopologyDomain, configure operator +1. **Admin Setup**: Create TopologyDomain "grove-topology", configure operator with `EnableTopology: true`, create + aligned Kueue Topology 2. **User Creates Workload**: PodCliqueSet with optional topology constraints -3. **Validation**: Webhooks validate against TopologyDomain -4. **Translation**: Operator builds PodGang with three-level constraints -5. **Scheduling**: KAI scheduler applies topology constraints with fallback - -### Sequence Diagram - -``` -┌──────────────┐ ┌──────────────────┐ ┌─────────────────┐ ┌─────────────────┐ -│ PodCliqueSet │ │ Grove Operator │ │ Grove Scheduler │ │ Scheduler │ -│ │ │ │ │ API │ │ │ -└──────┬───────┘ └─────────┬────────┘ └────────┬────────┘ └────────┬────────┘ - │ │ │ │ - │ CREATE/UPDATE │ │ │ - ├─────────────────────▶│ │ │ - │ │ │ │ - │ │ 1. 
Validation webhook │ │ - │ │ validates against │ │ - │ │ TopologyDomain │ │ - │ │ │ │ - │ │ 2. Translate to │ │ - │ │ PodGang(s) spec │ │ - │ │ │ │ - │ │ CREATE/UPDATE PodGangs│ │ - │ ├─────────────────────▶ │ │ - │ │ │ │ - │ │ │ SCHEDULE Pods │ - │ │ ├─────────────────────▶│ - │ │ │ │ - │ │ │ │ Apply topology - │ │ │ │ using Kueue - │ │ │ │ Topology CRD - │ │ │ │ -``` - -## Implementation Notes - -### Edge Cases - -**Case 1: TopologyDomain Not Configured** - -- If `OperatorConfiguration.TopologyDomainName` field not provided: topology features completely disabled -- PodCliqueSet workloads without `packDomain` function normally -- PodCliqueSet workloads with `packDomain` specified: validation webhook rejects creation (cannot validate without - TopologyDomain) -- No auto-optimization (preferred constraints) applied - -**Case 2: TopologyDomain Configured but Missing at Startup** - -- If `OperatorConfiguration.TopologyDomainName` field provided but TopologyDomain resource doesn't exist: operator fails - to start -- Operator requires TopologyDomain to exist for auto-optimization -- Admin must either: - - Create the referenced TopologyDomain resource, OR - - Remove `TopologyDomainName` field from OperatorConfiguration - -**Case 3: TopologyDomain Deleted During Runtime** - -- Finalizer prevents deletion while any PodCliqueSet resources exist -- If TopologyDomain deleted after all PodCliqueSet resources removed: - - Operator blocks creation of ALL new workloads (topology and non-topology) - - Existing workloads continue to function (already scheduled) -- Admin must either: - - Create new TopologyDomain resource with same name, OR - - Remove `TopologyDomainName` field from OperatorConfiguration and restart operator - -**Case 4: Topology Features Enabled/Disabled** - -- **Enabled**: When `OperatorConfiguration.TopologyDomainName` provided and TopologyDomain exists - - Auto-optimization active for all workloads (preferred constraints generated) - - User-specified `packDomain` validated and enforced as required constraints -- **Disabled**: When `TopologyDomainName` field not provided in OperatorConfiguration - - Topology constraints in workload CRDs ignored during scheduling - - Workloads schedule without topology awareness -- **Toggling**: Cannot enable/disable during runtime - requires operator restart with updated configuration - -### Resolved Design Questions - -This section documents key design decisions and their resolutions. - -**Q: How will cluster admins map Grove topology constants to physical topology labels?** - -**A: The `TopologyDomain` CRD provides the mapping mechanism. Admins create a TopologyDomain resource with -an ordered list of levels, where each level maps a friendly name (e.g., "rack", "zone", "host") to a node label key ( -e.g., " -topology.kubernetes.io/rack"). This provides a clean, declarative API for topology configuration. - - **Q: Should we allow changes to cluster topology levels and mappings after creation?** - -**A: No (Immutable)** - TopologyDomain and all TopologyConstraint fields are immutable after creation. This -prevents unpredictable behavior with in-flight workloads and maintains scheduling consistency. To change topology -configuration: - -1. Create a new TopologyDomain with updated configuration -2. Update `OperatorConfiguration.TopologyDomainName` to reference new TopologyDomain -3. Drain or migrate existing workloads -4. 
Delete old TopologyDomain after all workloads are migrated - -**Q: If topology constraints cannot be satisfied, should workloads remain pending or schedule anyway?** - -**A: Remain Pending** - For gang-scheduled workloads with topology constraints: - -- **Required Constraints** (user-specified `packDomain`): Must be satisfied; entire gang remains pending if unsatisfied -- **Preferred Constraints** (auto-generated): Best-effort optimization; scheduler can fall back to less strict levels -- This behavior ensures workload integrity for tightly-coupled distributed inference workloads where partial scheduling - is ineffective -- Users relying on strict placement should use required constraints; users wanting flexibility should rely on preferred - constraints - -**Q: How will domain-level packing be realized in KAI scheduler?** - -**A: Contract Defined** - The `PodGang` CRD serves as the API contract between Grove operator and KAI scheduler. -Expected scheduler behavior: - -1. **Topology Resolution**: KAI Pod Grouper reads `PodGang.spec.topologyRef` to locate Kueue Topology CRD -2. **Constraint Processing**: For each topology constraint (PodGang, NetworkPackGroup, PodGroup level): - - Process `required` constraints first (must satisfy) - - Apply `preferred` constraints as optimization hints (best-effort) -3. **Domain Filtering**: Filter cluster nodes to find topology domains (e.g., single rack, single host) that satisfy: - - Resource requests for all pods in the constraint scope - - Required topology level specified in constraint -4. **Placement**: Schedule all pods in the constraint scope within the chosen topology domain -5. **Fallback**: For preferred constraints, fall back to less strict topology levels if preferred level cannot be - satisfied -6. **Gang Semantics**: If required constraints cannot be satisfied, entire gang remains unscheduled (all-or-nothing) - -This contract ensures Grove workloads receive topology-aware placement while maintaining scheduler independence. +3. **Validation**: Webhook validates against TopologyDomain +4. **Translation**: Operator builds PodGang with three-level constraints (required + preferred) +5. **Scheduling**: KAI scheduler reads annotation, applies topology constraints with fallback ## Security and RBAC -The topology system requires careful RBAC configuration to ensure proper separation of concerns between cluster -administrators and the operator. 
- -### ClusterRole: Grove Operator - -The Grove operator requires read access to TopologyDomain and full management of Kueue Topology: +Grove operator requires read access to TopologyDomain and permission to manage finalizers: ```yaml -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - name: grove-operator-topology rules: - apiGroups: [ "grove.run.ai" ] - resources: [ "topologydomains" ] - verbs: [ "get", "list", "watch" ] - - apiGroups: [ "kueue.x-k8s.io" ] - resources: [ "topologies" ] - verbs: [ "create", "delete", "get", "list", "watch", "update", "patch" ] + resources: [ "topologydomains", "topologydomains/finalizers" ] + verbs: [ "get", "list", "watch", "update" ] ``` From 3a8d2dfadd907ace7d52ba16c8bc96d15d637178 Mon Sep 17 00:00:00 2001 From: Ron Kahn Date: Mon, 27 Oct 2025 10:59:58 +0200 Subject: [PATCH 10/15] docs: update topology documentation to clarify naming and constraints for TopologyDomain Signed-off-by: Ron Kahn --- docs/designs/topology.md | 126 +++++++++++++++++++++++---------------- 1 file changed, 74 insertions(+), 52 deletions(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index 235e8261..a0e48c3a 100644 --- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -90,10 +90,11 @@ to Kubernetes node labels and establishes ordering from broadest to narrowest sc TopologyDomain for KAI scheduler usage.* (also named "grove-topology") **Characteristics:** -- **Cluster-scoped singleton**: Only one TopologyDomain allowed with enforced name "grove-topology" +- **Cluster-scoped singleton**: Only one TopologyDomain allowed cluster-wide, user chooses name +- **Default name**: "grove-topology" used when topologyDomainName not specified in operator config - **Immutable**: Once created, cannot be modified - **List-ordered hierarchy**: Index 0 = broadest (e.g., region), last = narrowest (e.g., host) -- **Webhook-validated**: Webhook enforces singleton constraint and name validation +- **Webhook-validated**: Webhook enforces singleton constraint (any name allowed) **API Structure:** @@ -147,10 +148,10 @@ Description string `json:"description,omitempty"` **Example TopologyDomain:** ```yaml -apiVersion: grove.run.ai/v1alpha1 +apiVersion: grove.io/v1alpha1 kind: TopologyDomain metadata: - name: grove-topology + name: my-cluster-topology # User chooses name spec: levels: - name: region @@ -179,16 +180,19 @@ spec: **Creating TopologyDomain:** 1. Customize example above with your cluster's actual `topologyKey` values -2. Create resource: `kubectl apply -f topologydomain.yaml` (name MUST be "grove-topology") -3. Configure operator with `OperatorConfiguration.EnableTopology: true` -4. Manually create Kueue Topology with same name and aligned levels for KAI scheduler +2. Choose a name for your topology: + - Use custom name (e.g., "my-cluster-topology") OR + - Use default name "grove-topology" (no config needed) +3. Create resource: `kubectl apply -f topologydomain.yaml` +4. If using custom name: configure operator with topology name in OperatorConfiguration +5. 
Manually create Kueue Topology with same name and aligned levels for KAI scheduler **Validation:** -- Resource name MUST be "grove-topology" (webhook enforces singleton) -- Only one TopologyDomain allowed cluster-wide +- Only one TopologyDomain allowed cluster-wide (webhook enforces singleton, any name allowed) - Each level `name` and `topologyKey` must be unique - Immutable after creation (webhook blocks updates) +- Deletion protection via controller finalizer (blocks deletion while PodCliqueSet resources exist) #### TopologyDomain Controller @@ -200,7 +204,7 @@ Prevents TopologyDomain deletion while any PodCliqueSet resources exist using Ku Deletion Workflow: -1. Admin runs `kubectl delete topologydomain grove-topology` +1. Admin runs `kubectl delete topologydomain ` 2. Kubernetes blocks deletion (finalizer `grove.io/topologydomain` present) 3. Controller reconciles: - Detects deletion request (deletion timestamp set) @@ -222,23 +226,28 @@ Key Points: Operator enables/disables topology features via OperatorConfiguration manifest: ```yaml -apiVersion: grove.run.ai/v1alpha1 +apiVersion: grove.io/v1alpha1 kind: OperatorConfiguration metadata: name: grove-operator-config spec: - # Enables topology-aware scheduling features - enableTopology: true + topology: + enabled: true + topologyDomainName: "my-cluster-topology" # Optional, defaults to "grove-topology" ``` **Startup Behavior:** -- If `EnableTopology: true` but "grove-topology" doesn't exist: operator fails to start -- Admin must create TopologyDomain "grove-topology" OR disable topology +- If `topology.enabled: true`: + - `topologyDomainName` not specified → defaults to "grove-topology" + - Operator looks for TopologyDomain with configured name (defaults to "grove-topology") + - If TopologyDomain with that name doesn't exist → operator fails to start +- If `topology.enabled: false`: topology features disabled +- Admin must create TopologyDomain with matching name OR disable topology **Admin Responsibilities:** -- Manually create Kueue Topology with name "grove-topology" for KAI scheduler +- Manually create Kueue Topology with same name as Grove TopologyDomain for KAI scheduler - Ensure topology levels align between Grove TopologyDomain and Kueue Topology ### 2. 
Operator API Changes (Grove CRDs) @@ -247,10 +256,12 @@ spec: ```go type TopologyConstraint struct { -// PackDomain references a level name from TopologyDomain.Spec.Levels -// Defines required topology packing constraint for replicas -// Replicas packed together within specified topology level for network locality -PackDomain *string `json:"packDomain,omitempty"` +// PackLevel specifies the topology level name for grouping replicas +// Controls placement constraint for EACH individual replica instance +// Example: "rack" means each replica independently placed within one rack +// Note: Does NOT constrain all replicas to the same rack together +// Different replicas can be in different topology domains +PackLevel *string `json:"packLevel,omitempty"` } ``` @@ -311,9 +322,9 @@ TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` **Hierarchy Constraints:** -- Child PackDomain must be equal to or stricter than parent (stricter = higher index in levels list) +- Child PackLevel must be equal to or stricter than parent (stricter = higher index in levels list) - PodCliqueSet → PodCliqueScalingGroup → PodClique hierarchy -- Referenced PackDomain name must exist in TopologyDomain.Spec.Levels +- Referenced PackLevel name must exist in TopologyDomain.Spec.Levels - Validation applies on both CREATE and UPDATE operations ### 3. Scheduler API Changes (Contract with KAI) @@ -336,11 +347,11 @@ PodGroups []PodGroup `json:"podgroups"` // +optional TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` -// NetworkPackGroupConfigs defines groups of PodGroups for network optimization +// TopologyConstraintGroupConfigs defines groups of PodGroups for topology-aware placement // Enhanced with topology constraints for PCSG-level packing // Updated by operator on each reconciliation when PCSG topology constraints change // +optional -NetworkPackGroupConfigs []NetworkPackGroupConfig `json:"networkPackGroupConfigs,omitempty"` +TopologyConstraintGroupConfigs []TopologyConstraintGroupConfig `json:"topologyConstraintGroupConfigs,omitempty"` // PriorityClassName is the name of the PriorityClass for the PodGang PriorityClassName string `json:"priorityClassName,omitempty"` @@ -355,18 +366,18 @@ The operator adds topology information to PodGang metadata via annotation: // Annotation added to PodGang metadata: annotations: -grove.run.ai/topology-name: "grove-topology" +grove.run.ai/topology-name: "" ``` This annotation allows the scheduler to locate the Kueue Topology resource without requiring a spec field, providing flexibility for future API changes. 
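+
+As a sketch of the consumer side, the scheduler can resolve the referenced Kueue Topology from this annotation with
+an unstructured GET, so it needs no compile-time Kueue dependency (function name and wiring are illustrative):
+
+```go
+package scheduler
+
+import (
+	"context"
+	"fmt"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
+	"k8s.io/apimachinery/pkg/runtime/schema"
+	"k8s.io/client-go/dynamic"
+)
+
+var topologyGVR = schema.GroupVersionResource{
+	Group:    "kueue.x-k8s.io",
+	Version:  "v1alpha1",
+	Resource: "topologies",
+}
+
+// resolveTopology fetches the cluster-scoped Kueue Topology named by the PodGang annotation.
+func resolveTopology(ctx context.Context, dyn dynamic.Interface, podGang metav1.Object) (*unstructured.Unstructured, error) {
+	name, ok := podGang.GetAnnotations()["grove.run.ai/topology-name"]
+	if !ok {
+		return nil, fmt.Errorf("PodGang %q is missing the topology-name annotation", podGang.GetName())
+	}
+	return dyn.Resource(topologyGVR).Get(ctx, name, metav1.GetOptions{})
+}
+```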
-**NetworkPackGroupConfig:** +**TopologyConstraintGroupConfig:** ```go -// NetworkPackGroupConfig indicates PodGroups should be optimally placed w.r.t cluster's network topology -type NetworkPackGroupConfig struct { -// PodGroupNames is the list of PodGroup names in the network pack group +// TopologyConstraintGroupConfig defines topology constraints for a group of PodGroups +type TopologyConstraintGroupConfig struct { +// PodGroupNames is the list of PodGroup names in the topology constraint group PodGroupNames []string `json:"podGroupNames"` // TopologyConstraint defines topology packing constraints for this group @@ -415,10 +426,10 @@ Preferred *PackConstraint `json:"preferred,omitempty"` } type PackConstraint struct { -// PackDomain holds the topologyKey (not level name) for the topology constraint +// PackLevel holds the topologyKey (not level name) for the topology constraint // Operator translates user's level name to the corresponding topologyKey from TopologyDomain // Example: "topology.kubernetes.io/rack" or "kubernetes.io/hostname" -PackDomain string `json:"packDomain"` +PackLevel string `json:"packLevel"` } ``` @@ -427,13 +438,13 @@ PackDomain string `json:"packDomain"` Fields Added: - `PodGangSpec.TopologyConstraint *TopologyConstraint` - PodGang-level packing from PodCliqueSet (optional pointer) -- `NetworkPackGroupConfig.TopologyConstraint *TopologyConstraint` - PCSG-level packing from PodCliqueScalingGroup ( - optional pointer) +- `TopologyConstraintGroupConfig.TopologyConstraint *TopologyConstraint` - PCSG-level packing from + PodCliqueScalingGroup (optional pointer) - `PodGroup.TopologyConstraint *TopologyConstraint` - PodClique-level packing from PodClique (optional pointer) Annotations Added: -- `grove.run.ai/topology-name: "grove-topology"` - Annotation on PodGang metadata referencing topology name +- `grove.run.ai/topology-name: ""` - Annotation on PodGang metadata referencing topology name Fields Removed: @@ -447,8 +458,9 @@ The operator translates Grove operator API to Grove Scheduler API with three-lev **Topology Annotation:** -- Operator adds annotation `grove.run.ai/topology-name: "grove-topology"` to PodGang metadata -- KAI scheduler uses this annotation to locate the Kueue Topology CRD with name "grove-topology" +- Operator adds annotation `grove.run.ai/topology-name: ""` to PodGang metadata +- Annotation value matches the TopologyDomain name from operator configuration +- KAI scheduler uses this annotation to locate the corresponding Kueue Topology CRD - Annotation approach provides API flexibility for future changes without breaking spec **Constraint Translation (Required and Preferred):** @@ -457,10 +469,10 @@ The operator translates user's level names to topologyKeys and builds required/p **Required Constraints:** -- User specifies level name: `packDomain: "rack"` +- User specifies level name: `packLevel: "rack"` - Operator looks up topologyKey from TopologyDomain: `"topology.kubernetes.io/rack"` -- Writes to PodGang: `TopologyConstraint.Required.PackDomain = "topology.kubernetes.io/rack"` -- If user doesn't specify packDomain → `Required` is nil +- Writes to PodGang: `TopologyConstraint.Required.PackLevel = "topology.kubernetes.io/rack"` +- If user doesn't specify packLevel → `Required` is nil **Preferred Constraints (Auto-Generated):** @@ -472,27 +484,30 @@ The operator translates user's level names to topologyKeys and builds required/p **Three-Level Translation:** 1. 
**PodGang Level** (from PodCliqueSet): - - `PodGangSpec.TopologyConstraint.Required` ← topologyKey looked up from user's level name (if set) - - `PodGangSpec.TopologyConstraint.Preferred` ← topologyKey of strictest level (e.g., `"kubernetes.io/hostname"`) + - `PodGangSpec.TopologyConstraint.Required.PackLevel` ← topologyKey looked up from user's level name (if set) + - `PodGangSpec.TopologyConstraint.Preferred.PackLevel` ← topologyKey of strictest level (e.g., + `"kubernetes.io/hostname"`) -2. **NetworkPackGroup Level** (from PodCliqueScalingGroup): - - For each PCSG with TopologyConstraint, create NetworkPackGroupConfig - - `NetworkPackGroupConfig.TopologyConstraint.Required` ← topologyKey looked up from PCSG level name (if set) - - `NetworkPackGroupConfig.TopologyConstraint.Preferred` ← topologyKey of strictest level +2. **TopologyConstraintGroup Level** (from PodCliqueScalingGroup): + - For each PCSG with TopologyConstraint, create TopologyConstraintGroupConfig + - `TopologyConstraintGroupConfig.TopologyConstraint.Required.PackLevel` ← topologyKey looked up from PCSG level + name (if set) + - `TopologyConstraintGroupConfig.TopologyConstraint.Preferred.PackLevel` ← topologyKey of strictest level 3. **PodGroup Level** (from PodClique): - - `PodGroup.TopologyConstraint.Required` ← topologyKey looked up from PodClique level name (if set) - - `PodGroup.TopologyConstraint.Preferred` ← topologyKey of strictest level + - `PodGroup.TopologyConstraint.Required.PackLevel` ← topologyKey looked up from PodClique level name (if set) + - `PodGroup.TopologyConstraint.Preferred.PackLevel` ← topologyKey of strictest level **Example Translation:** -User creates PodCliqueSet: +User creates PodCliqueSet with 3 replicas: ```yaml spec: + replicas: 3 template: topologyConstraint: - packDomain: "rack" # User specifies level NAME + packLevel: "rack" # User specifies level NAME (per-replica constraint) ``` Operator translates to PodGang: @@ -501,11 +516,18 @@ Operator translates to PodGang: spec: topologyConstraint: required: - packDomain: "topology.kubernetes.io/rack" # Operator looks up topologyKEY + packLevel: "topology.kubernetes.io/rack" # Operator looks up topologyKEY preferred: - packDomain: "kubernetes.io/hostname" # Auto-generated topologyKEY of strictest level + packLevel: "kubernetes.io/hostname" # Auto-generated topologyKEY of strictest level ``` +**Per-Replica Behavior:** + +- Replica 0: all pods constrained to one rack (e.g., rack-a) +- Replica 1: all pods constrained to one rack (e.g., rack-b) +- Replica 2: all pods constrained to one rack (e.g., rack-a) +- Different replicas can be in different racks (NOT all forced to same rack) + **Hierarchy Validation:** - Child required constraints must be equal or stricter than parent required constraints @@ -550,7 +572,7 @@ Grove operator requires read access to TopologyDomain and permission to manage f ```yaml rules: - - apiGroups: [ "grove.run.ai" ] + - apiGroups: [ "grove.io" ] resources: [ "topologydomains", "topologydomains/finalizers" ] verbs: [ "get", "list", "watch", "update" ] ``` From 0c7eee0fa39e7e8d3208d7fdf43d4100e1e3cdee Mon Sep 17 00:00:00 2001 From: Ron Kahn Date: Wed, 29 Oct 2025 15:34:53 +0200 Subject: [PATCH 11/15] docs: enhance topology documentation with detailed level definitions and constraints Signed-off-by: Ron Kahn --- docs/designs/topology.md | 125 +++++++++++++++++++++++---------------- 1 file changed, 75 insertions(+), 50 deletions(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index a0e48c3a..efd514c9 100644 
--- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -94,11 +94,40 @@ TopologyDomain for KAI scheduler usage.* (also named "grove-topology") - **Default name**: "grove-topology" used when topologyDomainName not specified in operator config - **Immutable**: Once created, cannot be modified - **List-ordered hierarchy**: Index 0 = broadest (e.g., region), last = narrowest (e.g., host) +- **Predefined ordering**: Region > Zone > DataCenter > Block > SubBlock > Rack > Host > Numa (broadest to narrowest) - **Webhook-validated**: Webhook enforces singleton constraint (any name allowed) +**TopologyLevelName Definitions:** + +- **Region**: Network local to a CSP region +- **Zone**: Network local to a CSP availability-zone within a region +- **DataCenter**: Network local to a data-center within a CSP availability-zone +- **Block**: Network local to a switching block unit within a data-center +- **SubBlock**: Sub-switching block unit within a larger block +- **Rack**: First-level network grouping of compute hosts (includes NVLink domains as logical racks) +- **Host**: Individual compute host +- **Numa**: NUMA node (processor and memory locality domain) within a compute host + **API Structure:** ```go +// TopologyLevelName represents a predefined topology level in the hierarchy +type TopologyLevelName string + +const ( +TopologyLevelRegion TopologyLevelName = "region" +TopologyLevelZone TopologyLevelName = "zone" +TopologyLevelDataCenter TopologyLevelName = "datacenter" +TopologyLevelBlock TopologyLevelName = "block" +TopologyLevelSubBlock TopologyLevelName = "subblock" +TopologyLevelRack TopologyLevelName = "rack" +TopologyLevelHost TopologyLevelName = "host" +TopologyLevelNuma TopologyLevelName = "numa" +) + +// Topology ordering (broadest to narrowest): +// Region > Zone > DataCenter > Block > SubBlock > Rack > Host > Numa + // TopologyDomain defines the topology hierarchy for the cluster // This resource is immutable after creation // Only one TopologyDomain can exist cluster-wide with enforced name "grove-topology" @@ -112,22 +141,19 @@ Spec TopologyDomainSpec `json:"spec,omitempty"` type TopologyDomainSpec struct { // Levels is an ordered list of topology levels from broadest to narrowest scope // The order in this list defines the hierarchy (index 0 = highest level) -// This field is immutable +// This field is immutable after creation // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="levels list is immutable" // +kubebuilder:validation:MinItems=1 -// +kubebuilder:validation:MaxItems=10 +// +kubebuilder:validation:MaxItems=8 Levels []TopologyLevel `json:"levels"` } type TopologyLevel struct { -// Name is the level identifier used in TopologyConstraint references -// Must be a valid DNS label (lowercase alphanumeric with hyphens) -// Examples: "zone", "rack", "host" +// Name is the predefined level identifier used in TopologyConstraint references +// Must be one of: region, zone, datacenter, block, subblock, rack, host, numa // +kubebuilder:validation:Required -// +kubebuilder:validation:MinLength=1 -// +kubebuilder:validation:MaxLength=63 -// +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` -Name string `json:"name"` +// +kubebuilder:validation:Enum=region;zone;datacenter;block;subblock;rack;host;numa +Name TopologyLevelName `json:"name"` // TopologyKey is the node label key that identifies this topology domain // Must be a valid Kubernetes label key (qualified name) @@ -137,11 +163,6 @@ Name string `json:"name"` // 
+kubebuilder:validation:MaxLength=316 // +kubebuilder:validation:Pattern=`^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$` TopologyKey string `json:"topologyKey"` - -// Description provides human-readable information about this level -// +kubebuilder:validation:MaxLength=1024 -// +optional -Description string `json:"description,omitempty"` } ``` @@ -156,25 +177,20 @@ spec: levels: - name: region topologyKey: "topology.kubernetes.io/region" - description: "Cloud provider region" - name: zone topologyKey: "topology.kubernetes.io/zone" - description: "Availability zone within region" - name: datacenter topologyKey: "topology.kubernetes.io/datacenter" - description: "Data center within zone" - name: block topologyKey: "topology.kubernetes.io/block" - description: "Switching block within datacenter" + - name: subblock + topologyKey: "topology.kubernetes.io/subblock" - name: rack topologyKey: "topology.kubernetes.io/rack" - description: "Network rack grouping" - name: host topologyKey: "kubernetes.io/hostname" - description: "Individual compute host" - name: numa topologyKey: "topology.kubernetes.io/numa" - description: "NUMA node within host" ``` **Creating TopologyDomain:** @@ -190,7 +206,11 @@ spec: **Validation:** - Only one TopologyDomain allowed cluster-wide (webhook enforces singleton, any name allowed) +- Level names must be from predefined set: region, zone, datacenter, block, subblock, rack, host, numa (enum validation) - Each level `name` and `topologyKey` must be unique +- Mutation webhook automatically reorders levels to match predefined ordering (Region > Zone > DataCenter > Block > + SubBlock > Rack > Host > Numa) +- Admins can skip intermediate levels (e.g., define only region, rack, host) - Immutable after creation (webhook blocks updates) - Deletion protection via controller finalizer (blocks deletion while PodCliqueSet resources exist) @@ -258,10 +278,12 @@ spec: type TopologyConstraint struct { // PackLevel specifies the topology level name for grouping replicas // Controls placement constraint for EACH individual replica instance +// Must be one of: region, zone, datacenter, block, subblock, rack, host, numa // Example: "rack" means each replica independently placed within one rack // Note: Does NOT constrain all replicas to the same rack together // Different replicas can be in different topology domains -PackLevel *string `json:"packLevel,omitempty"` +// +kubebuilder:validation:Enum=region;zone;datacenter;block;subblock;rack;host;numa +PackLevel *TopologyLevelName `json:"packLevel,omitempty"` } ``` @@ -413,23 +435,25 @@ TopologyConstraint *TopologyConstraint `json:"topologyConstraint,omitempty"` ```go type TopologyConstraint struct { +// PackConstraint defines topology packing constraint with required and preferred levels +// Operator translates user's level name to corresponding topologyKeys +// +optional +PackConstraint *TopologyPackConstraint `json:"packConstraint,omitempty"` +} + +type TopologyPackConstraint struct { // Required defines topology constraint that must be satisfied -// Populated from user's packDomain specification in operator API +// Holds topologyKey (not level name) translated from user's packLevel specification +// Example: "topology.kubernetes.io/rack" // +optional -Required *PackConstraint `json:"required,omitempty"` +Required *string `json:"required,omitempty"` // Preferred defines best-effort topology constraint -// Auto-generated by operator using strictest level for optimization +// 
Auto-generated by operator using strictest level topologyKey for optimization // Scheduler can fallback to less strict levels if preferred cannot be satisfied +// Example: "kubernetes.io/hostname" // +optional -Preferred *PackConstraint `json:"preferred,omitempty"` -} - -type PackConstraint struct { -// PackLevel holds the topologyKey (not level name) for the topology constraint -// Operator translates user's level name to the corresponding topologyKey from TopologyDomain -// Example: "topology.kubernetes.io/rack" or "kubernetes.io/hostname" -PackLevel string `json:"packLevel"` +Preferred *string `json:"preferred,omitempty"` } ``` @@ -471,32 +495,33 @@ The operator translates user's level names to topologyKeys and builds required/p - User specifies level name: `packLevel: "rack"` - Operator looks up topologyKey from TopologyDomain: `"topology.kubernetes.io/rack"` -- Writes to PodGang: `TopologyConstraint.Required.PackLevel = "topology.kubernetes.io/rack"` -- If user doesn't specify packLevel → `Required` is nil +- Writes to PodGang: `TopologyConstraint.PackConstraint.Required = "topology.kubernetes.io/rack"` +- If user doesn't specify packLevel → `PackConstraint.Required` is nil **Preferred Constraints (Auto-Generated):** - Operator ALWAYS generates preferred constraint at all three levels - Uses topologyKey of strictest level (e.g., `"kubernetes.io/hostname"` for "host" level) +- Writes to PodGang: `TopologyConstraint.PackConstraint.Preferred = "kubernetes.io/hostname"` - Enables out-of-box optimization even without user configuration - Scheduler can fallback to less strict levels if preferred cannot be satisfied **Three-Level Translation:** 1. **PodGang Level** (from PodCliqueSet): - - `PodGangSpec.TopologyConstraint.Required.PackLevel` ← topologyKey looked up from user's level name (if set) - - `PodGangSpec.TopologyConstraint.Preferred.PackLevel` ← topologyKey of strictest level (e.g., + - `PodGangSpec.TopologyConstraint.PackConstraint.Required` ← topologyKey looked up from user's level name (if set) + - `PodGangSpec.TopologyConstraint.PackConstraint.Preferred` ← topologyKey of strictest level (e.g., `"kubernetes.io/hostname"`) 2. **TopologyConstraintGroup Level** (from PodCliqueScalingGroup): - For each PCSG with TopologyConstraint, create TopologyConstraintGroupConfig - - `TopologyConstraintGroupConfig.TopologyConstraint.Required.PackLevel` ← topologyKey looked up from PCSG level + - `TopologyConstraintGroupConfig.TopologyConstraint.PackConstraint.Required` ← topologyKey looked up from PCSG level name (if set) - - `TopologyConstraintGroupConfig.TopologyConstraint.Preferred.PackLevel` ← topologyKey of strictest level + - `TopologyConstraintGroupConfig.TopologyConstraint.PackConstraint.Preferred` ← topologyKey of strictest level 3. 
**PodGroup Level** (from PodClique): - - `PodGroup.TopologyConstraint.Required.PackLevel` ← topologyKey looked up from PodClique level name (if set) - - `PodGroup.TopologyConstraint.Preferred.PackLevel` ← topologyKey of strictest level + - `PodGroup.TopologyConstraint.PackConstraint.Required` ← topologyKey looked up from PodClique level name (if set) + - `PodGroup.TopologyConstraint.PackConstraint.Preferred` ← topologyKey of strictest level **Example Translation:** @@ -515,10 +540,9 @@ Operator translates to PodGang: ```yaml spec: topologyConstraint: - required: - packLevel: "topology.kubernetes.io/rack" # Operator looks up topologyKEY - preferred: - packLevel: "kubernetes.io/hostname" # Auto-generated topologyKEY of strictest level + packConstraint: + required: "topology.kubernetes.io/rack" # Operator looks up topologyKEY + preferred: "kubernetes.io/hostname" # Auto-generated topologyKEY of strictest level ``` **Per-Replica Behavior:** @@ -548,14 +572,15 @@ When a PodCliqueSet is created or updated, the Grove Operator translates it into **Translation Steps:** -1. User creates PodCliqueSet with optional `topologyConstraint.packDomain` (level name, e.g., "rack") +1. User creates PodCliqueSet with optional `topologyConstraint.packLevel` (level name, e.g., "rack") 2. Operator loads TopologyDomain "grove-topology" and builds PodGang: - Looks up topologyKey for each user-specified level name (e.g., "rack" → "topology.kubernetes.io/rack") - - **PodGang level**: Required (topologyKey from PCS name) + Preferred (topologyKey of strictest level) - - **NetworkPackGroup level**: Required (topologyKey from PCSG name) + Preferred (topologyKey of strictest level) - - **PodGroup level**: Required (topologyKey from PodClique name) + Preferred (topologyKey of strictest level) + - **PodGang level**: PackConstraint with Required (topologyKey from PCS) + Preferred (strictest topologyKey) + - **NetworkPackGroup level**: PackConstraint with Required (topologyKey from PCSG) + Preferred (strictest + topologyKey) + - **PodGroup level**: PackConstraint with Required (topologyKey from PodClique) + Preferred (strictest topologyKey) - Adds annotation `grove.run.ai/topology-name: "grove-topology"` to PodGang metadata -3. KAI scheduler reads annotation, uses topologyKeys to apply three-level topology constraints +3. 
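KAI scheduler reads annotation, uses packConstraints to apply three-level topology constraints
+
+Pulling the lookup and auto-generation rules together, a sketch of how the operator could build one PackConstraint
+(types follow the APIs above; the helper name and import paths are illustrative):
+
+```go
+package podgang
+
+import (
+	"fmt"
+
+	grovev1alpha1 "github.com/NVIDIA/grove/operator/api/core/v1alpha1"  // illustrative path
+	schedulerv1alpha1 "github.com/NVIDIA/grove/scheduler/api/v1alpha1" // illustrative path
+)
+
+// buildPackConstraint resolves a user-facing level name to its topologyKey and
+// pairs it with the always-generated preferred constraint (strictest level).
+func buildPackConstraint(td *grovev1alpha1.TopologyDomain, packLevel *grovev1alpha1.TopologyLevelName) (*schedulerv1alpha1.TopologyPackConstraint, error) {
+	pc := &schedulerv1alpha1.TopologyPackConstraint{}
+
+	// Preferred: topologyKey of the narrowest configured level, e.g. "kubernetes.io/hostname".
+	// The mutation webhook keeps Levels in predefined order, so the last entry is strictest.
+	strictest := td.Spec.Levels[len(td.Spec.Levels)-1].TopologyKey
+	pc.Preferred = &strictest
+
+	// Required: only set when the user specified a packLevel; level name → topologyKey.
+	if packLevel == nil {
+		return pc, nil
+	}
+	for _, level := range td.Spec.Levels {
+		if level.Name == *packLevel {
+			key := level.TopologyKey
+			pc.Required = &key
+			return pc, nil
+		}
+	}
+	return nil, fmt.Errorf("packLevel %q not found in TopologyDomain %q", *packLevel, td.Name)
+}
+```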

### End-to-End Flow

From 27acac9201dec40714227393e5c4ad5b04204191 Mon Sep 17 00:00:00 2001
From: Ron Kahn
Date: Wed, 29 Oct 2025 21:24:16 +0200
Subject: [PATCH 12/15] docs: update topology documentation to remove references to SubBlock and clarify level definitions

Signed-off-by: Ron Kahn
---
 docs/designs/topology.md | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/docs/designs/topology.md b/docs/designs/topology.md
index efd514c9..8a7fbcf2 100644
--- a/docs/designs/topology.md
+++ b/docs/designs/topology.md
@@ -94,7 +94,7 @@ TopologyDomain for KAI scheduler usage.* (also named "grove-topology")
 - **Default name**: "grove-topology" used when topologyDomainName not specified in operator config
 - **Immutable**: Once created, cannot be modified
 - **List-ordered hierarchy**: Index 0 = broadest (e.g., region), last = narrowest (e.g., host)
-- **Predefined ordering**: Region > Zone > DataCenter > Block > SubBlock > Rack > Host > Numa (broadest to narrowest)
+- **Predefined ordering**: Region > Zone > DataCenter > Block > Rack > Host > Numa (broadest to narrowest)
 - **Webhook-validated**: Webhook enforces singleton constraint (any name allowed)
 
 **TopologyLevelName Definitions:**
@@ -103,7 +103,6 @@ TopologyDomain for KAI scheduler usage.* (also named "grove-topology")
 - **Zone**: Network local to a CSP availability-zone within a region
 - **DataCenter**: Network local to a data-center within a CSP availability-zone
 - **Block**: Network local to a switching block unit within a data-center
-- **SubBlock**: Sub-switching block unit within a larger block
 - **Rack**: First-level network grouping of compute hosts (includes NVLink domains as logical racks)
 - **Host**: Individual compute host
 - **Numa**: NUMA node (processor and memory locality domain) within a compute host
@@ -119,14 +118,13 @@ TopologyLevelRegion TopologyLevelName = "region"
 TopologyLevelZone TopologyLevelName = "zone"
 TopologyLevelDataCenter TopologyLevelName = "datacenter"
 TopologyLevelBlock TopologyLevelName = "block"
-TopologyLevelSubBlock TopologyLevelName = "subblock"
 TopologyLevelRack TopologyLevelName = "rack"
 TopologyLevelHost TopologyLevelName = "host"
 TopologyLevelNuma TopologyLevelName = "numa"
 )
 
 // Topology ordering (broadest to narrowest):
-// Region > Zone > DataCenter > Block > SubBlock > Rack > Host > Numa
+// Region > Zone > DataCenter > Block > Rack > Host > Numa
 
 // TopologyDomain defines the topology hierarchy for the cluster
 // This resource is immutable after creation
@@ -150,9 +148,9 @@ Levels []TopologyLevel `json:"levels"`
 
 type TopologyLevel struct {
 // Name is the predefined level identifier used in TopologyConstraint references
-// Must be one of: region, zone, datacenter, block, subblock, rack, host, numa
+// Must be one of: region, zone, datacenter, block, rack, host, numa
 // +kubebuilder:validation:Required
-// +kubebuilder:validation:Enum=region;zone;datacenter;block;subblock;rack;host;numa
+// +kubebuilder:validation:Enum=region;zone;datacenter;block;rack;host;numa
 Name TopologyLevelName `json:"name"`
 
 // TopologyKey is the node label key that identifies this topology domain
@@ -183,8 +181,6 @@ spec:
   topologyKey: "topology.kubernetes.io/datacenter"
   - name: block
   topologyKey: "topology.kubernetes.io/block"
-  - name: subblock
-  topologyKey: "topology.kubernetes.io/subblock"
   - name: rack
   topologyKey: "topology.kubernetes.io/rack"
   - name: host
   topologyKey: "kubernetes.io/hostname"
   - name: numa
   topologyKey: "topology.kubernetes.io/numa"
@@ -206,10 +202,10 @@ 
**Validation:** - Only one TopologyDomain allowed cluster-wide (webhook enforces singleton, any name allowed) -- Level names must be from predefined set: region, zone, datacenter, block, subblock, rack, host, numa (enum validation) +- Level names must be from predefined set: region, zone, datacenter, block, rack, host, numa (enum validation) - Each level `name` and `topologyKey` must be unique - Mutation webhook automatically reorders levels to match predefined ordering (Region > Zone > DataCenter > Block > - SubBlock > Rack > Host > Numa) + Rack > Host > Numa) - Admins can skip intermediate levels (e.g., define only region, rack, host) - Immutable after creation (webhook blocks updates) - Deletion protection via controller finalizer (blocks deletion while PodCliqueSet resources exist) @@ -278,11 +274,11 @@ spec: type TopologyConstraint struct { // PackLevel specifies the topology level name for grouping replicas // Controls placement constraint for EACH individual replica instance -// Must be one of: region, zone, datacenter, block, subblock, rack, host, numa +// Must be one of: region, zone, datacenter, block, rack, host, numa // Example: "rack" means each replica independently placed within one rack // Note: Does NOT constrain all replicas to the same rack together // Different replicas can be in different topology domains -// +kubebuilder:validation:Enum=region;zone;datacenter;block;subblock;rack;host;numa +// +kubebuilder:validation:Enum=region;zone;datacenter;block;rack;host;numa PackLevel *TopologyLevelName `json:"packLevel,omitempty"` } ``` From b683fb1481aa01e36235985828e89a3376b930e0 Mon Sep 17 00:00:00 2001 From: Ron Kahn Date: Wed, 29 Oct 2025 22:34:11 +0200 Subject: [PATCH 13/15] docs: update topology documentation to reflect changes in operator config and annotations Signed-off-by: Ron Kahn --- docs/designs/topology.md | 27 +++++++++++---------------- 1 file changed, 11 insertions(+), 16 deletions(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index 8a7fbcf2..797ed6a1 100644 --- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -53,7 +53,7 @@ while allowing users to specify required constraints for strict placement (upper │ └──────────┬───────────┘ └───────────┬──────────┘ │ │ │ │ │ │ │ │ │ -│ Operator Config: OperatorConfiguration.EnableTopology=true │ +│ Operator Config: topology.enabled=true │ │ │ │ │ │ │ (validates against) │ (referenced by) │ ├─────────────┼───────────────────────────────────┼───────────────────────┤ @@ -196,7 +196,7 @@ spec: - Use custom name (e.g., "my-cluster-topology") OR - Use default name "grove-topology" (no config needed) 3. Create resource: `kubectl apply -f topologydomain.yaml` -4. If using custom name: configure operator with topology name in OperatorConfiguration +4. If using custom name: configure operator with topology name in operator config 5. 
Manually create Kueue Topology with same name and aligned levels for KAI scheduler **Validation:** @@ -239,17 +239,12 @@ Key Points: #### Operator Configuration -Operator enables/disables topology features via OperatorConfiguration manifest: +Operator enables/disables topology features via operator config: ```yaml -apiVersion: grove.io/v1alpha1 -kind: OperatorConfiguration -metadata: - name: grove-operator-config -spec: - topology: - enabled: true - topologyDomainName: "my-cluster-topology" # Optional, defaults to "grove-topology" +topology: + enabled: true + topologyDomainName: "my-cluster-topology" # Optional, defaults to "grove-topology" ``` **Startup Behavior:** @@ -259,7 +254,7 @@ spec: - Operator looks for TopologyDomain with configured name (defaults to "grove-topology") - If TopologyDomain with that name doesn't exist → operator fails to start - If `topology.enabled: false`: topology features disabled -- Admin must create TopologyDomain with matching name OR disable topology +- Admin must create TopologyDomain with matching name OR disable topology in operator config **Admin Responsibilities:** @@ -384,7 +379,7 @@ The operator adds topology information to PodGang metadata via annotation: // Annotation added to PodGang metadata: annotations: -grove.run.ai/topology-name: "" +grove.io/topology-name: "" ``` This annotation allows the scheduler to locate the Kueue Topology resource without requiring a spec field, providing @@ -464,7 +459,7 @@ Fields Added: Annotations Added: -- `grove.run.ai/topology-name: ""` - Annotation on PodGang metadata referencing topology name +- `grove.io/topology-name: ""` - Annotation on PodGang metadata referencing topology name Fields Removed: @@ -478,7 +473,7 @@ The operator translates Grove operator API to Grove Scheduler API with three-lev **Topology Annotation:** -- Operator adds annotation `grove.run.ai/topology-name: ""` to PodGang metadata +- Operator adds annotation `grove.io/topology-name: ""` to PodGang metadata - Annotation value matches the TopologyDomain name from operator configuration - KAI scheduler uses this annotation to locate the corresponding Kueue Topology CRD - Annotation approach provides API flexibility for future changes without breaking spec @@ -575,7 +570,7 @@ When a PodCliqueSet is created or updated, the Grove Operator translates it into - **NetworkPackGroup level**: PackConstraint with Required (topologyKey from PCSG) + Preferred (strictest topologyKey) - **PodGroup level**: PackConstraint with Required (topologyKey from PodClique) + Preferred (strictest topologyKey) - - Adds annotation `grove.run.ai/topology-name: "grove-topology"` to PodGang metadata + - Adds annotation `grove.io/topology-name: "grove-topology"` to PodGang metadata 3. 
KAI scheduler reads annotation, uses packConstraints to apply three-level topology constraints ### End-to-End Flow From 80c28d5d166a6907744913cee50c38f939bf8135 Mon Sep 17 00:00:00 2001 From: Sanjay Chatterjee Date: Thu, 30 Oct 2025 09:48:52 -0700 Subject: [PATCH 14/15] Update docs/designs/topology.md Co-authored-by: Madhav Bhargava Signed-off-by: Sanjay Chatterjee --- docs/designs/topology.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index 797ed6a1..1e4d840d 100644 --- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -93,7 +93,7 @@ TopologyDomain for KAI scheduler usage.* (also named "grove-topology") - **Cluster-scoped singleton**: Only one TopologyDomain allowed cluster-wide, user chooses name - **Default name**: "grove-topology" used when topologyDomainName not specified in operator config - **Immutable**: Once created, cannot be modified -- **List-ordered hierarchy**: Index 0 = broadest (e.g., region), last = narrowest (e.g., host) +- **List-ordered hierarchy**: Index 0 represents the broadest category (e.g., region), and the final index represents the narrowest (e.g., host). - **Predefined ordering**: Region > Zone > DataCenter > Block > Rack > Host > Numa (broadest to narrowest) - **Webhook-validated**: Webhook enforces singleton constraint (any name allowed) From 2eb68c038b3b06ee0572909468ac49b7f16c8d77 Mon Sep 17 00:00:00 2001 From: Ron Kahn Date: Thu, 30 Oct 2025 19:13:18 +0200 Subject: [PATCH 15/15] docs: clarify TopologyDomain definition and naming conventions in documentation Signed-off-by: Ron Kahn --- docs/designs/topology.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/designs/topology.md b/docs/designs/topology.md index 1e4d840d..2500073e 100644 --- a/docs/designs/topology.md +++ b/docs/designs/topology.md @@ -83,18 +83,19 @@ while allowing users to specify required constraints for strict placement (upper #### TopologyDomain CR -TopologyDomain is a cluster-scoped CR that defines the topology hierarchy for scheduling. It maps friendly level names -to Kubernetes node labels and establishes ordering from broadest to narrowest scope. +TopologyDomain is a cluster-scoped CR that defines consistent naming for cluster topology hierarchy to be used by +workload designers. It maps topology level domains to Kubernetes node labels and establishes ordering from broadest to +narrowest scope. *note: this CR is independent of Kueue Topology CRD, which must be manually created by admin to align with Grove's TopologyDomain for KAI scheduler usage.* (also named "grove-topology") **Characteristics:** -- **Cluster-scoped singleton**: Only one TopologyDomain allowed cluster-wide, user chooses name -- **Default name**: "grove-topology" used when topologyDomainName not specified in operator config +- **Cluster-scoped singleton**: Only one TopologyDomain allowed cluster-wide +- **Default name**: In operator configuration, topologyDomainName defaults to "grove-topology" when not specified - **Immutable**: Once created, cannot be modified - **List-ordered hierarchy**: Index 0 represents the broadest category (e.g., region), and the final index represents the narrowest (e.g., host). 
-- **Predefined ordering**: Region > Zone > DataCenter > Block > Rack > Host > Numa (broadest to narrowest) +- **Supported topology levels**: Region > Zone > DataCenter > Block > Rack > Host > Numa (broadest to narrowest) - **Webhook-validated**: Webhook enforces singleton constraint (any name allowed) **TopologyLevelName Definitions:**