diff --git a/README.md b/README.md index 3f35935f..c70b0665 100644 --- a/README.md +++ b/README.md @@ -1,74 +1,81 @@ -> [!NOTE] -> -> :construction_worker: `This project site is currently under active construction, keep watching for announcements!` - # Grove +Modern AI inference workloads need capabilities that Kubernetes doesn't provide out-of-the-box: + +- **Gang scheduling** - Prefill and decode pods must start together or not at all +- **Grouped scaling** - Tightly-coupled components that need to scale as a unit +- **Startup ordering** - Different components in a workload which must start in an explicit ordering +- **Topology-aware placement** - NVLink-connected GPUs or workloads shouldn't be scattered across nodes + +Grove is a Kubernetes API that provides a single declarative interface for orchestrating any AI inference workload — from simple, single-pod deployments to complex multi-node, disaggregated systems. Grove lets you scale your multinode inference deployment from a single replica to data center scale, supporting tens of thousands of GPUs. It allows you to describe your whole inference serving system in Kubernetes - e.g. prefill, decode, routing or any other component - as a single Custom Resource Definition (CRD). From that one spec, the platform coordinates hierarchical gang scheduling, topology‑aware placement, multi-level autoscaling and explicit startup ordering. You get precise control of how the system behaves without stitching together scripts, YAML files, or custom controllers. + +**One API. Any inference architecture.** + +## Quick Start + +Get Grove running in 5 minutes: [![Go Report Card](https://goreportcard.com/badge/github.com/ai-dynamo/grove/operator)](https://goreportcard.com/report/github.com/NVIDIA/grove/operator) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/grove)](https://github.com/ai-dynamo/grove/releases/latest) [![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/GF45xZAX) +[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/grove) + +```bash +# 1. Create a local kind cluster +cd operator && make kind-up -Grove is a Kubernetes API purpose-built for orchestrating AI workloads on GPU clusters. The modern inference landscape spans a wide range of workload types — from traditional single-node deployments where each model instance runs in a single pod, to large-scale disaggregated systems where one model instance may include multiple components such as prefill and decode, each distributed across many pods and nodes. Grove is designed to unify this entire spectrum under a single API, allowing developers to declaratively represent any inference workload by composing as many components as their system requires — whether single-node or multi-node — within one cohesive custom resource. +# 2. Deploy Grove +make deploy -Additionally, as workloads scale in size and complexity, achieving efficient resource utilization and optimal performance depends on capabilities such as all-or-nothing (“gang”) scheduling, topology-aware placement, prescriptive startup ordering, and independent scaling of components. Grove is designed with these needs as first-class citizens — providing native abstractions for expressing scheduling intent, topology constraints, startup dependencies, and per-component scaling behaviors that can be directly interpreted by underlying schedulers. 
+# 3. Deploy your first workload +kubectl apply -f samples/simple/simple1.yaml -## Core Concepts +# 4. Fetch the resources created by grove +kubectl get pcs,pclq,pcsg,pg,pod -owide +``` -The Grove API consists of a user API and a scheduling API. While the user API (`PodCliqueSet`, `PodClique`, `PodCliqueScalingGroup`) allows users to represent their AI workloads, the scheduling API (`PodGang`) enables scheduler integration to support the network topology-optimized gang-scheduling and auto-scaling requirements of the workload. +**→ [Installation Docs](docs/installation.md)** -| Concept | Description | -|---------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| [PodCliqueSet](operator/api/core/v1alpha1/podcliqueset.go) | The top-level Grove object that defines a group of components managed and colocated together. Also supports autoscaling with topology aware spread of PodCliqueSet replicas for availability. | -| [PodClique](operator/api/core/v1alpha1/podclique.go) | A group of pods representing a specific role (e.g., leader, worker, frontend). Each clique has an independent configuration and supports custom scaling logic. | -| [PodCliqueScalingGroup](operator/api/core/v1alpha1/scalinggroup.go) | A set of PodCliques that scale and are scheduled together as a gang. Ideal for tightly coupled roles like prefill leader and worker. | -| [PodGang](scheduler/api/core/v1alpha1/podgang.go) | The scheduler API that defines a unit of gang-scheduling. A PodGang is a collection of groups of similar pods, where each pod group defines a minimum number of replicas guaranteed for gang-scheduling. | +## What Grove Solves +Grove handles the complexities of modern AI inference deployments: -## Key Capabilities +| Your Setup | What Grove Does | +|------------|-----------------| +| **Disaggregated inference** (prefill + decode) | Gang schedules all components together, scales them independently and as a unit | +| **Multi-model pipelines** | Enforces startup order (router → workers), auto-scales each stage | +| **Multi-node inference** (DeepSeek-R1, Llama 405B) | Packs pods onto NVLink-connected GPUs for optimal network performance | +| **Simple single-pod serving** | Works for this too! One API for any architecture | -- **Declarative composition of Role-Based Pod Groups** - `PodCliqueSet` API provides users a capability to declaratively compose tightly coupled group of pods with explicit role based logic, e.g. disaggregated roles in a model serving stack such as `prefill`, `decode` and `routing`. -- **Flexible Gang Scheduling** - `PodClique`'s and `PodCliqueScalingGroup`s allow users to specify flexible gang-scheduling requirements at multiple levels within a `PodCliqueSet` to prevent resource deadlocks. -- **Multi-level Horizontal Auto-Scaling** - Supports pluggable horizontal auto-scaling solutions to scale `PodCliqueSet`, `PodClique` and `PodCliqueScalingGroup` custom resources. -- **Network Topology-Aware Scheduling** - Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability. -- **Custom Startup Dependencies** - Prescribe the order in which the `PodClique`s must start in a declarative specification. Pod startup is decoupled from pod creation or scheduling. 
-- **Resource-Aware Rolling Updates** - Supports reuse of resource reservations of `Pod`s during updates in order to preserve topology-optimized placement. +**Use Cases:** [Multi-node disaggregated](docs/assets/multinode-disaggregated.excalidraw.png) · [Single-node disaggregated](docs/assets/singlenode-disaggregated.excalidraw.png) · [Agentic pipelines](docs/assets/agentic-pipeline.excalidraw.png) · [Standard serving](docs/assets/singlenode-aggregated.excalidraw.png) -## Example Use Cases +## How It Works -- **Multi-Node, Disaggregated Inference for large models** ***(DeepSeek-R1, Llama-4-Maverick)*** : [Visualization](docs/assets/multinode-disaggregated.excalidraw.png) -- **Single-Node, Disaggregated Inference** : [Visualization](docs/assets/singlenode-disaggregated.excalidraw.png) -- **Agentic Pipeline of Models** : [Visualization](docs/assets/agentic-pipeline.excalidraw.png) -- **Standard Aggregated Single Node or Single GPU Inference** : [Visualization](docs/assets/singlenode-aggregated.excalidraw.png) +Grove introduces four simple concepts: -## Getting Started +| Concept | What It Does | +|---------|--------------| +| **PodCliqueSet** | Your entire workload (e.g., "my-inference-stack") | +| **PodClique** | A component role (e.g., "prefill", "decode", "router") | +| **PodCliqueScalingGroup** | Components that must scale together (e.g., prefill + decode) | +| **PodGang** | Internal scheduler primitive for gang scheduling (you don't touch this) | -You can get started with the Grove operator by following our [installation guide](docs/installation.md). +**→ [API Reference](docs/api-reference/operator-api.md)** ## Roadmap ### 2025 Priorities -Update: We are aligning our release schedule with [Nvidia Dynamo](https://github.com/ai-dynamo/dynamo) to ensure seamless integration. Once our release cadence (e.g., weekly, monthly) is finalized, it will be reflected here. +> **Note:** We are aligning our release schedule with [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo) to ensure seamless integration. Release dates will be updated once our cadence (e.g., weekly, monthly) is finalized. -**Release v0.1.0** *(ETA: Mid September 2025)* -- Grove v1alpha1 API -- Hierarchical Gang Scheduling and Gang Termination +**Q4 2025** +- Topology-Aware Scheduling - Multi-Level Horizontal Auto-Scaling - Startup Ordering - Rolling Updates -**Release v0.2.0** *(ETA: October 2025)* -- Topology-Aware Scheduling +**Q1 2026** - Resource-Optimized Rolling Updates - -**Release v0.3.0** *(ETA: November 2025)* - Multi-Node NVLink Auto-Scaling Support ## Contributions diff --git a/docs/installation.md b/docs/installation.md index b6fa3d54..aa09c449 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -11,7 +11,7 @@ You can use the published [Helm `grove-charts` package](https://github.com/ai-dy helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts: ``` -You could also deploy Grove to your cluster through the provided make targets, by following [installation using make targets](#installation-using-make-targets). +You could also deploy Grove to your cluster through the provided make targets, by following [remote cluster setup](#remote-cluster-set-up) and [installation using make targets](#installation-using-make-targets). 
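+
+Whichever installation route you pick, a quick way to verify that the operator came up is to check for its pod (a sketch; assumes the default labels applied by the Grove Helm chart):
+
+```bash
+kubectl get pods -l app.kubernetes.io/name=grove-operator
+```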
## Developing Grove @@ -23,12 +23,34 @@ All grove operator Make targets are located in [Operator Makefile](../operator/M In case you wish to develop Grove using a local kind cluster, please do the following: -- To set up a KIND cluster with local docker registry run the following command: +- **Navigate to the operator directory:** + + ```bash + cd operator + ``` + +- **Set up a KIND cluster with local docker registry:** ```bash make kind-up ``` +- **Optional**: To create a KIND cluster with fake nodes for testing at scale, specify the number of fake nodes: + + ```bash + # Create a cluster with 20 fake nodes + make kind-up FAKE_NODES=20 + ``` + + This will automatically install [KWOK](https://kwok.sigs.k8s.io/) (Kubernetes WithOut Kubelet) and create the specified number of fake nodes. These fake nodes are tainted with `fake-node=true:NoSchedule`, so you'll need to add the following toleration to your pod specs to schedule on them: + + ```yaml + tolerations: + - key: fake-node + operator: Exists + effect: NoSchedule + ``` + - Specify the `KUBECONFIG` environment variable in your shell session to the path printed out at the end of the previous step: ```bash @@ -57,10 +79,16 @@ If you wish to use your own Kubernetes cluster instead of the KIND cluster, foll ### Installation using make targets +> **Important:** All commands in this section must be run from the `operator/` directory. + ```bash -# If you wish to deploy all Grove Operator resources in a custom namespace then set the `NAMESPACE` environment variable +# Navigate to the operator directory (if not already there) +cd operator + +# Optional: Deploy to a custom namespace export NAMESPACE=custom-ns -# if `NAMESPACE` environment variable is set then `make deploy` target will use this namespace to deploy all Grove operator resources + +# Deploy Grove operator and all resources make deploy ``` @@ -76,6 +104,8 @@ This make target leverages Grove [Helm](https://helm.sh/) charts and [Skaffold]( ## Deploy a `PodCliqueSet` +> **Important:** Ensure you're in the `operator/` directory for the relative path to work. + - Deploy one of the samples present in the [samples](../operator/samples/simple) directory. ```bash @@ -127,7 +157,7 @@ As specified in the [README.md](../README.md) and the [docs](../docs), there are - Let's try scaling the `PodCliqueScalingGroup` from 1 to 2 replicas: ```bash - kubectl scale pcsg simple1-0-pcsg --replicas=2 + kubectl scale pcsg simple1-0-sga --replicas=2 ``` This will create new pods that associate with cliques that belong to this scaling group, and their associated `PodGang`s. @@ -176,7 +206,7 @@ As specified in the [README.md](../README.md) and the [docs](../docs), there are Similarly, the `PodCliqueScalingGroup` can be scaled back in to 1 replicas like so: ```bash - kubectl scale pcsg simple1-0-pcsg --replicas=1 + kubectl scale pcsg simple1-0-sga --replicas=1 ``` - Scaling can also be triggered at the `PodCliqueSet` level, as can be seen here: @@ -236,6 +266,94 @@ As specified in the [README.md](../README.md) and the [docs](../docs), there are kubectl scale pcs simple1 --replicas=1 ``` +## Troubleshooting + +### Deployment Issues + +#### `make deploy` fails with "No rule to make target 'deploy'" + +**Cause:** You're running the command from the wrong directory. + +**Solution:** Ensure you're in the `operator/` directory: +```bash +cd operator +make deploy +``` + +#### `make deploy` fails with "unable to connect to Kubernetes" + +**Cause:** The `KUBECONFIG` environment variable is not set correctly. 
+ +**Solution:** Export the kubeconfig for your kind cluster: +```bash +kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig +export KUBECONFIG=$(pwd)/hack/kind/kubeconfig +make deploy +``` + +#### Grove operator pod is in `CrashLoopBackOff` + +**Cause:** Check the operator logs for specific errors. + +**Solution:** +```bash +kubectl logs -l app.kubernetes.io/name=grove-operator +``` + +### Runtime Issues + +#### Pods stuck in `Pending` state + +**Cause:** Gang scheduling requirements might not be met, or there aren't enough resources. + +**Solution:** +1. Check PodGang status: + ```bash + kubectl get pg -o yaml + ``` +2. Check if MinAvailable requirements can be satisfied by your cluster resources +3. Check node resources: + ```bash + kubectl describe nodes + ``` + +#### `kubectl scale` command fails with "not found" + +**Cause:** The resource name might be incorrect. + +**Solution:** List the actual resource names first: +```bash +# For PodCliqueScalingGroups +kubectl get pcsg + +# For PodCliqueSets +kubectl get pcs +``` + +Then use the exact name from the output. + +#### PodCliqueScalingGroup not auto-scaling + +**Cause:** HPA might not be created or metrics-server might be missing. + +**Solution:** +1. Verify HPA exists: + ```bash + kubectl get hpa + ``` +2. Check if metrics-server is running (required for HPA): + ```bash + kubectl get deployment metrics-server -n kube-system + ``` +3. For kind clusters, you may need to install metrics-server separately. + +### Getting Help + +If you encounter issues not covered here: +1. Check the [GitHub Issues](https://github.com/NVIDIA/grove/issues) for similar problems +2. Join the [Grove mailing list](https://groups.google.com/g/grove-k8s) +3. Start a [discussion thread](https://github.com/NVIDIA/grove/discussions) + ## Supported Schedulers Currently the following schedulers support gang scheduling of `PodGang`s created by the Grove operator: diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 00000000..d3aeed03 --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,220 @@ +# Quickstart: Deploy Your First Workload with Grove + +This guide will walk you through deploying a simple disaggregated AI workload using Grove in about 10 minutes. + +## What You'll Learn + +By the end of this quickstart, you'll understand how to: +- Deploy a multi-component workload with Grove +- Scale components independently or as a group +- Observe gang scheduling in action + +## Prerequisites + +- A Kubernetes cluster (we'll use kind for local testing) +- `kubectl` installed and configured +- Docker Desktop running (for kind) +- 10-15 minutes + +## Understanding the Example Workload + +We'll deploy a simple workload that mimics a disaggregated inference setup with four components: + +- **Role A (pca)**: Auto-scaling component (e.g., routing layer) - scales based on CPU +- **Role B (pcb)** and **Role C (pcc)**: Tightly-coupled components (e.g., prefill + decode) - must scale together as a group +- **Role D (pcd)**: Fixed-size component (e.g., model cache) - doesn't auto-scale + +This demonstrates Grove's key capabilities: individual auto-scaling, grouped scaling, and gang scheduling. + +## Step 1: Set Up Your Local Environment + +### Install Grove Operator + +Follow the [installation guide](installation.md) to: +1. Create a kind cluster with `make kind-up` +2. Export the KUBECONFIG +3. 
Deploy Grove with `make deploy` + +**Quick check:** Verify Grove operator is running: +```bash +kubectl get pods -l app.kubernetes.io/name=grove-operator +``` + +You should see a pod in `Running` status. + +## Step 2: Deploy Your First PodCliqueSet + +Create a PodCliqueSet that defines all four components: + +```bash +cd operator +kubectl apply -f samples/simple/simple1.yaml +``` + +**What just happened?** Grove created: +- 1 `PodCliqueSet` (the top-level resource) +- 4 `PodCliques` (one for each component role) +- 1 `PodCliqueScalingGroup` (grouping pcb and pcc) +- 1 `PodGang` (for gang scheduling) +- 9 `Pods` (3 for pca, 2 each for pcb/pcc/pcd) + +## Step 3: Observe the Resources + +Watch as Grove creates and schedules all resources: + +```bash +kubectl get pcs,pclq,pcsg,pg,pod -owide +``` + +Expected output: +``` +NAME AGE +podcliqueset.grove.io/simple1 34s + +NAME AGE +podclique.grove.io/simple1-0-pca 33s +podclique.grove.io/simple1-0-pcd 33s +podclique.grove.io/simple1-0-sga-0-pcb 33s +podclique.grove.io/simple1-0-sga-0-pcc 33s + +NAME AGE +podcliquescalinggroup.grove.io/simple1-0-sga 33s + +NAME AGE +podgang.scheduler.grove.io/simple1-0 33s + +NAME READY STATUS RESTARTS AGE +pod/grove-operator-699c77979f-7x2zc 1/1 Running 0 51s +pod/simple1-0-pca-pkl2b 1/1 Running 0 33s +pod/simple1-0-pca-s7dz2 1/1 Running 0 33s +pod/simple1-0-pca-wjfqz 1/1 Running 0 33s +pod/simple1-0-pcd-l4vnk 1/1 Running 0 33s +pod/simple1-0-pcd-s7687 1/1 Running 0 33s +pod/simple1-0-sga-0-pcb-m9shj 1/1 Running 0 33s +pod/simple1-0-sga-0-pcb-vnrqw 1/1 Running 0 33s +pod/simple1-0-sga-0-pcc-g8rg8 1/1 Running 0 33s +pod/simple1-0-sga-0-pcc-hx4zn 1/1 Running 0 33s +``` + +**Key observation:** Notice all pods reached `Running` state together - that's gang scheduling! Grove ensured all pods were scheduled before starting any of them. 
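+
+If you want to see the gang itself, you can inspect the `PodGang` that Grove created for this PodCliqueSet replica (a quick sketch using the resource names from the output above):
+
+```bash
+# Show which pod groups and minimum replica counts make up the gang
+kubectl describe pg simple1-0
+```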
+ +## Step 4: Scale a PodCliqueScalingGroup + +Scale the tightly-coupled components (pcb and pcc) as a group: + +```bash +kubectl scale pcsg simple1-0-sga --replicas=2 +``` + +Observe the new resources: +```bash +kubectl get pcs,pclq,pcsg,pg,pod -owide +``` + +**What happened?** +- Grove created a new replica of the scaling group +- Both `pcb` and `pcc` scaled together from 2 to 4 pods each +- A new `PodGang` was created for gang scheduling the new replica +- New `PodCliques` appeared: `simple1-0-sga-1-pcb` and `simple1-0-sga-1-pcc` + +Scale back down: +```bash +kubectl scale pcsg simple1-0-sga --replicas=1 +``` + +## Step 5: Scale the Entire PodCliqueSet + +Scale the entire workload to create a complete second instance: + +```bash +kubectl scale pcs simple1 --replicas=2 +``` + +Check the resources: +```bash +kubectl get pcs,pclq,pcsg,pg,pod -owide +``` + +**What happened?** +- Grove created a complete duplicate of your workload +- All new resources have `-1-` in their names (second replica) +- A new `PodGang` (`simple1-1`) was created +- All components (pca, pcb, pcc, pcd) were duplicated + +Scale back: +```bash +kubectl scale pcs simple1 --replicas=1 +``` + +## Step 6: Understand the Hierarchy + +Let's visualize what you just created: + +``` +PodCliqueSet (simple1) +├── Replica 0 (PodGang: simple1-0) +│ ├── PodClique: pca (3 pods) +│ ├── PodClique: pcd (2 pods) +│ └── PodCliqueScalingGroup: sga +│ ├── PodClique: pcb (2 pods) +│ └── PodClique: pcc (2 pods) +└── Replica 1 (when scaled to 2) + └── [same structure] +``` + +**Scaling behaviors:** +- `kubectl scale pcs simple1`: Creates/removes complete replicas +- `kubectl scale pcsg simple1-0-sga`: Scales just the grouped components (pcb + pcc) +- Auto-scaling (when CPU threshold is met): Automatically scales `pca` or the scaling group + +## Step 7: Clean Up + +Remove the sample workload: + +```bash +kubectl delete -f samples/simple/simple1.yaml +``` + +Verify cleanup: +```bash +kubectl get pcs,pclq,pcsg,pg,pod +``` + +Only the Grove operator pod should remain. + +## What's Next? + +Now that you understand the basics, explore: + +- **[Installation Guide](installation.md)** - Learn about remote cluster deployment +- **[API Reference](api-reference/operator-api.md)** - Deep dive into all configuration options +- **[Samples](../operator/samples/)** - Explore more complex examples +- **Auto-scaling** - Trigger CPU-based auto-scaling by generating load +- **Startup Ordering** - Define dependencies between components + +## Key Concepts Recap + +| Concept | What It Does | When to Use | +|---------|--------------|-------------| +| **PodCliqueSet** | Top-level resource defining your workload | Every Grove deployment | +| **PodClique** | Group of pods with the same role | Each component type in your system | +| **PodCliqueScalingGroup** | Multiple PodCliques that scale together | Tightly-coupled components (prefill+decode) | +| **PodGang** | Ensures all components are co-scheduled | Automatically created by Grove | + +## Troubleshooting + +### Pods stuck in Pending +- Check PodGang status: `kubectl describe pg simple1-0` +- Ensure your cluster has enough resources for gang scheduling + +### Auto-scaling not working +- For kind clusters, install metrics-server: + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml + ``` +- Add `--kubelet-insecure-tls` to metrics-server deployment for kind + +### Need help? 
+- See the full [Troubleshooting Guide](installation.md#troubleshooting) +- Join the [Grove mailing list](https://groups.google.com/g/grove-k8s) +- File an [issue](https://github.com/NVIDIA/grove/issues) diff --git a/docs/user-guide/core-concepts/overview.md b/docs/user-guide/core-concepts/overview.md new file mode 100644 index 00000000..5633771f --- /dev/null +++ b/docs/user-guide/core-concepts/overview.md @@ -0,0 +1,34 @@ +# Grove Core Concepts Tutorial + +This tutorial provides a comprehensive overview of Grove's core concepts: **PodClique**, **PodCliqueSet**, and **PodCliqueScalingGroup**. Through practical examples, you'll learn how to deploy and scale inference workloads from simple single-node setups to complex multi-node distributed systems. Since Grove's creation was motivated by inference the examples are tailored to inference but the core idea is to demonstrate how Grove's primitives allow you to express a collection of single node and multinode components that require tighter coupling from a scheduling (and in future releases network topology) aspect. + +## Prerequisites + +Before starting this tutorial, ensure you have: +- [A Grove demo cluster running.](../../installation.md#developing-grove) Make sure to run `make kind-up FAKE_NODES=40`, set `KUBECONFIG` env variable as directed in the instructions, and run `make deploy` +- [A Kubernetes cluster with Grove installed.](../../installation.md#deploying-grove) If you choose this path make sure to adjust the tolerations in the example to fit your cluster +- A basic understanding of Kubernetes concepts, [this is a good place to start](https://kubernetes.io/docs/tutorials/kubernetes-basics/). + + +## Core Concepts Overview + +### PodClique: The Fundamental Unit +A **PodClique** is the core building block in Grove. It represents a group of pods with the same exact configuration - similar to a ReplicaSet, but with gang termination behavior. It can be used in a standalone manner to represent single-node components (components where each instance fits within one node and can be represented by one pod) of your system, or can represent roles within a multi-node component such as leader and worker. + +### PodCliqueScalingGroup: Multi-Node Coordination +A **PodCliqueScalingGroup** coordinates multiple PodCliques that must scale together, preserving specified replica ratios across roles (e.g. leader/worker) in multi-node components (components where each instance spans multiple pods often on different nodes). + +### PodCliqueSet: The Inference Service Container +A **PodCliqueSet** contains all the components for a complete service. It manages one or more PodCliques or PodCliqueScalingGroups that work together to form a functional system. PodCliqueSet replicas enable system-level scaling use cases such as deploying multiple complete instances of your inference stack (e.g., for canary deployments, A/B testing, or spreading across availability zones for high availability). + +### Understanding Scaling Levels + +Grove provides three levels of scaling to match different operational needs: + +- **Scale PodCliqueSet replicas** (`kubectl scale pcs ...`) - Replicate your entire inference service with all its components. Use this for system-level operations like canary deployments, A/B testing, or spreading across availability zones for high availability. + +- **Scale PodCliqueScalingGroup replicas** (`kubectl scale pcsg ...`) - Add more instances of a multi-node component within your service. 
Use this when you need more capacity of a specific multi-node component (e.g., add another leader+workers unit). + +- **Scale PodClique replicas** (`kubectl scale pclq ...`) - Adjust the number of pods in a specific role. Use this for fine-tuning individual components (e.g., add more workers to an existing leader-worker group). + +In the [next guide](./pcs_and_pclq_intro.md) we go through some examples showcasing PodCliqueSet and PodClique \ No newline at end of file diff --git a/docs/user-guide/core-concepts/pcs_and_pclq_intro.md b/docs/user-guide/core-concepts/pcs_and_pclq_intro.md new file mode 100644 index 00000000..427b4072 --- /dev/null +++ b/docs/user-guide/core-concepts/pcs_and_pclq_intro.md @@ -0,0 +1,217 @@ +# PodCliqueSet and PodClique + +In this guide we go over some hands-on examples showcasing how to use PodCliqueSet and PodClique + +Refer to [Overview](./overview.md) for instructions on how to run the examples in this guide. + +## Example 1: Single-Node Aggregated Inference + +In this simplest scenario, each pod is a complete model instance that can service requests. This is mapped to a single standalone PodClique within the PodCliqueSet. The PodClique provides horizontal scaling capabilities at the model replica level similar to a ReplicaSet (with gang termination behavior), and the PodCliqueSet provides horizontal scaling capabilities at the system level (useful for things such as canary deployments, A/B testing, and spreading across availability zones for high availability). + +```yaml +apiVersion: grove.io/v1alpha1 +kind: PodCliqueSet +metadata: + name: single-node-aggregated + namespace: default +spec: + replicas: 1 + template: + cliques: + - name: model-worker + spec: + replicas: 2 + podSpec: # This is a standard Kubernetes PodSpec + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: model-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Model Worker (Aggregated) on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "1" + memory: "2Gi" +``` + +### **Key Points:** +- Single PodClique named `model-worker` +- `replicas: 2` creates 2 instances for horizontal scaling +- Each replica handles complete inference pipeline +- Tolerations allow scheduling on fake nodes for demo, remove if you are trying to deploy on a real cluster + +### **Deploy:** +```bash +# actual single-node-aggregated.yaml file is in samples/user-guide/concept-overview, change path accordingly +kubectl apply -f [single-node-aggregated.yaml](../../operator/samples/user-guide/concept-overview/single-node-aggregated.yaml) +kubectl get pods -l app.kubernetes.io/part-of=single-node-aggregated -o wide +``` + +If you are using the demo-cluster you should observe output similar to +``` +rohanv@rohanv-mlt operator % kubectl get pods -l app.kubernetes.io/part-of=single-node-aggregated -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +single-node-aggregated-0-model-worker-n9gcq 1/1 Running 0 18m 10.244.7.0 fake-node-007 +single-node-aggregated-0-model-worker-zfhbb 1/1 Running 0 18m 10.244.15.0 fake-node-015 +``` +The demo-cluster consists of fake nodes spawned by [KWOK](https://kwok.sigs.k8s.io/) so the pods won't have any logs, but if you deployed to a real cluster you should observe the echo command complete successfully. Note that the spawned pods have descriptive names. Grove intentionally aims to allow users to immediately be able to map pods to their specifications in the yaml. 
All pods are prefixed with `single-node-aggregated-0` to represent they are part of the first replica of the `single-node-aggregated` PodCliqueSet. After the PodCliqueSet identifier is `model-worker`, signifying that the pods belong to the `model-worker` PodClique. + +### **Scaling** +As mentioned earlier, you can scale the `model-worker` PodClique to get more model replicas similar to a ReplicaSet. For instance run the following command to increase the replicas on `model-worker` from 2 to 4. `pclq` is short for PodClique and can be used to reference PodClique as a resource in kubectl commands. Note that the name of the PodClique provided to the scaling command is `single-node-aggregated-0-model-worker` and not just `model-worker`. This is necessary since the PodCliqueSet can be replicated (as we will see later) and therefore the name of PodCliques includes the PodCliqueSet replica they belong to. +```bash +kubectl scale pclq single-node-aggregated-0-model-worker --replicas=4 +``` +After running you will observe there are now 4 `model-worker` pods belonging to the `single-node-aggregated-0` PodCliqueSet +``` +rohanv@rohanv-mlt operator % kubectl get pods -l app.kubernetes.io/part-of=single-node-aggregated -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +single-node-aggregated-0-model-worker-jvgdd 1/1 Running 0 22s 10.244.11.0 fake-node-011 +single-node-aggregated-0-model-worker-n9gcq 1/1 Running 0 44m 10.244.7.0 fake-node-007 +single-node-aggregated-0-model-worker-tjb78 1/1 Running 0 22s 10.244.8.0 fake-node-008 +single-node-aggregated-0-model-worker-zfhbb 1/1 Running 0 44m 10.244.15.0 fake-node-015 +``` +You can also scale the entire PodCliqueSet. For instance run the following command to increase the replicas on `single-node-aggregated` to 3. `pcs` is short for PodCliqueSet and can be used to reference PodCliqueSet as a resource in kubectl commands. + +```bash +kubectl scale pcs single-node-aggregated --replicas=3 +``` +After running you will observe +``` +rohanv@rohanv-mlt operator % kubectl get pods -l app.kubernetes.io/part-of=single-node-aggregated -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +single-node-aggregated-0-model-worker-2xl58 1/1 Running 0 99s 10.244.7.0 fake-node-007 +single-node-aggregated-0-model-worker-5jlfr 1/1 Running 0 2m7s 10.244.20.0 fake-node-020 +single-node-aggregated-0-model-worker-78974 1/1 Running 0 2m7s 10.244.5.0 fake-node-005 +single-node-aggregated-0-model-worker-zn888 1/1 Running 0 99s 10.244.13.0 fake-node-013 +single-node-aggregated-1-model-worker-kkmsq 1/1 Running 0 74s 10.244.10.0 fake-node-010 +single-node-aggregated-1-model-worker-pn5cm 1/1 Running 0 74s 10.244.15.0 fake-node-015 +single-node-aggregated-2-model-worker-h5xqk 1/1 Running 0 74s 10.244.3.0 fake-node-003 +single-node-aggregated-2-model-worker-p4kjj 1/1 Running 0 74s 10.244.16.0 fake-node-016 +``` + +Note how there are pods belonging to `single-node-aggregated-0`, `single-node-aggregated-1`, and `single-node-aggregated-2`, representing 3 different PodCliqueSets. Also note how `single-node-aggregated-1` and `single-node-aggregated-2` only have two replicas in their `model-worker` PodClique. This is in line with k8s patterns and occurs because the template that was applied (single-node-aggregated.yaml) specified the number of replicas on `model-worker` as 2. To scale them up you would have to apply `kubectl scale pclq` commands like previously done for `single-node-aggregated-0-model-worker` above. 
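+
+For example, following the same naming pattern, scaling up the second replica's clique might look like this (a sketch; adjust the replica count to your needs):
+
+```bash
+kubectl scale pclq single-node-aggregated-1-model-worker --replicas=4
+```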
+ +### Cleanup +To teardown the example delete the `single-node-aggregated` PodCliqueSet, the operator will tear down all the constituent pieces + +```bash +kubectl delete pcs single-node-aggregated +``` + +--- + +## Example 2: Single-Node Disaggregated Inference + +Here we separate prefill and decode operations into different workers, allowing independent scaling of each component. Modelling this in Grove primitives is simple, in the previous example that demonstrated aggregated serving, we had one PodClique for the model-worker, which handled both prefill and decode. To disaggregate prefill and decode, we simply create two PodCliques, one for prefill, and one for decode. Note that the clique names can be set to whatever your want, although we recommend setting them up to match the component they represent (e.g prefill, decode). + +```yaml +apiVersion: grove.io/v1alpha1 +kind: PodCliqueSet +metadata: + name: single-node-disaggregated + namespace: default +spec: + replicas: 1 + template: + cliques: + - name: prefill + spec: + roleName: prefill + replicas: 3 + podSpec: # This is a standard Kubernetes PodSpec + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: prefill + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Prefill Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + - name: decode + spec: + roleName: decode + replicas: 2 + podSpec: # This is a standard Kubernetes PodSpec + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: decode + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Decode Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "1" + memory: "2Gi" +``` + +### **Key Points:** +- Two separate PodCliques: `prefill` and `decode` +- Independent scaling: We start with 3 prefill workers, 2 decode workers and can scale them independently based on workload characteristics +- Different resource requirements for each component are supported (in the example prefill requests 2 cpu and decode only 1) + +### **Deploy** +```bash +# actual single-node-disaggregated.yaml file is in samples/user-guide/concept-overview, change path accordingly +kubectl apply -f [single-node-disaggregated.yaml](../../operator/samples/user-guide/concept-overview/single-node-disaggregated.yaml) +kubectl get pods -l app.kubernetes.io/part-of=single-node-disaggregated -o wide +``` +After running you will observe + +``` +rohanv@rohanv-mlt operator % kubectl get pods -l app.kubernetes.io/part-of=single-node-disaggregated -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +single-node-disaggregated-0-decode-bvl94 1/1 Running 0 29s 10.244.6.0 fake-node-006 +single-node-disaggregated-0-decode-wlqxj 1/1 Running 0 29s 10.244.4.0 fake-node-004 +single-node-disaggregated-0-prefill-tnw26 1/1 Running 0 29s 10.244.17.0 fake-node-017 +single-node-disaggregated-0-prefill-xvvtk 1/1 Running 0 29s 10.244.11.0 fake-node-011 +single-node-disaggregated-0-prefill-zglvn 1/1 Running 0 29s 10.244.9.0 fake-node-009 +``` +Note how within the `single-node-disaggregated-0` PodCliqueSet replica there are pods from the `prefill` PodClique and `decode` PodClique + +### **Scaling** +You can scale the `prefill` and `decode` PodCliques the same way the [`model-worker` PodClique was scaled](#scaling) in the previous example. 
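+
+For instance (a sketch, using the PodClique naming convention described earlier):
+
+```bash
+# Scale the prefill and decode cliques of the first PodCliqueSet replica independently
+kubectl scale pclq single-node-disaggregated-0-prefill --replicas=4
+kubectl scale pclq single-node-disaggregated-0-decode --replicas=3
+```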
+
+Additionally, the `single-node-disaggregated` PodCliqueSet can be scaled the same way the `single-node-aggregated` PodCliqueSet was scaled in the previous example. The example below demonstrates that when a PodCliqueSet is scaled, all constituent PodCliques are replicated, underscoring why scaling a PodCliqueSet should be treated as scaling the entire system (useful for canary deployments, A/B testing, or high availability across zones).
+
+```bash
+kubectl scale pcs single-node-disaggregated --replicas=2
+```
+After running this you will observe
+```
+rohanv@rohanv-mlt operator % kubectl get pods -l app.kubernetes.io/part-of=single-node-disaggregated -o wide
+NAME                                        READY   STATUS    RESTARTS   AGE   IP            NODE            NOMINATED NODE   READINESS GATES
+single-node-disaggregated-0-decode-9fvsj    1/1     Running   0          77s   10.244.13.0   fake-node-013
+single-node-disaggregated-0-decode-xw62b    1/1     Running   0          77s   10.244.18.0   fake-node-018
+single-node-disaggregated-0-prefill-dfss8   1/1     Running   0          77s   10.244.8.0    fake-node-008
+single-node-disaggregated-0-prefill-fgkrc   1/1     Running   0          77s   10.244.14.0   fake-node-014
+single-node-disaggregated-0-prefill-ljnms   1/1     Running   0          77s   10.244.11.0   fake-node-011
+single-node-disaggregated-1-decode-f9tmf    1/1     Running   0          10s   10.244.16.0   fake-node-016
+single-node-disaggregated-1-decode-psd6h    1/1     Running   0          10s   10.244.10.0   fake-node-010
+single-node-disaggregated-1-prefill-2mktc   1/1     Running   0          10s   10.244.7.0    fake-node-007
+single-node-disaggregated-1-prefill-4smsf   1/1     Running   0          10s   10.244.3.0    fake-node-003
+single-node-disaggregated-1-prefill-5n6qv   1/1     Running   0          10s   10.244.12.0   fake-node-012
+```
+Note how there are now two PodCliqueSet replicas, `single-node-disaggregated-0` and `single-node-disaggregated-1`, each with its own `prefill` and `decode` PodCliques that can be scaled.
+
+### Cleanup
+To tear down the example, delete the `single-node-disaggregated` PodCliqueSet; the operator will tear down all the constituent pieces:
+
+```bash
+kubectl delete pcs single-node-disaggregated
+```
+
+In the [next guide](./pcsg_intro.md) we showcase how to use PodCliqueScalingGroup to represent multi-node components.
diff --git a/docs/user-guide/core-concepts/pcsg_intro.md b/docs/user-guide/core-concepts/pcsg_intro.md
new file mode 100644
index 00000000..b910d202
--- /dev/null
+++ b/docs/user-guide/core-concepts/pcsg_intro.md
@@ -0,0 +1,317 @@
+# PodCliqueScalingGroup
+
+In the [previous guide](./pcs_and_pclq_intro.md) we covered hands-on examples of how to use PodCliqueSet and PodClique. In this guide we go over hands-on examples of how to use PodCliqueScalingGroup to represent multi-node components.
+
+Refer to [Overview](./overview.md) for instructions on how to run the examples in this guide.
+
+## Example 3: Multi-Node Aggregated Inference
+
+Now we introduce **PodCliqueScalingGroup** for multi-node deployments, where multiple pods collectively make up a single instance of the application and must scale together.
+These setups are increasingly common for serving large models that do not fit on one node, so one model instance ends up spanning multiple nodes and therefore multiple pods. In these cases, inference frameworks typically follow a leader-worker topology: one leader pod coordinates work for N workers that connect to it.
+Scaling out means replicating the entire unit (1 leader + N workers) to create additional model instances.
+A PodCliqueScalingGroup encodes this by grouping the relevant PodCliques and scaling them in lockstep while preserving the pod ratios.
+The example below shows how to model this leader-worker pattern in Grove:
+
+```yaml
+apiVersion: grove.io/v1alpha1
+kind: PodCliqueSet
+metadata:
+  name: multinode-aggregated
+  namespace: default
+spec:
+  replicas: 1
+  template:
+    cliques:
+    - name: leader
+      spec:
+        roleName: leader
+        replicas: 1
+        podSpec: # This is a standard Kubernetes PodSpec
+          tolerations:
+          - key: fake-node
+            operator: Equal
+            value: "true"
+            effect: NoSchedule
+          containers:
+          - name: model-leader
+            image: nginx:latest
+            command: ["/bin/sh"]
+            args: ["-c", "echo 'Model Leader (Aggregated) on node:' && hostname && sleep 3600"]
+            resources:
+              requests:
+                cpu: "2"
+                memory: "4Gi"
+    - name: worker
+      spec:
+        roleName: worker
+        replicas: 3
+        podSpec: # This is a standard Kubernetes PodSpec
+          tolerations:
+          - key: fake-node
+            operator: Equal
+            value: "true"
+            effect: NoSchedule
+          containers:
+          - name: model-worker
+            image: nginx:latest
+            command: ["/bin/sh"]
+            args: ["-c", "echo 'Model Worker (Aggregated) on node:' && hostname && sleep 3600"]
+            resources:
+              requests:
+                cpu: "4"
+                memory: "8Gi"
+    podCliqueScalingGroups:
+    - name: model-instance
+      cliqueNames: [leader, worker]
+      replicas: 2
+```
+
+### **Key Points:**
+- **PodCliqueScalingGroup** named `model-instance` with `replicas: 2`
+- Creates 2 model instances, each with 1 leader + 3 workers
+- Total pods: 2 × (1 leader + 3 workers) = 8 pods
+- Scaling the group preserves the 1:3 leader-to-worker ratio
+
+### **Deploy:**
+```bash
+kubectl apply -f samples/user-guide/concept-overview/multi-node-aggregated.yaml
+kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
+```
+After running you should observe
+
+```
+rohanv@rohanv-mlt operator % kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
+NAME                                                   READY   STATUS    RESTARTS   AGE   IP            NODE            NOMINATED NODE   READINESS GATES
+multinode-aggregated-0-model-instance-0-leader-zq4j5   1/1     Running   0          11s   10.244.2.0    fake-node-002
+multinode-aggregated-0-model-instance-0-worker-7kcv7   1/1     Running   0          11s   10.244.13.0   fake-node-013
+multinode-aggregated-0-model-instance-0-worker-829k9   1/1     Running   0          11s   10.244.7.0    fake-node-007
+multinode-aggregated-0-model-instance-0-worker-vrmrb   1/1     Running   0          11s   10.244.10.0   fake-node-010
+multinode-aggregated-0-model-instance-1-leader-t8ptp   1/1     Running   0          11s   10.244.6.0    fake-node-006
+multinode-aggregated-0-model-instance-1-worker-bscfv   1/1     Running   0          11s   10.244.4.0    fake-node-004
+multinode-aggregated-0-model-instance-1-worker-sgd6r   1/1     Running   0          11s   10.244.17.0   fake-node-017
+multinode-aggregated-0-model-instance-1-worker-vpkwb   1/1     Running   0          11s   10.244.18.0   fake-node-018
+```
+Note how within the same `multinode-aggregated-0` PodCliqueSet replica there are two replicas of the `model-instance` PodCliqueScalingGroup, `model-instance-0` and `model-instance-1`, each consisting of a `leader` PodClique with one replica and a `worker` PodClique with 3 replicas.
+
+### **Scaling**
+
+As mentioned before, PodCliqueScalingGroups represent "super-pods" where scaling means replicating the pods in constituent PodCliques together while preserving the ratios. To illustrate this, run the following command to scale the replicas of the `model-instance` PodCliqueScalingGroup from two to three. `pcsg` is short for PodCliqueScalingGroup and can be used to reference PodCliqueScalingGroup as a resource in kubectl commands. 
Similar to standalone PodCliques, PodCliqueScalingGroups include the PodCliqueSet name and replica index in their own name to disambiguate from replicas of the same PodCliqueScalingGroup in a different PodCliqueSet replica. This is why the scaling command references `multinode-aggregated-0-model-instance` instead of `model-instance`.
+
+```bash
+kubectl scale pcsg multinode-aggregated-0-model-instance --replicas=3
+```
+After running this command you should observe
+
+```
+rohanv@rohanv-mlt operator % kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
+NAME                                                   READY   STATUS    RESTARTS   AGE   IP            NODE            NOMINATED NODE   READINESS GATES
+multinode-aggregated-0-model-instance-0-leader-zq4j5   1/1     Running   0          68m   10.244.2.0    fake-node-002
+multinode-aggregated-0-model-instance-0-worker-7kcv7   1/1     Running   0          68m   10.244.13.0   fake-node-013
+multinode-aggregated-0-model-instance-0-worker-829k9   1/1     Running   0          68m   10.244.7.0    fake-node-007
+multinode-aggregated-0-model-instance-0-worker-vrmrb   1/1     Running   0          68m   10.244.10.0   fake-node-010
+multinode-aggregated-0-model-instance-1-leader-t8ptp   1/1     Running   0          68m   10.244.6.0    fake-node-006
+multinode-aggregated-0-model-instance-1-worker-bscfv   1/1     Running   0          68m   10.244.4.0    fake-node-004
+multinode-aggregated-0-model-instance-1-worker-sgd6r   1/1     Running   0          68m   10.244.17.0   fake-node-017
+multinode-aggregated-0-model-instance-1-worker-vpkwb   1/1     Running   0          68m   10.244.18.0   fake-node-018
+multinode-aggregated-0-model-instance-2-leader-w5wfm   1/1     Running   0          25s   10.244.19.0   fake-node-019
+multinode-aggregated-0-model-instance-2-worker-59qm9   1/1     Running   0          25s   10.244.14.0   fake-node-014
+multinode-aggregated-0-model-instance-2-worker-9qqnx   1/1     Running   0          25s   10.244.20.0   fake-node-020
+multinode-aggregated-0-model-instance-2-worker-qqnl8   1/1     Running   0          25s   10.244.5.0    fake-node-005
+```
+Note how there is now an additional leader pod (`multinode-aggregated-0-model-instance-2-leader`) and 3 additional worker pods (`multinode-aggregated-0-model-instance-2-worker`). This demonstrates how PodCliqueScalingGroups allow you to create "super-pods" that are a group of pods that scale together.
+
+While you can scale the PodCliqueScalingGroup to replicate the "super-pod" unit, you can still scale the individual PodCliques of a given PodCliqueScalingGroup replica. Before showing an example of that, it is important to explain that the naming format of PodCliques that are in a PodCliqueScalingGroup is different from that of standalone PodCliques. For standalone PodCliques the format is `<podcliqueset-name>-<podcliqueset-replica-index>-<podclique-name>`, whereas for PodCliques that are part of a PodCliqueScalingGroup the format is `<podcliqueset-name>-<podcliqueset-replica-index>-<scalinggroup-name>-<scalinggroup-replica-index>-<podclique-name>`. To illustrate this, run the following command to show the names of the leader and worker PodCliques:
+
+```bash
+kubectl get pclq
+```
+After running this you should observe the following PodCliques, with the naming format in line with what we described above.
+```
+rohanv@rohanv-mlt operator % kubectl get pclq
+NAME                                             AGE
+multinode-aggregated-0-model-instance-0-leader   95m
+multinode-aggregated-0-model-instance-0-worker   95m
+multinode-aggregated-0-model-instance-1-leader   95m
+multinode-aggregated-0-model-instance-1-worker   95m
+multinode-aggregated-0-model-instance-2-leader   27m
+multinode-aggregated-0-model-instance-2-worker   27m
+```
+Now that we know the PodClique names we can scale the replicas of a specific PodClique similar to previous examples. 
Run the following command to increase `multinode-aggregated-0-model-instance-0-worker` from three replicas to four + +```bash +kubectl scale pclq multinode-aggregated-0-model-instance-0-worker --replicas=4 +``` +After running this you will observe: + +``` +rohanv@rohanv-mlt operator % kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +multinode-aggregated-0-model-instance-0-leader-zq4j5 1/1 Running 0 12h 10.244.2.0 fake-node-002 +multinode-aggregated-0-model-instance-0-worker-7kcv7 1/1 Running 0 12h 10.244.13.0 fake-node-013 +multinode-aggregated-0-model-instance-0-worker-829k9 1/1 Running 0 12h 10.244.7.0 fake-node-007 +multinode-aggregated-0-model-instance-0-worker-gjc87 1/1 Running 0 83s 10.244.1.0 fake-node-001 +multinode-aggregated-0-model-instance-0-worker-vrmrb 1/1 Running 0 12h 10.244.10.0 fake-node-010 +multinode-aggregated-0-model-instance-1-leader-t8ptp 1/1 Running 0 12h 10.244.6.0 fake-node-006 +multinode-aggregated-0-model-instance-1-worker-bscfv 1/1 Running 0 12h 10.244.4.0 fake-node-004 +multinode-aggregated-0-model-instance-1-worker-sgd6r 1/1 Running 0 12h 10.244.17.0 fake-node-017 +multinode-aggregated-0-model-instance-1-worker-vpkwb 1/1 Running 0 12h 10.244.18.0 fake-node-018 +multinode-aggregated-0-model-instance-2-leader-w5wfm 1/1 Running 0 11h 10.244.19.0 fake-node-019 +multinode-aggregated-0-model-instance-2-worker-59qm9 1/1 Running 0 11h 10.244.14.0 fake-node-014 +multinode-aggregated-0-model-instance-2-worker-9qqnx 1/1 Running 0 11h 10.244.20.0 fake-node-020 +multinode-aggregated-0-model-instance-2-worker-qqnl8 1/1 Running 0 11h 10.244.5.0 fake-node-005 +``` +Note how there are now four pods belonging to `multinode-aggregated-0-model-instance-0-worker` + +**When to scale what:** +- **Scale the PodCliqueScalingGroup** (`kubectl scale pcsg ...`) when you want to add more complete model instances (e.g., adding a second leader+workers unit for more capacity) +- **Scale individual PodCliques** (`kubectl scale pclq ...`) when you want to adjust the number of pods in a specific role within one instance (e.g., adding more workers to an existing leader-worker group as frameworks support elastic world sizes) + +### Cleanup +To teardown the example delete the `multinode-aggregated` PodCliqueSet, the operator will tear down all the constituent pieces + +```bash +kubectl delete pcs multinode-aggregated +``` + +--- + +## Example 4: Multi-Node Disaggregated Inference + +You can put together all the things we've covered to represent the most complex scenario: multi-node disaggregated serving where both the prefill and decode components are multi-node. We represent this in Grove by creating PodCliqueScalingGroups for both prefill and decode. Additionally each PodCliqueScalingGroup consists of two PodCliques, one for the leader and one for the worker. 
+ +```yaml +apiVersion: grove.io/v1alpha1 +kind: PodCliqueSet +metadata: + name: multinode-disaggregated + namespace: default +spec: + replicas: 1 + template: + cliques: + - name: pleader + spec: + roleName: pleader + replicas: 1 + podSpec: # This is a standard Kubernetes PodSpec + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: prefill-leader + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Prefill Leader on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + - name: pworker + spec: + roleName: pworker + replicas: 4 + podSpec: # This is a standard Kubernetes PodSpec + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: prefill-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Prefill Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "4" + memory: "8Gi" + - name: dleader + spec: + roleName: dleader + replicas: 1 + podSpec: # This is a standard Kubernetes PodSpec + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: decode-leader + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Decode Leader on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "1" + memory: "2Gi" + - name: dworker + spec: + roleName: dworker + replicas: 2 + podSpec: # This is a standard Kubernetes PodSpec + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: decode-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Decode Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + podCliqueScalingGroups: + - name: prefill + cliqueNames: [pleader, pworker] + replicas: 2 + - name: decode + cliqueNames: [dleader, dworker] + replicas: 1 +``` + +### **Key Points:** +- Two independent **PodCliqueScalingGroups**: `prefill` and `decode` +- Each PodCliqueScalingGroup (PCSG) has PodCliques for leader and worker `pleader`,`pworker`,`dleader`,`dworker`. 
PodClique names need to be unique within a PodCliqueSet, which is why we do not name the PodCliques `leader` and `worker` as in the previous example
+- Prefill PCSG: 2 replicas × (1 leader + 4 workers) = 10 pods
+- Decode PCSG: 1 replica × (1 leader + 2 workers) = 3 pods
+- Each PCSG can scale independently based on workload demands
+- Each PCSG can have different resource allocations
+
+### **Deploy**
+```bash
+kubectl apply -f samples/user-guide/concept-overview/multi-node-disaggregated.yaml
+kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
+```
+After running you will observe
+```
+rohanv@rohanv-mlt operator % kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
+NAME                                                READY   STATUS    RESTARTS   AGE   IP            NODE            NOMINATED NODE   READINESS GATES
+multinode-disaggregated-0-decode-0-dleader-khqxf    1/1     Running   0          35s   10.244.19.0   fake-node-019
+multinode-disaggregated-0-decode-0-dworker-6d7cq    1/1     Running   0          35s   10.244.18.0   fake-node-018
+multinode-disaggregated-0-decode-0-dworker-g6ksp    1/1     Running   0          35s   10.244.20.0   fake-node-020
+multinode-disaggregated-0-prefill-0-pleader-f5w5j   1/1     Running   0          35s   10.244.6.0    fake-node-006
+multinode-disaggregated-0-prefill-0-pworker-7spmm   1/1     Running   0          35s   10.244.9.0    fake-node-009
+multinode-disaggregated-0-prefill-0-pworker-jgnkq   1/1     Running   0          35s   10.244.10.0   fake-node-010
+multinode-disaggregated-0-prefill-0-pworker-v49gf   1/1     Running   0          35s   10.244.11.0   fake-node-011
+multinode-disaggregated-0-prefill-0-pworker-xst4z   1/1     Running   0          35s   10.244.2.0    fake-node-002
+multinode-disaggregated-0-prefill-1-pleader-xwf45   1/1     Running   0          35s   10.244.16.0   fake-node-016
+multinode-disaggregated-0-prefill-1-pworker-6jrpz   1/1     Running   0          35s   10.244.15.0   fake-node-015
+multinode-disaggregated-0-prefill-1-pworker-bd5ct   1/1     Running   0          35s   10.244.14.0   fake-node-014
+multinode-disaggregated-0-prefill-1-pworker-fdl7s   1/1     Running   0          35s   10.244.7.0    fake-node-007
+multinode-disaggregated-0-prefill-1-pworker-kpplp   1/1     Running   0          35s   10.244.4.0    fake-node-004
+```
+Note how we have one replica of the decode PodCliqueScalingGroup and two replicas of the prefill PodCliqueScalingGroup. Also note how each prefill replica consists of 5 pods (1 leader + 4 workers) whereas each decode replica consists of 3 pods (1 leader + 2 workers). This independence is critical to disaggregated serving, as you can independently specify and scale the prefill and decode components.
+
+### **Scaling**
+Each of the PodCliqueScalingGroups and PodCliques can be scaled similar to the [previous example](#scaling-3). If you scale a PodCliqueScalingGroup it will replicate all its PodCliques while maintaining the replica ratio between them. If you scale a PodClique it will horizontally scale like a Deployment.
+
+### Cleanup
+To tear down the example, delete the `multinode-disaggregated` PodCliqueSet; the operator will tear down all the constituent pieces:
+
+```bash
+kubectl delete pcs multinode-disaggregated
+```
+In the [next guide](./takeaways.md) we showcase how Grove can represent an arbitrary number of components and summarize the key takeaways.
\ No newline at end of file
diff --git a/docs/user-guide/core-concepts/takeaways.md b/docs/user-guide/core-concepts/takeaways.md
new file mode 100644
index 00000000..168d1588
--- /dev/null
+++ b/docs/user-guide/core-concepts/takeaways.md
@@ -0,0 +1,201 @@
+# Takeaways
+
+Refer to [Overview](./overview.md) for instructions on how to run the examples in this guide. 
+ +## Example 5: Complete Inference Pipeline + +The [previous examples](./pcsg_intro.md) have focused on mapping various inference workloads into Grove primitives, focusing on the model instances. However, the primitives are generic and the point of Grove is to allow the user to represent as many components as they'd like. To illustrate this point we now provide an example where we represent additional components such as a frontend and vision encoder. To add additional components you simply add additional PodCliques and PodCliqueScalingGroups into the PodCliqueSet + +```yaml +apiVersion: grove.io/v1alpha1 +kind: PodCliqueSet +metadata: + name: comp-inf-ppln + namespace: default +spec: + replicas: 1 + template: + cliques: + #single node components + - name: frontend + spec: + roleName: frontend + replicas: 2 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: frontend + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Frontend Service on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "0.5" + memory: "1Gi" + - name: vision-encoder + spec: + roleName: vision-encoder + replicas: 1 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: vision-encoder + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Vision Encoder on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "3" + memory: "6Gi" + # Multi-node components + - name: pleader + spec: + roleName: pleader + replicas: 1 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: prefill-leader + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Prefill Leader on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + - name: pworker + spec: + roleName: pworker + replicas: 4 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: prefill-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Prefill Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "4" + memory: "8Gi" + - name: dleader + spec: + roleName: dleader + replicas: 1 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: decode-leader + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Decode Leader on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "1" + memory: "2Gi" + - name: dworker + spec: + roleName: dworker + replicas: 2 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: decode-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Decode Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + podCliqueScalingGroups: + - name: prefill + cliqueNames: [pleader, pworker] + replicas: 1 + - name: decode-cluster + cliqueNames: [dleader, dworker] + replicas: 1 +``` + +**Architecture Summary:** +- **Single-node components**: Frontend (2 replicas), Vision Encoder (1 replica) +- **Multi-node prefill**: 1 replica × (1 leader + 4 workers) = 5 pods +- **Multi-node decode**: 1 replica × (1 leader + 2 workers) = 3 pods +- **Total**: 11 pods providing a complete inference pipeline + +**Deploy and explore:** +```bash +kubectl apply 
+kubectl get pods -l app.kubernetes.io/part-of=comp-inf-ppln -o wide
+```
+After running, you should see output similar to:
+```
+NAME                                             READY   STATUS    RESTARTS   AGE   IP            NODE
+comp-inf-ppln-0-decode-cluster-0-dleader-wr7r2   1/1     Running   0          51s   10.244.8.0    fake-node-008
+comp-inf-ppln-0-decode-cluster-0-dworker-4nm98   1/1     Running   0          51s   10.244.5.0    fake-node-005
+comp-inf-ppln-0-decode-cluster-0-dworker-wqzb9   1/1     Running   0          51s   10.244.2.0    fake-node-002
+comp-inf-ppln-0-frontend-fxxsg                   1/1     Running   0          51s   10.244.1.0    fake-node-001
+comp-inf-ppln-0-frontend-shp8h                   1/1     Running   0          51s   10.244.20.0   fake-node-020
+comp-inf-ppln-0-prefill-0-pleader-vgz8n          1/1     Running   0          51s   10.244.17.0   fake-node-017
+comp-inf-ppln-0-prefill-0-pworker-95jls          1/1     Running   0          51s   10.244.9.0    fake-node-009
+comp-inf-ppln-0-prefill-0-pworker-k8bck          1/1     Running   0          51s   10.244.4.0    fake-node-004
+comp-inf-ppln-0-prefill-0-pworker-qlsb9          1/1     Running   0          51s   10.244.14.0   fake-node-014
+comp-inf-ppln-0-prefill-0-pworker-wfxdg          1/1     Running   0          51s   10.244.15.0   fake-node-015
+comp-inf-ppln-0-vision-encoder-rwvz5             1/1     Running   0          51s   10.244.7.0    fake-node-007
+```
+
+### Cleanup
+To tear down the example, delete the `comp-inf-ppln` PodCliqueSet; the operator will clean up all of its constituent resources:
+
+```bash
+kubectl delete pcs comp-inf-ppln
+```
+---
+
+## Key Takeaways
+
+Grove primitives give you a declarative way to express every component of your system, letting you stitch together any number of single-node and multi-node components.
+
+### When to Use Each Component
+
+1. **PodClique**:
+   - Standalone (not part of a PodCliqueScalingGroup):
+     - Single-node components that scale independently
+     - Examples: frontend, API gateway, single-node model instances
+   - Within a PodCliqueScalingGroup:
+     - A specific role within a multi-node component, e.g. leader or worker
+
+2. **PodCliqueScalingGroup**:
+   - Multi-node components where one instance spans multiple pods, potentially with different roles per pod (e.g. leader and worker)
+   - Scaling it creates new copies of its constituent PodCliques while maintaining the replica ratio between them
+
+3. **PodCliqueSet**:
+   - The top-level custom resource representing the entire system
+   - Allows replicating the whole system, e.g. for blue-green deployments or availability across zones
+   - Contains a user-specified number of PodCliques and PodCliqueScalingGroups
+
+
diff --git a/operator/Makefile b/operator/Makefile
index 9762d18d..5f01e137 100644
--- a/operator/Makefile
+++ b/operator/Makefile
@@ -85,9 +85,11 @@ cover-html: test-cover
 # Make targets for local development and testing
 # -------------------------------------------------------------
 # Starts a local k8s cluster using kind.
+# Usage: make kind-up [FAKE_NODES=<n>]
+# Example: make kind-up FAKE_NODES=20
 .PHONY: kind-up
 kind-up: $(KIND) $(YQ)
-	@$(MODULE_HACK_DIR)/kind-up.sh
+	@$(MODULE_HACK_DIR)/kind-up.sh $(if $(FAKE_NODES),--fake-nodes $(FAKE_NODES))
 
 # Stops the local k8s cluster.
.PHONY: kind-down diff --git a/operator/hack/kind-up.sh b/operator/hack/kind-up.sh index 5c13be68..d02c0171 100755 --- a/operator/hack/kind-up.sh +++ b/operator/hack/kind-up.sh @@ -28,6 +28,7 @@ CLUSTER_NAME="grove-test-cluster" DEPLOY_REGISTRY=true RECREATE_CLUSTER=false FEATURE_GATES=() +FAKE_NODES=0 USAGE="" function kind::create_usage() { @@ -38,6 +39,7 @@ function kind::create_usage() { -s | --skip-registry Skip creating a local docker registry. Default value is false. -r | --recreate If this flag is specified then it will recreate the cluster if it already exists. -g | --feature-gates Comma separated list of feature gates to enable on the cluster. + -f | --fake-nodes Number of fake nodes to create using KWOK. Default value is 0. ") echo "${usage}" } @@ -77,6 +79,10 @@ function kind::parse_flags() { IFS=',' read -r -a FEATURE_GATES <<< "$1" unset IFS ;; + --fake-nodes | -f) + shift + FAKE_NODES=$1 + ;; -h | --help) shift echo "${USAGE}" @@ -135,7 +141,16 @@ function kind::create_cluster() { mkdir -p "${KIND_CONFIG_DIR}" echo "Creating kind cluster ${CLUSTER_NAME}..." kind::generate_config + + # If KUBECONFIG is not already set (e.g., by the Makefile), set it to our default location + # This ensures kubectl commands target the correct cluster + if [ -z "${KUBECONFIG:-}" ]; then + export KUBECONFIG="${KIND_CONFIG_DIR}/kubeconfig" + echo "Setting KUBECONFIG to ${KUBECONFIG}" + fi + kind create cluster --name "${CLUSTER_NAME}" --config "${KIND_CONFIG_DIR}/cluster-config.yaml" + if [ "${DEPLOY_REGISTRY}" = true ]; then kind::initialize_registry kind::create_local_container_reg_configmap @@ -227,11 +242,158 @@ function kind::delete_container_registry() { fi } +function kind::check_fake_nodes_prerequisites() { + if ! command -v kubectl &> /dev/null; then + echo "kubectl is not installed. Please install kubectl from https://kubernetes.io/docs/tasks/tools/install-kubectl/" + exit 1 + fi + if ! command -v jq &> /dev/null; then + echo "jq is not installed. Please install jq from https://jqlang.org/download" + exit 1 + fi + if ! command -v curl &> /dev/null; then + echo "curl is not installed. Please install curl." + exit 1 + fi +} + +# deploy_kwok deploys KWOK using the instructions at https://kwok.sigs.k8s.io/docs/user/kwok-in-cluster/ +function kind::deploy_kwok() { + local kwok_repo="kubernetes-sigs/kwok" + local kwok_latest_release=$(curl -s "https://api.github.com/repos/${kwok_repo}/releases/latest" | jq -r '.tag_name') + echo "Deploying KWOK ${kwok_latest_release}..." + + # deploy KWOK CRDs and controller + echo " Installing KWOK CRDs and controller..." + kubectl apply -f "https://github.com/${kwok_repo}/releases/download/${kwok_latest_release}/kwok.yaml" > /dev/null + + # setup default custom resources of stages + echo " Setting up KWOK stage configurations..." + kubectl apply -f "https://github.com/${kwok_repo}/releases/download/${kwok_latest_release}/stage-fast.yaml" > /dev/null + + # Wait for KWOK controller to be ready + echo " Waiting for KWOK controller to be ready..." + kubectl wait --for=condition=available --timeout=60s deployment/kwok-controller -n kube-system > /dev/null + + echo "KWOK deployed successfully!" +} + +function kind::create_fake_nodes() { + local node_count=$1 + echo "Creating ${node_count} fake nodes..." 
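+
+  # Each fake node is just a Node object annotated with kwok.x-k8s.io/node=fake;
+  # the KWOK controller adopts it and keeps its heartbeat and conditions fresh.
+  # The fake-node=true:NoSchedule taint below keeps ordinary workloads off these
+  # nodes unless they explicitly tolerate it, as the sample manifests in this PR do.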
+  for ((i=1; i<=node_count; i++)); do
+    local node_name="fake-node-$(printf "%03d" $i)"
+    cat <<EOF | kubectl apply -f - > /dev/null
+apiVersion: v1
+kind: Node
+metadata:
+  annotations:
+    node.alpha.kubernetes.io/ttl: "0"
+    kwok.x-k8s.io/node: fake
+  labels:
+    beta.kubernetes.io/arch: amd64
+    beta.kubernetes.io/os: linux
+    kubernetes.io/arch: amd64
+    kubernetes.io/hostname: ${node_name}
+    kubernetes.io/os: linux
+    kubernetes.io/role: agent
+    node-role.kubernetes.io/agent: ""
+    type: kwok
+  name: ${node_name}
+spec:
+  taints:
+    - effect: NoSchedule
+      key: fake-node
+      value: "true"
+status:
+  allocatable:
+    cpu: "64"
+    ephemeral-storage: 1Ti
+    hugepages-1Gi: "0"
+    hugepages-2Mi: "0"
+    memory: 512Gi
+    pods: "110"
+  capacity:
+    cpu: "64"
+    ephemeral-storage: 1Ti
+    hugepages-1Gi: "0"
+    hugepages-2Mi: "0"
+    memory: 512Gi
+    pods: "110"
+  conditions:
+    - lastHeartbeatTime: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+      lastTransitionTime: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+      message: kubelet is posting ready status
+      reason: KubeletReady
+      status: "True"
+      type: Ready
+    - lastHeartbeatTime: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+      lastTransitionTime: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+      message: kubelet has sufficient memory available
+      reason: KubeletHasSufficientMemory
+      status: "False"
+      type: MemoryPressure
+    - lastHeartbeatTime: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+      lastTransitionTime: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+      message: kubelet has no disk pressure
+      reason: KubeletHasNoDiskPressure
+      status: "False"
+      type: DiskPressure
+    - lastHeartbeatTime: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+      lastTransitionTime: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+      message: kubelet has sufficient PID available
+      reason: KubeletHasSufficientPID
+      status: "False"
+      type: PIDPressure
+    - lastHeartbeatTime: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+      lastTransitionTime: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+      message: network is available
+      reason: RouteCreated
+      status: "False"
+      type: NetworkUnavailable
+  nodeInfo:
+    architecture: amd64
+    bootID: ""
+    containerRuntimeVersion: ""
+    kernelVersion: ""
+    kubeProxyVersion: fake
+    kubeletVersion: fake
+    machineID: ""
+    operatingSystem: linux
+    osImage: ""
+    systemUUID: ""
+  phase: Running
+EOF
+    echo "  Created fake node: ${node_name}"
+  done
+
+  echo "Successfully created ${node_count} fake nodes!"
+  echo ""
+  echo "Fake nodes are tainted with 'fake-node=true:NoSchedule'"
+  echo "To schedule pods on these nodes, add this toleration to your pod specs:"
+  echo "  tolerations:"
+  echo "  - key: fake-node"
+  echo "    operator: Exists"
+  echo "    effect: NoSchedule"
+}
+
+function kind::setup_fake_nodes() {
+  if [ "${FAKE_NODES}" -gt 0 ]; then
+    echo ""
+    echo "Setting up ${FAKE_NODES} fake nodes..."
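+    # Check local tooling (kubectl, jq, curl) first, then install the KWOK
+    # controller into the cluster, and finally register the requested Node objects.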
+ kind::check_fake_nodes_prerequisites + kind::deploy_kwok + kind::create_fake_nodes "${FAKE_NODES}" + fi +} + function main() { kind::check_prerequisites kind::parse_flags "$@" kind::clamp_mss_to_pmtu kind::create_cluster + kind::setup_fake_nodes printf "\n\033[0;33m📌 NOTE: To target the newly created kind cluster, please run the following command:\n\n export KUBECONFIG=${KUBECONFIG}\n\033[0m\n" } diff --git a/operator/samples/user-guide/concept-overview/complete-inference-pipeline.yaml b/operator/samples/user-guide/concept-overview/complete-inference-pipeline.yaml new file mode 100644 index 00000000..4471d14a --- /dev/null +++ b/operator/samples/user-guide/concept-overview/complete-inference-pipeline.yaml @@ -0,0 +1,132 @@ +apiVersion: grove.io/v1alpha1 +kind: PodCliqueSet +metadata: + name: comp-inf-ppln + namespace: default +spec: + replicas: 1 + template: + cliques: + #single node components + - name: frontend + spec: + roleName: frontend + replicas: 2 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: frontend + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Frontend Service on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "0.5" + memory: "1Gi" + - name: vision-encoder + spec: + roleName: vision-encoder + replicas: 1 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: vision-encoder + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Vision Encoder on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "3" + memory: "6Gi" + # Multi-node components + - name: pleader + spec: + roleName: pleader + replicas: 1 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: prefill-leader + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Prefill Leader on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + - name: pworker + spec: + roleName: pworker + replicas: 4 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: prefill-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Prefill Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "4" + memory: "8Gi" + - name: dleader + spec: + roleName: dleader + replicas: 1 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: decode-leader + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Decode Leader on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "1" + memory: "2Gi" + - name: dworker + spec: + roleName: dworker + replicas: 2 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: decode-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Decode Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + podCliqueScalingGroups: + - name: prefill + cliqueNames: [pleader, pworker] + replicas: 1 + - name: decode-cluster + cliqueNames: [dleader, dworker] + replicas: 1 \ No newline at end of file diff --git a/operator/samples/user-guide/concept-overview/multi-node-aggregated.yaml b/operator/samples/user-guide/concept-overview/multi-node-aggregated.yaml new file mode 100644 index 
00000000..7e5854f1 --- /dev/null +++ b/operator/samples/user-guide/concept-overview/multi-node-aggregated.yaml @@ -0,0 +1,51 @@ +apiVersion: grove.io/v1alpha1 +kind: PodCliqueSet +metadata: + name: multinode-aggregated + namespace: default +spec: + replicas: 1 + template: + cliques: + - name: leader + spec: + roleName: leader + replicas: 1 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: model-leader + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Model Leader (Aggregated) on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + - name: worker + spec: + roleName: worker + replicas: 3 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: model-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Model Worker (Aggregated) on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "4" + memory: "8Gi" + podCliqueScalingGroups: + - name: model-instance + cliqueNames: [leader, worker] + replicas: 2 \ No newline at end of file diff --git a/operator/samples/user-guide/concept-overview/multi-node-disaggregated.yaml b/operator/samples/user-guide/concept-overview/multi-node-disaggregated.yaml new file mode 100644 index 00000000..fb8c60d1 --- /dev/null +++ b/operator/samples/user-guide/concept-overview/multi-node-disaggregated.yaml @@ -0,0 +1,92 @@ +apiVersion: grove.io/v1alpha1 +kind: PodCliqueSet +metadata: + name: multinode-disaggregated + namespace: default +spec: + replicas: 1 + template: + cliques: + - name: pleader + spec: + roleName: pleader + replicas: 1 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: prefill-leader + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Prefill Leader on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + - name: pworker + spec: + roleName: pworker + replicas: 4 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: prefill-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Prefill Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "4" + memory: "8Gi" + - name: dleader + spec: + roleName: dleader + replicas: 1 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: decode-leader + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Decode Leader on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "1" + memory: "2Gi" + - name: dworker + spec: + roleName: dworker + replicas: 2 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: decode-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Decode Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + podCliqueScalingGroups: + - name: prefill + cliqueNames: [pleader, pworker] + replicas: 2 + - name: decode + cliqueNames: [dleader, dworker] + replicas: 1 \ No newline at end of file diff --git a/operator/samples/user-guide/concept-overview/single-node-aggregated.yaml b/operator/samples/user-guide/concept-overview/single-node-aggregated.yaml new file mode 100644 index 00000000..291b0795 --- /dev/null +++ 
b/operator/samples/user-guide/concept-overview/single-node-aggregated.yaml @@ -0,0 +1,28 @@ +apiVersion: grove.io/v1alpha1 +kind: PodCliqueSet +metadata: + name: single-node-aggregated + namespace: default +spec: + replicas: 1 + template: + cliques: + - name: model-worker + spec: + roleName: model-worker + replicas: 2 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: model-worker + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Model Worker (Aggregated) on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "1" + memory: "2Gi" \ No newline at end of file diff --git a/operator/samples/user-guide/concept-overview/single-node-disaggregated.yaml b/operator/samples/user-guide/concept-overview/single-node-disaggregated.yaml new file mode 100644 index 00000000..69aae30a --- /dev/null +++ b/operator/samples/user-guide/concept-overview/single-node-disaggregated.yaml @@ -0,0 +1,47 @@ +apiVersion: grove.io/v1alpha1 +kind: PodCliqueSet +metadata: + name: single-node-disaggregated + namespace: default +spec: + replicas: 1 + template: + cliques: + - name: prefill + spec: + roleName: prefill + replicas: 3 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: prefill + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Prefill Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "2" + memory: "4Gi" + - name: decode + spec: + roleName: decode + replicas: 2 + podSpec: + tolerations: + - key: fake-node + operator: Equal + value: "true" + effect: NoSchedule + containers: + - name: decode + image: nginx:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Decode Worker on node:' && hostname && sleep 3600"] + resources: + requests: + cpu: "1" + memory: "2Gi" \ No newline at end of file