91 changes: 49 additions & 42 deletions README.md
@@ -1,74 +1,81 @@
> [!NOTE]
>
> :construction_worker: `This project site is currently under active construction; keep watching for announcements!`

# Grove

Modern AI inference workloads need capabilities that Kubernetes doesn't provide out-of-the-box:

- **Gang scheduling** - Prefill and decode pods must start together or not at all
- **Grouped scaling** - Tightly-coupled components that need to scale as a unit
- **Startup ordering** - Components of a workload that must start in an explicit order
- **Topology-aware placement** - Workloads that need NVLink-connected GPUs shouldn't be scattered across nodes

Grove is a Kubernetes API that provides a single declarative interface for orchestrating any AI inference workload — from simple, single-pod deployments to complex multi-node, disaggregated systems. Grove lets you scale your multi-node inference deployment from a single replica to data center scale, supporting tens of thousands of GPUs. It allows you to describe your whole inference serving system in Kubernetes - e.g. prefill, decode, routing or any other component - as a single custom resource (defined by a CRD). From that one spec, the platform coordinates hierarchical gang scheduling, topology-aware placement, multi-level autoscaling and explicit startup ordering. You get precise control of how the system behaves without stitching together scripts, YAML files, or custom controllers.

**One API. Any inference architecture.**

## Quick Start

Get Grove running in 5 minutes:
[![Go Report Card](https://goreportcard.com/badge/github.com/ai-dynamo/grove/operator)](https://goreportcard.com/report/github.com/NVIDIA/grove/operator)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/grove)](https://github.com/ai-dynamo/grove/releases/latest)
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/GF45xZAX)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/grove)

```bash
# 1. Create a local kind cluster
cd operator && make kind-up

Grove is a Kubernetes API purpose-built for orchestrating AI workloads on GPU clusters. The modern inference landscape spans a wide range of workload types — from traditional single-node deployments where each model instance runs in a single pod, to large-scale disaggregated systems where one model instance may include multiple components such as prefill and decode, each distributed across many pods and nodes. Grove is designed to unify this entire spectrum under a single API, allowing developers to declaratively represent any inference workload by composing as many components as their system requires — whether single-node or multi-node — within one cohesive custom resource.
# 2. Deploy Grove
make deploy

Additionally, as workloads scale in size and complexity, achieving efficient resource utilization and optimal performance depends on capabilities such as all-or-nothing (“gang”) scheduling, topology-aware placement, prescriptive startup ordering, and independent scaling of components. Grove is designed with these needs as first-class citizens — providing native abstractions for expressing scheduling intent, topology constraints, startup dependencies, and per-component scaling behaviors that can be directly interpreted by underlying schedulers.
# 3. Deploy your first workload
kubectl apply -f samples/simple/simple1.yaml

## Core Concepts
# 4. Fetch the resources created by Grove
kubectl get pcs,pclq,pcsg,pg,pod -owide
```

The Grove API consists of a user API and a scheduling API. While the user API (`PodCliqueSet`, `PodClique`, `PodCliqueScalingGroup`) allows users to represent their AI workloads, the scheduling API (`PodGang`) enables scheduler integration to support the network topology-optimized gang-scheduling and auto-scaling requirements of the workload.
**→ [Installation Docs](docs/installation.md)**

| Concept | Description |
|---------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [PodCliqueSet](operator/api/core/v1alpha1/podcliqueset.go) | The top-level Grove object that defines a group of components managed and colocated together. Also supports autoscaling with topology aware spread of PodCliqueSet replicas for availability. |
| [PodClique](operator/api/core/v1alpha1/podclique.go) | A group of pods representing a specific role (e.g., leader, worker, frontend). Each clique has an independent configuration and supports custom scaling logic. |
| [PodCliqueScalingGroup](operator/api/core/v1alpha1/scalinggroup.go) | A set of PodCliques that scale and are scheduled together as a gang. Ideal for tightly coupled roles like prefill leader and worker. |
| [PodGang](scheduler/api/core/v1alpha1/podgang.go) | The scheduler API that defines a unit of gang-scheduling. A PodGang is a collection of groups of similar pods, where each pod group defines a minimum number of replicas guaranteed for gang-scheduling. |
## What Grove Solves

Grove handles the complexities of modern AI inference deployments:

## Key Capabilities
| Your Setup | What Grove Does |
|------------|-----------------|
| **Disaggregated inference** (prefill + decode) | Gang-schedules all components together; scales them independently or as a unit |
| **Multi-model pipelines** | Enforces startup order (router → workers), auto-scales each stage |
| **Multi-node inference** (DeepSeek-R1, Llama 405B) | Packs pods onto NVLink-connected GPUs for optimal network performance |
| **Simple single-pod serving** | Works for this too! One API for any architecture |

- **Declarative composition of Role-Based Pod Groups**
The `PodCliqueSet` API gives users the ability to declaratively compose tightly coupled groups of pods with explicit role-based logic, e.g. disaggregated roles in a model serving stack such as `prefill`, `decode` and `routing`.
- **Flexible Gang Scheduling**
`PodClique`s and `PodCliqueScalingGroup`s allow users to specify flexible gang-scheduling requirements at multiple levels within a `PodCliqueSet` to prevent resource deadlocks.
- **Multi-level Horizontal Auto-Scaling**
Supports pluggable horizontal auto-scaling solutions to scale `PodCliqueSet`, `PodClique` and `PodCliqueScalingGroup` custom resources.
- **Network Topology-Aware Scheduling**
Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability.
- **Custom Startup Dependencies**
Prescribe the order in which the `PodClique`s must start in a declarative specification. Pod startup is decoupled from pod creation or scheduling.
- **Resource-Aware Rolling Updates**
Supports reuse of resource reservations of `Pod`s during updates in order to preserve topology-optimized placement.
**Use Cases:** [Multi-node disaggregated](docs/assets/multinode-disaggregated.excalidraw.png) · [Single-node disaggregated](docs/assets/singlenode-disaggregated.excalidraw.png) · [Agentic pipelines](docs/assets/agentic-pipeline.excalidraw.png) · [Standard serving](docs/assets/singlenode-aggregated.excalidraw.png)

## Example Use Cases
## How It Works

- **Multi-Node, Disaggregated Inference for large models** ***(DeepSeek-R1, Llama-4-Maverick)*** : [Visualization](docs/assets/multinode-disaggregated.excalidraw.png)
- **Single-Node, Disaggregated Inference** : [Visualization](docs/assets/singlenode-disaggregated.excalidraw.png)
- **Agentic Pipeline of Models** : [Visualization](docs/assets/agentic-pipeline.excalidraw.png)
- **Standard Aggregated Single Node or Single GPU Inference** : [Visualization](docs/assets/singlenode-aggregated.excalidraw.png)
Grove introduces four simple concepts:

## Getting Started
| Concept | What It Does |
|---------|--------------|
| **PodCliqueSet** | Your entire workload (e.g., "my-inference-stack") |
| **PodClique** | A component role (e.g., "prefill", "decode", "router") |
| **PodCliqueScalingGroup** | Components that must scale together (e.g., prefill + decode) |
| **PodGang** | Internal scheduler primitive for gang scheduling (you don't touch this) |

You can get started with the Grove operator by following our [installation guide](docs/installation.md).
**→ [API Reference](docs/api-reference/operator-api.md)**
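
To make the composition concrete, here is a rough, purely illustrative sketch of how these concepts could nest in a single manifest. The API group, version, and field names below are assumptions rather than the actual schema; see the API reference above and the `samples/` directory for real specs.

```yaml
# Hypothetical sketch only: field names are illustrative, not the real v1alpha1 schema.
apiVersion: grove.io/v1alpha1        # assumed API group/version
kind: PodCliqueSet
metadata:
  name: my-inference-stack
spec:
  replicas: 1                        # replicas of the whole stack
  template:
    cliques:                         # one PodClique per component role
      - name: router
      - name: prefill
      - name: decode
    scalingGroups:                   # cliques that scale and gang-schedule together
      - name: prefill-decode
        cliques: [prefill, decode]
```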

## Roadmap

### 2025 Priorities

Update: We are aligning our release schedule with [Nvidia Dynamo](https://github.com/ai-dynamo/dynamo) to ensure seamless integration. Once our release cadence (e.g., weekly, monthly) is finalized, it will be reflected here.
> **Note:** We are aligning our release schedule with [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo) to ensure seamless integration. Release dates will be updated once our cadence (e.g., weekly, monthly) is finalized.

**Release v0.1.0** *(ETA: Mid September 2025)*
- Grove v1alpha1 API
- Hierarchical Gang Scheduling and Gang Termination
**Q4 2025**
- Topology-Aware Scheduling
- Multi-Level Horizontal Auto-Scaling
- Startup Ordering
- Rolling Updates

**Release v0.2.0** *(ETA: October 2025)*
- Topology-Aware Scheduling
**Q1 2026**
- Resource-Optimized Rolling Updates

**Release v0.3.0** *(ETA: November 2025)*
- Multi-Node NVLink Auto-Scaling Support

## Contributions
130 changes: 124 additions & 6 deletions docs/installation.md
@@ -11,7 +11,7 @@ You can use the published [Helm `grove-charts` package](https://github.com/ai-dy
helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts:<tag>
```

You could also deploy Grove to your cluster through the provided make targets, by following [installation using make targets](#installation-using-make-targets).
You can also deploy Grove to your cluster through the provided make targets by following [remote cluster setup](#remote-cluster-set-up) and [installation using make targets](#installation-using-make-targets).

## Developing Grove

@@ -23,12 +23,34 @@ All grove operator Make targets are located in [Operator Makefile](../operator/M

If you wish to develop Grove using a local KIND cluster, do the following:

- To set up a KIND cluster with local docker registry run the following command:
- **Navigate to the operator directory:**

```bash
cd operator
```

- **Set up a KIND cluster with local docker registry:**

```bash
make kind-up
```

- **Optional**: To create a KIND cluster with fake nodes for testing at scale, specify the number of fake nodes:

```bash
# Create a cluster with 20 fake nodes
make kind-up FAKE_NODES=20
```

This will automatically install [KWOK](https://kwok.sigs.k8s.io/) (Kubernetes WithOut Kubelet) and create the specified number of fake nodes. These fake nodes are tainted with `fake-node=true:NoSchedule`, so you'll need to add the following toleration to your pod specs to schedule on them:

```yaml
tolerations:
- key: fake-node
operator: Exists
effect: NoSchedule
```
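
As a quick sanity check (the node name below is a placeholder), confirm that the fake nodes registered and carry the expected taint:

```bash
# List nodes, then inspect the taint on one of the KWOK-managed fake nodes
kubectl get nodes
kubectl describe node <fake-node-name> | grep Taints
```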

- Set the `KUBECONFIG` environment variable in your shell session to the path printed at the end of the previous step:

```bash
@@ -57,10 +79,16 @@ If you wish to use your own Kubernetes cluster instead of the KIND cluster, foll

### Installation using make targets

> **Important:** All commands in this section must be run from the `operator/` directory.

```bash
# If you wish to deploy all Grove Operator resources in a custom namespace then set the `NAMESPACE` environment variable
# Navigate to the operator directory (if not already there)
cd operator

# Optional: Deploy to a custom namespace
export NAMESPACE=custom-ns
# if `NAMESPACE` environment variable is set then `make deploy` target will use this namespace to deploy all Grove operator resources

# Deploy Grove operator and all resources
make deploy
```
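
As a quick check that the rollout succeeded (a sketch; the label matches the selector used in the Troubleshooting section below, and the namespace depends on whether you set `NAMESPACE`):

```bash
# The grove-operator pod should reach Running
kubectl get pods -A -l app.kubernetes.io/name=grove-operator
```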

@@ -76,6 +104,8 @@ This make target leverages Grove [Helm](https://helm.sh/) charts and [Skaffold](

## Deploy a `PodCliqueSet`

> **Important:** Ensure you're in the `operator/` directory for the relative path to work.

- Deploy one of the samples present in the [samples](../operator/samples/simple) directory.

```bash
@@ -127,7 +157,7 @@ As specified in the [README.md](../README.md) and the [docs](../docs), there are
- Let's try scaling the `PodCliqueScalingGroup` from 1 to 2 replicas:

```bash
kubectl scale pcsg simple1-0-pcsg --replicas=2
kubectl scale pcsg simple1-0-sga --replicas=2
```

This will create new pods that associate with cliques that belong to this scaling group, and their associated `PodGang`s.
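
You can watch the new pods and their `PodGang`s appear using the same short-name listing from the quick start:

```bash
# List the Grove resources and pods created by the scale-out
kubectl get pcsg,pclq,pg,pod -owide
```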
@@ -176,7 +206,7 @@ As specified in the [README.md](../README.md) and the [docs](../docs), there are
Similarly, the `PodCliqueScalingGroup` can be scaled back in to 1 replica like so:

```bash
kubectl scale pcsg simple1-0-pcsg --replicas=1
kubectl scale pcsg simple1-0-sga --replicas=1
```

- Scaling can also be triggered at the `PodCliqueSet` level, as can be seen here:
@@ -236,6 +266,94 @@ As specified in the [README.md](../README.md) and the [docs](../docs), there are
kubectl scale pcs simple1 --replicas=1
```

## Troubleshooting

### Deployment Issues

#### `make deploy` fails with "No rule to make target 'deploy'"

**Cause:** You're running the command from the wrong directory.

**Solution:** Ensure you're in the `operator/` directory:
```bash
cd operator
make deploy
```

#### `make deploy` fails with "unable to connect to Kubernetes"

**Cause:** The `KUBECONFIG` environment variable is not set correctly.

**Solution:** Export the kubeconfig for your kind cluster:
```bash
kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig
export KUBECONFIG=$(pwd)/hack/kind/kubeconfig
make deploy
```

#### Grove operator pod is in `CrashLoopBackOff`

**Cause:** Varies; the operator logs contain the specific error.

**Solution:**
```bash
kubectl logs -l app.kubernetes.io/name=grove-operator
```

### Runtime Issues

#### Pods stuck in `Pending` state

**Cause:** Gang scheduling requirements might not be met, or there aren't enough resources.

**Solution:**
1. Check PodGang status:
```bash
kubectl get pg -o yaml
```
2. Check if MinAvailable requirements can be satisfied by your cluster resources
3. Check node resources:
```bash
kubectl describe nodes
```
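4. Inspect scheduling events for the pending pods (a generic Kubernetes check, not Grove-specific):
```bash
# FailedScheduling events usually explain why a gang could not be placed
kubectl get events --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
```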

#### `kubectl scale` command fails with "not found"

**Cause:** The resource name might be incorrect.

**Solution:** List the actual resource names first:
```bash
# For PodCliqueScalingGroups
kubectl get pcsg

# For PodCliqueSets
kubectl get pcs
```

Then use the exact name from the output.
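
For example (the name below is a placeholder for whatever `kubectl get pcsg` printed):

```bash
# Scale using the exact resource name reported by the cluster
kubectl scale pcsg <name-from-output> --replicas=2
```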

#### PodCliqueScalingGroup not auto-scaling

**Cause:** The HPA might not have been created, or metrics-server might be missing.

**Solution:**
1. Verify HPA exists:
```bash
kubectl get hpa
```
2. Check if metrics-server is running (required for HPA):
```bash
kubectl get deployment metrics-server -n kube-system
```
3. For kind clusters, you may need to install metrics-server separately.
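One common approach on kind (an upstream manifest plus a flag that tolerates the kubelets' self-signed certificates; not Grove-specific):
```bash
# Install metrics-server and allow insecure kubelet TLS, which kind clusters typically require
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
```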

### Getting Help

If you encounter issues not covered here:
1. Check the [GitHub Issues](https://github.com/NVIDIA/grove/issues) for similar problems
2. Join the [Grove mailing list](https://groups.google.com/g/grove-k8s)
3. Start a [discussion thread](https://github.com/NVIDIA/grove/discussions)

## Supported Schedulers

Currently the following schedulers support gang scheduling of `PodGang`s created by the Grove operator: