91 changes: 49 additions & 42 deletions README.md
@@ -1,74 +1,81 @@
> [!NOTE]
>
> :construction_worker: `This project site is currently under active construction; keep watching for announcements!`

# Grove

Modern AI inference workloads need capabilities that Kubernetes doesn't provide out-of-the-box:

- **Gang scheduling** - Prefill and decode pods must start together or not at all
- **Grouped scaling** - Tightly-coupled components that need to scale as a unit
- **Startup ordering** - Components of a workload that must start in an explicit order
- **Topology-aware placement** - Workloads that need NVLink-connected GPUs shouldn't be scattered across nodes

Grove is a Kubernetes API that provides a single declarative interface for orchestrating any AI inference workload — from simple, single-pod deployments to complex multi-node, disaggregated systems. Grove lets you scale your multi-node inference deployment from a single replica to data center scale, supporting tens of thousands of GPUs. It allows you to describe your whole inference serving system in Kubernetes - e.g. prefill, decode, routing or any other component - as a single custom resource (defined by a CRD). From that one spec, the platform coordinates hierarchical gang scheduling, topology-aware placement, multi-level autoscaling and explicit startup ordering. You get precise control of how the system behaves without stitching together scripts, YAML files, or custom controllers.

**One API. Any inference architecture.**

## Quick Start

Get Grove running in 5 minutes:
[![Go Report Card](https://goreportcard.com/badge/github.com/ai-dynamo/grove/operator)](https://goreportcard.com/report/github.com/NVIDIA/grove/operator)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/grove)](https://github.com/ai-dynamo/grove/releases/latest)
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/GF45xZAX)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/grove)

```bash
# 1. Create a local kind cluster
cd operator && make kind-up

Grove is a Kubernetes API purpose-built for orchestrating AI workloads on GPU clusters. The modern inference landscape spans a wide range of workload types — from traditional single-node deployments where each model instance runs in a single pod, to large-scale disaggregated systems where one model instance may include multiple components such as prefill and decode, each distributed across many pods and nodes. Grove is designed to unify this entire spectrum under a single API, allowing developers to declaratively represent any inference workload by composing as many components as their system requires — whether single-node or multi-node — within one cohesive custom resource.
# 2. Deploy Grove
make deploy

Additionally, as workloads scale in size and complexity, achieving efficient resource utilization and optimal performance depends on capabilities such as all-or-nothing (“gang”) scheduling, topology-aware placement, prescriptive startup ordering, and independent scaling of components. Grove is designed with these needs as first-class citizens — providing native abstractions for expressing scheduling intent, topology constraints, startup dependencies, and per-component scaling behaviors that can be directly interpreted by underlying schedulers.
# 3. Deploy your first workload
kubectl apply -f samples/simple/simple1.yaml

## Core Concepts
# 4. Fetch the resources created by Grove
kubectl get pcs,pclq,pcsg,pg,pod -owide
```

The Grove API consists of a user API and a scheduling API. While the user API (`PodCliqueSet`, `PodClique`, `PodCliqueScalingGroup`) allows users to represent their AI workloads, the scheduling API (`PodGang`) enables scheduler integration to support the network topology-optimized gang-scheduling and auto-scaling requirements of the workload.
**→ [Installation Docs](docs/installation.md)**

| Concept | Description |
|---------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [PodCliqueSet](operator/api/core/v1alpha1/podcliqueset.go) | The top-level Grove object that defines a group of components managed and colocated together. Also supports autoscaling with topology aware spread of PodCliqueSet replicas for availability. |
| [PodClique](operator/api/core/v1alpha1/podclique.go) | A group of pods representing a specific role (e.g., leader, worker, frontend). Each clique has an independent configuration and supports custom scaling logic. |
| [PodCliqueScalingGroup](operator/api/core/v1alpha1/scalinggroup.go) | A set of PodCliques that scale and are scheduled together as a gang. Ideal for tightly coupled roles like prefill leader and worker. |
| [PodGang](scheduler/api/core/v1alpha1/podgang.go) | The scheduler API that defines a unit of gang-scheduling. A PodGang is a collection of groups of similar pods, where each pod group defines a minimum number of replicas guaranteed for gang-scheduling. |
## What Grove Solves

Grove handles the complexities of modern AI inference deployments:

## Key Capabilities
| Your Setup | What Grove Does |
|------------|-----------------|
| **Disaggregated inference** (prefill + decode) | Gang-schedules all components together; scales them independently or as a unit |
| **Multi-model pipelines** | Enforces startup order (router → workers), auto-scales each stage |
| **Multi-node inference** (DeepSeek-R1, Llama 405B) | Packs pods onto NVLink-connected GPUs for optimal network performance |
| **Simple single-pod serving** | Works for this too! One API for any architecture |

- **Declarative composition of Role-Based Pod Groups**
The `PodCliqueSet` API gives users the ability to declaratively compose tightly coupled groups of pods with explicit role-based logic, e.g. disaggregated roles in a model serving stack such as `prefill`, `decode` and `routing`.
- **Flexible Gang Scheduling**
`PodClique`s and `PodCliqueScalingGroup`s allow users to specify flexible gang-scheduling requirements at multiple levels within a `PodCliqueSet` to prevent resource deadlocks.
- **Multi-level Horizontal Auto-Scaling**
Supports pluggable horizontal auto-scaling solutions to scale `PodCliqueSet`, `PodClique` and `PodCliqueScalingGroup` custom resources.
- **Network Topology-Aware Scheduling**
Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability.
- **Custom Startup Dependencies**
Prescribe the order in which the `PodClique`s must start in a declarative specification. Pod startup is decoupled from pod creation or scheduling.
- **Resource-Aware Rolling Updates**
Supports reuse of resource reservations of `Pod`s during updates in order to preserve topology-optimized placement.
**Use Cases:** [Multi-node disaggregated](docs/assets/multinode-disaggregated.excalidraw.png) · [Single-node disaggregated](docs/assets/singlenode-disaggregated.excalidraw.png) · [Agentic pipelines](docs/assets/agentic-pipeline.excalidraw.png) · [Standard serving](docs/assets/singlenode-aggregated.excalidraw.png)

## Example Use Cases
## How It Works

- **Multi-Node, Disaggregated Inference for large models** ***(DeepSeek-R1, Llama-4-Maverick)*** : [Visualization](docs/assets/multinode-disaggregated.excalidraw.png)
- **Single-Node, Disaggregated Inference** : [Visualization](docs/assets/singlenode-disaggregated.excalidraw.png)
- **Agentic Pipeline of Models** : [Visualization](docs/assets/agentic-pipeline.excalidraw.png)
- **Standard Aggregated Single Node or Single GPU Inference** : [Visualization](docs/assets/singlenode-aggregated.excalidraw.png)
Grove introduces four simple concepts:

## Getting Started
| Concept | What It Does |
|---------|--------------|
| **PodCliqueSet** | Your entire workload (e.g., "my-inference-stack") |
| **PodClique** | A component role (e.g., "prefill", "decode", "router") |
| **PodCliqueScalingGroup** | Components that must scale together (e.g., prefill + decode) |
| **PodGang** | Internal scheduler primitive for gang scheduling (you don't touch this) |

You can get started with the Grove operator by following our [installation guide](docs/installation.md).
**→ [API Reference](docs/api-reference/operator-api.md)**
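
To make the composition concrete, here is a rough, purely illustrative sketch of how these concepts could nest in a single manifest. The API group, version, and field names below are assumptions rather than the actual schema; see the API reference above and the `samples/` directory for real specs.

```yaml
# Hypothetical sketch only: field names are illustrative, not the real v1alpha1 schema.
apiVersion: grove.io/v1alpha1        # assumed API group/version
kind: PodCliqueSet
metadata:
  name: my-inference-stack
spec:
  replicas: 1                        # replicas of the whole stack
  template:
    cliques:                         # one PodClique per component role
      - name: router
      - name: prefill
      - name: decode
    scalingGroups:                   # cliques that scale and gang-schedule together
      - name: prefill-decode
        cliques: [prefill, decode]
```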

## Roadmap

### 2025 Priorities

Update: We are aligning our release schedule with [Nvidia Dynamo](https://github.com/ai-dynamo/dynamo) to ensure seamless integration. Once our release cadence (e.g., weekly, monthly) is finalized, it will be reflected here.
> **Note:** We are aligning our release schedule with [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo) to ensure seamless integration. Release dates will be updated once our cadence (e.g., weekly, monthly) is finalized.

**Release v0.1.0** *(ETA: Mid September 2025)*
- Grove v1alpha1 API
- Hierarchical Gang Scheduling and Gang Termination
**Q4 2025**
- Topology-Aware Scheduling
- Multi-Level Horizontal Auto-Scaling
- Startup Ordering
- Rolling Updates

**Release v0.2.0** *(ETA: October 2025)*
- Topology-Aware Scheduling
**Q1 2026**
- Resource-Optimized Rolling Updates

**Release v0.3.0** *(ETA: November 2025)*
- Multi-Node NVLink Auto-Scaling Support

## Contributions
130 changes: 124 additions & 6 deletions docs/installation.md
@@ -11,7 +11,7 @@ You can use the published [Helm `grove-charts` package](https://github.com/ai-dy
helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts:<tag>
```

You could also deploy Grove to your cluster through the provided make targets, by following [installation using make targets](#installation-using-make-targets).
You can also deploy Grove to your cluster through the provided make targets by following [remote cluster setup](#remote-cluster-set-up) and [installation using make targets](#installation-using-make-targets).

## Developing Grove

@@ -23,12 +23,34 @@ All grove operator Make targets are located in [Operator Makefile](../operator/M

If you wish to develop Grove using a local KIND cluster, do the following:

- To set up a KIND cluster with local docker registry run the following command:
- **Navigate to the operator directory:**

```bash
cd operator
```

- **Set up a KIND cluster with local docker registry:**

```bash
make kind-up
```

- **Optional**: To create a KIND cluster with fake nodes for testing at scale, specify the number of fake nodes:

```bash
# Create a cluster with 20 fake nodes
make kind-up FAKE_NODES=20
```

This will automatically install [KWOK](https://kwok.sigs.k8s.io/) (Kubernetes WithOut Kubelet) and create the specified number of fake nodes. These fake nodes are tainted with `fake-node=true:NoSchedule`, so you'll need to add the following toleration to your pod specs to schedule on them:

```yaml
tolerations:
- key: fake-node
operator: Exists
effect: NoSchedule
```
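
As a quick sanity check (the node name below is a placeholder), confirm that the fake nodes registered and carry the expected taint:

```bash
# List nodes, then inspect the taint on one of the KWOK-managed fake nodes
kubectl get nodes
kubectl describe node <fake-node-name> | grep Taints
```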

- Set the `KUBECONFIG` environment variable in your shell session to the path printed at the end of the previous step:

```bash
@@ -57,10 +79,16 @@ If you wish to use your own Kubernetes cluster instead of the KIND cluster, foll

### Installation using make targets

> **Important:** All commands in this section must be run from the `operator/` directory.

```bash
# If you wish to deploy all Grove Operator resources in a custom namespace then set the `NAMESPACE` environment variable
# Navigate to the operator directory (if not already there)
cd operator

# Optional: Deploy to a custom namespace
export NAMESPACE=custom-ns
# if `NAMESPACE` environment variable is set then `make deploy` target will use this namespace to deploy all Grove operator resources

# Deploy Grove operator and all resources
make deploy
```
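
As a quick check that the rollout succeeded (a sketch; the label matches the selector used in the Troubleshooting section below, and the namespace depends on whether you set `NAMESPACE`):

```bash
# The grove-operator pod should reach Running
kubectl get pods -A -l app.kubernetes.io/name=grove-operator
```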

@@ -76,6 +104,8 @@ This make target leverages Grove [Helm](https://helm.sh/) charts and [Skaffold](

## Deploy a `PodCliqueSet`

> **Important:** Ensure you're in the `operator/` directory for the relative path to work.

- Deploy one of the samples present in the [samples](../operator/samples/simple) directory.

```bash
@@ -127,7 +157,7 @@ As specified in the [README.md](../README.md) and the [docs](../docs), there are
- Let's try scaling the `PodCliqueScalingGroup` from 1 to 2 replicas:

```bash
kubectl scale pcsg simple1-0-pcsg --replicas=2
kubectl scale pcsg simple1-0-sga --replicas=2
```

This will create new pods that associate with cliques that belong to this scaling group, and their associated `PodGang`s.
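
You can watch the new pods and their `PodGang`s appear using the same short-name listing from the quick start:

```bash
# List the Grove resources and pods created by the scale-out
kubectl get pcsg,pclq,pg,pod -owide
```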
@@ -176,7 +206,7 @@ As specified in the [README.md](../README.md) and the [docs](../docs), there are
Similarly, the `PodCliqueScalingGroup` can be scaled back in to 1 replica like so:

```bash
kubectl scale pcsg simple1-0-pcsg --replicas=1
kubectl scale pcsg simple1-0-sga --replicas=1
```

- Scaling can also be triggered at the `PodCliqueSet` level, as can be seen here:
@@ -236,6 +266,94 @@ As specified in the [README.md](../README.md) and the [docs](../docs), there are
kubectl scale pcs simple1 --replicas=1
```

## Troubleshooting

### Deployment Issues

#### `make deploy` fails with "No rule to make target 'deploy'"

**Cause:** You're running the command from the wrong directory.

**Solution:** Ensure you're in the `operator/` directory:
```bash
cd operator
make deploy
```

#### `make deploy` fails with "unable to connect to Kubernetes"

**Cause:** The `KUBECONFIG` environment variable is not set correctly.

**Solution:** Export the kubeconfig for your kind cluster:
```bash
kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig
export KUBECONFIG=$(pwd)/hack/kind/kubeconfig
make deploy
```

#### Grove operator pod is in `CrashLoopBackOff`

**Cause:** Varies; the operator logs contain the specific error.

**Solution:**
```bash
kubectl logs -l app.kubernetes.io/name=grove-operator
```

### Runtime Issues

#### Pods stuck in `Pending` state

**Cause:** Gang scheduling requirements might not be met, or there aren't enough resources.

**Solution:**
1. Check PodGang status:
```bash
kubectl get pg -o yaml
```
2. Check if MinAvailable requirements can be satisfied by your cluster resources
3. Check node resources:
```bash
kubectl describe nodes
```
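4. Inspect scheduling events for the pending pods (a generic Kubernetes check, not Grove-specific):
```bash
# FailedScheduling events usually explain why a gang could not be placed
kubectl get events --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
```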

#### `kubectl scale` command fails with "not found"

**Cause:** The resource name might be incorrect.

**Solution:** List the actual resource names first:
```bash
# For PodCliqueScalingGroups
kubectl get pcsg

# For PodCliqueSets
kubectl get pcs
```

Then use the exact name from the output.
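
For example (the name below is a placeholder for whatever `kubectl get pcsg` printed):

```bash
# Scale using the exact resource name reported by the cluster
kubectl scale pcsg <name-from-output> --replicas=2
```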

#### PodCliqueScalingGroup not auto-scaling

**Cause:** The HPA might not have been created, or metrics-server might be missing.

**Solution:**
1. Verify HPA exists:
```bash
kubectl get hpa
```
2. Check if metrics-server is running (required for HPA):
```bash
kubectl get deployment metrics-server -n kube-system
```
3. For kind clusters, you may need to install metrics-server separately.
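One common approach on kind (an upstream manifest plus a flag that tolerates the kubelets' self-signed certificates; not Grove-specific):
```bash
# Install metrics-server and allow insecure kubelet TLS, which kind clusters typically require
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
```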

### Getting Help

If you encounter issues not covered here:
1. Check the [GitHub Issues](https://github.com/NVIDIA/grove/issues) for similar problems
2. Join the [Grove mailing list](https://groups.google.com/g/grove-k8s)
3. Start a [discussion thread](https://github.com/NVIDIA/grove/discussions)

## Supported Schedulers

Currently the following schedulers support gang scheduling of `PodGang`s created by the Grove operator: