Add Core Concepts Tutorial #217

nvrohanv · 2025-10-15T21:35:13Z

Adds a tutorial introducing the core Grove primitives. Examples can be run on a local kind cluster.
Allows make kind-up to create an arbitrary number of fake nodes.

Signed-off-by: Rohan Varma <[email protected]>

athreesh · 2025-10-16T23:50:50Z

docs/installation.md

nitpick on smth that you didn't necessarily add but

"Let's try scaling the PodCliqueScalingGroup from 1 to 2 replicas:
kubectl scale pcsg simple1-0-pcsg --replicas=2"

didn't work for me. I had to run kubectl scale pcsg simple1-0-sga --replicas=2

I also had to cd into /operator before the make targets worked. probably worth adding that step to make it "just work"

Add "Navigate to the operator directory: cd operator" before this step

Or change the command to: cd operator && make kind-up

I think the kind-up script currently has a bug and doesn't create the kubeconfig file properly. I had to manually create it.

would love a gutcheck on this. i had to run:

# Create the kubeconfig file in the expected location
kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig
# Set the KUBECONFIG environment variable (from the operator/ directory)
export KUBECONFIG=$(pwd)/hack/kind/kubeconfig

Also add a note that users need to keep the terminal open or re-export KUBECONFIG in new sessions.

@athreesh was the issue that you were in the root directory? Also, yes, it creates the kubeconfig in the grove directory, so you have to re-export because you have to use that kubeconfig instead of the default one (I think). I wasn't sure if we wanted to mess with the user's default one, so our options are either:

make it be in the default and just tell the user to select the context (@gflarity below was mentioning something about the default kube_config; if it's true that it's added there anyway, then this might be a good option)

just explicitly call out that you have to make sure to set kubeconfig from the operator directory and re-export

Which do you prefer?

Regarding the first two items, I'll add those in.

If you already have a KUBECONFIG env var exported in your session, the kind-up.sh script will use the path as specified in that env var. This path however is still printed out as a part of the script.

Shell sessions with an already set KUBECONFIG are not expected for most people getting started, since they would obviously not want other KUBECONFIG files overwritten. If they happen to have one, the script's output notifies them where the kind cluster's KUBECONFIG is, which is the path they'd exported.

It is also a bad idea to overwrite the default KUBECONFIG at ~/.kube/config.

I'd like to know in what cases the script is going wrong so we can fix it, instead of making the quick start have one more step by including the kind get kubeconfig.... step.
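
To make the behavior described above concrete, here is a minimal sketch of the fallback logic as described in this thread. The helper name `resolve_kubeconfig` is hypothetical and is not taken from kind-up.sh; the default path `hack/kind/kubeconfig` is the one mentioned earlier in this conversation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from kind-up.sh): pick the kubeconfig path to use.
# An already-exported, non-empty KUBECONFIG wins; otherwise fall back to the
# default path passed as the first argument.
resolve_kubeconfig() {
  printf '%s\n' "${KUBECONFIG:-$1}"
}

# Print the export line a new shell session would need, run from operator/.
kubeconfig_path="$(resolve_kubeconfig "$(pwd)/hack/kind/kubeconfig")"
echo "export KUBECONFIG=${kubeconfig_path}"
```

The sketch only prints the export line and never writes to any kubeconfig, so it cannot overwrite ~/.kube/config or any other existing file.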

gflarity

Looks good overall, just a few suggestions around organization mostly. Please take a look and let me know if you have any questions.

docs/user_guide/overview.md

gflarity · 2025-10-17T13:17:50Z

docs/user_guide/overview.md

@@ -0,0 +1,24 @@
+# Grove Core Concepts Tutorial


This is an overview. I'd recommend that Core Concepts and Tutorial get moved into docs/user_guide/pcs_and_pclq_intro.md, as we reference back anyway. I'd also rename that to tutorial.

gflarity · 2025-10-17T13:19:01Z

docs/user_guide/overview.md

+## Prerequisites
+
+Before starting this tutorial, ensure you have:
+- [A Grove demo cluster running.](../installation.md#developing-grove) Make sure to run `make kind-up FAKE_NODES=40`, set `KUBECONFIG` env variable as directed in the instructions, and run `make deploy`


I'd swap the ordering as unless we make a separate quick start guide, the tutorial is where folks will go to get this up and running in a real cluster for their POC. Might as well prioritize that. Just my 0.02.

gflarity · 2025-10-17T13:27:03Z

docs/user_guide/pcs_and_pclq_intro.md

+    - name: model-worker
+      spec:
+        replicas: 2
+        podSpec:


Suggested change

podSpec:

podSpec: # This is a standard Kubernetes PodSpec

gflarity · 2025-10-17T13:32:31Z

docs/user_guide/pcsg_intro.md

@@ -0,0 +1,319 @@
+# PodCliqueScalingGroup


I'd just put these all into a single tutorial file rather than split them up.

I initially had it like that but my only worry was that it was too long so i decided to break it up into the concepts it actually exposes. What are your thoughts on that? I feel like pcs and pclq are one set of concepts and then pcsg is different so splitting them out makes the whole thing digestible.

Let's split across files, like it is being done right now. A single file will be too large to consume.

gflarity · 2025-10-17T13:35:27Z

docs/user_guide/pcsg_intro.md

+              requests:
+                cpu: "4"
+                memory: "8Gi"
+    podCliqueScalingGroups:


After reading through the examples, I think we should call out when you'd increase the PCS replicas vs. when you would increase the PCSG replicas, because this first example seems equivalent?

Sure can add a line about that

gflarity · 2025-10-17T13:47:29Z

docs/user_guide/takeaways.md

@@ -0,0 +1,203 @@
+# Takeaways
+
+Refer to [Overview](./overview.md) for instructions on how to run the examples in this guide.


Again, I think this should go into one big file. I think you can add a TOC with markdown.

gflarity · 2025-10-17T13:52:31Z

operator/hack/kind-up.sh

  echo "Creating kind cluster ${CLUSTER_NAME}..."
  kind::generate_config
+
+  # If KUBECONFIG is not already set (e.g., by the Makefile), set it to our default location


Just FYI, ~/.kube/config is the de facto default when KUBECONFIG is not set. New clusters get added there, which can be good or bad. But you don't absolutely need to have KUBECONFIG set all the time.

Right, the way it was set up, the kind cluster relies on a kubeconfig different from the default. Are you saying that when I create a new cluster it's automatically added to the default config, and we just need to instruct the user to set the context?
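
If the kind cluster's entry does end up in whichever kubeconfig is active, selecting it is a matter of switching context. A small sketch, assuming kind's usual `kind-<cluster-name>` context naming and the `grove-test-cluster` cluster name used in this PR; `kind_context_name` is a hypothetical helper, and the kubectl commands are only echoed, not executed:

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the kubeconfig context name for a kind cluster,
# assuming kind's "kind-<cluster-name>" naming.
kind_context_name() {
  printf 'kind-%s\n' "$1"
}

context="$(kind_context_name grove-test-cluster)"
# Echo the commands instead of running them, so the sketch is safe to paste.
echo "kubectl config get-contexts"
echo "kubectl config use-context ${context}"
```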

gflarity · 2025-10-17T13:57:22Z

Oh, one more thing. I think a quickstart would also be useful (one that doesn't involve the fakes). It's the first thing I look for in a POC.

…badge
- Replace verbose technical description with problem-first approach
- Add "One API. Any inference architecture." tagline for clarity
- Include Quick Start section for immediate value demonstration
- Add "What Grove Solves" table mapping use cases to capabilities
- Simplify "How It Works" section with concise concept table
- Add DeepWiki badge for community Q&A support
- Update roadmap to use Q4 2025/Q1 2026 format

Co-Authored-By: Claude <[email protected]>

renormalize

1/n as I've not gotten a chance to look through the entire PR yet.

README.md

renormalize · 2025-10-23T10:51:59Z

README.md


-## Core Concepts
+# 2. Deploy Grove
+kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig


This is not needed? The output of make kind-up is the following:

❯ make kind-up
...
Creating kind cluster grove-test-cluster...
Generating kind cluster config...
...
You can now use your cluster with:
kubectl cluster-info --context kind-grove-test-cluster
...
📌 NOTE: To target the newly created kind cluster, please run the following command:
export KUBECONFIG=/Users/renormalize/code/grove/operator/hack/kind/kubeconfig

The necessary KUBECONFIG that is to be exported, is printed out as the part of the output of the make target.

It will also always be written to $(pwd)/hack/kind/kubeconfig if they're creating a kind cluster (unless users intentionally set their KUBECONFIG path as something else).

We can therefore get rid of the kind get kube... command here, and only keep the export KUBECONFIG... below.

Suggested change

kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig

renormalize · 2025-10-23T11:00:37Z

docs/installation.md


renormalize · 2025-10-23T17:21:12Z

docs/user_guide/overview.md

@@ -0,0 +1,24 @@
+# Grove Core Concepts Tutorial


Nit: The convention this repository uses for directory names, like other repositories in the Kubernetes ecosystem in general, is hyphens as separators instead of underscores. Can this be changed to docs/user-guide/overview.md?

All directories/files introduced in this PR can be hyphenated instead of underscored.
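
A sketch of what the rename mapping could look like, assuming the paths below (taken from this PR) are the ones affected. `to_hyphenated` is a hypothetical helper, and the script only prints old and new paths so they can be reviewed before anything is moved. Note that it hyphenates every path component, including file names:

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace underscores with hyphens in a path.
to_hyphenated() {
  printf '%s\n' "$1" | tr '_' '-'
}

# Paths taken from this PR; the list is illustrative, not exhaustive.
for path in docs/user_guide/overview.md \
            docs/user_guide/pcs_and_pclq_intro.md \
            docs/user_guide/pcsg_intro.md; do
  echo "${path} -> $(to_hyphenated "${path}")"
done
```

The actual move could then be done with git mv, renaming the directory first and the files inside it afterwards, with links between the docs updated to match.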

docs/user_guide/overview.md

renormalize · 2025-10-23T17:35:30Z

docs/user_guide/overview.md

+A **PodCliqueScalingGroup** coordinates multiple PodCliques that must scale together, preserving specified replica ratios across roles (e.g. leader/worker) in multi-node components.
+
+### PodCliqueSet: The Inference Service Container
+A **PodCliqueSet** contains all the inference components for a complete service. It manages one or more PodCliques or PodCliqueScalingGroups that work together to provide inference capabilities. Can be replicated in order to provide blue-green deployment and spread across availability zones.


Are we sure we want to talk about the "blue-green" deployment here?

Also, a PCS with multiple replicas can be spread across any topology, not just an availability zone.

Are we mentioning this here because Grove does not support killing an entire PCS replica and recreating it in one go at the moment, given that some frameworks are known to have components that don't really play well with each other during upgrades?

I'm not really sure what all we want to mention here as this is an overview only.

renormalize · 2025-10-23T18:04:23Z

docs/user_guide/pcs_and_pclq_intro.md

+
+## Example 1: Single-Node Aggregated Inference
+
+In this simplest scenario, each pod is a complete model instance that can service requests. This is mapped to a single standalone PodClique within the PodCliqueSet. The PodClique provides horizontal scaling capabilities at the model replica level similar to a Deployment, and the PodCliqueSet provides horizontal scaling capabilities at the system level (useful for things such as blue-green deployments and spreading across availability zones).


I wouldn't be very comfortable with "essentially just a Deployment", since we have behavior like gang termination, which would never happen in a Deployment. "similar to a Deployment" is fine, in my opinion.

I understand that this is only an analogy, but I wouldn't go too far with it either. Also, PodCliques are closer to ReplicaSets, than Deployments.

renormalize · 2025-10-23T18:31:23Z

docs/user_guide/pcsg_intro.md

@@ -0,0 +1,319 @@
+# PodCliqueScalingGroup


renormalize · 2025-10-23T18:55:09Z

docs/user_guide/pcsg_intro.md

+```bash
+# actual multi-node-aggregated.yaml file is in samples/user_guide/concept_overview, change path accordingly
+kubectl apply -f [multi-node-aggregated.yaml](../../operator/samples/user_guide/concept_overview/multi-node-aggregated.yaml)
+kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
+```


Suggested change

```bash
# actual multi-node-aggregated.yaml file is in samples/user_guide/concept_overview, change path accordingly
kubectl apply -f [multi-node-aggregated.yaml](../../operator/samples/user_guide/concept_overview/multi-node-aggregated.yaml)
kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
```

```bash
kubectl apply -f samples/user_guide/concept_overview/multi-node-aggregated.yaml
kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
```

renormalize · 2025-10-23T18:59:34Z

docs/user_guide/pcsg_intro.md

+```bash
+# actual multi-node-disaggregated.yaml is under /operator/samples/user_guide/concept_overview. Adjust paths accordingly
+kubectl apply -f [multi-node-disaggregated.yaml](../../operator/samples/user_guide/concept_overview/multi-node-disaggregated.yaml)
+kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
+```


Suggested change

```bash
# actual multi-node-disaggregated.yaml is under /operator/samples/user_guide/concept_overview. Adjust paths accordingly
kubectl apply -f [multi-node-disaggregated.yaml](../../operator/samples/user_guide/concept_overview/multi-node-disaggregated.yaml)
kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
```

```bash
kubectl apply -f samples/user_guide/concept_overview/multi-node-disaggregated.yaml
kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
```

renormalize · 2025-10-23T19:02:07Z

docs/user_guide/takeaways.md

+```bash
+# Actual complete-inference-pipeline.yaml is under /operator/samples/user_guide/concept_overview, adjust path accordingly
+kubectl apply -f [complete-inference-pipeline.yaml](../../operator/samples/user_guide/concept_overview/complete-inference-pipeline.yaml)
+kubectl get pods -l app.kubernetes.io/part-of=comp-inf-ppln -o wide
+```


Suggested change

```bash
# Actual complete-inference-pipeline.yaml is under /operator/samples/user_guide/concept_overview, adjust path accordingly
kubectl apply -f [complete-inference-pipeline.yaml](../../operator/samples/user_guide/concept_overview/complete-inference-pipeline.yaml)
kubectl get pods -l app.kubernetes.io/part-of=comp-inf-ppln -o wide
```

```bash
kubectl apply -f samples/user_guide/concept_overview/complete-inference-pipeline.yaml
kubectl get pods -l app.kubernetes.io/part-of=comp-inf-ppln -o wide
```

renormalize · 2025-10-23T19:07:19Z

operator/hack/kind-up.sh

+  # This ensures kubectl commands target the correct cluster
+  if [ -z "${KUBECONFIG:-}" ]; then
+    export KUBECONFIG="${KIND_CONFIG_DIR}/kubeconfig"
+    echo "Setting KUBECONFIG to ${KUBECONFIG}"
+  fi


Not needed. See

grove/operator/Makefile

Line 30 in 7252d2c

kind-up kind-down deploy deploy-dev deploy-debug undeploy deploy-addons: export KUBECONFIG = $(KUBECONFIG_PATH)

nvrohanv added 2 commits October 15, 2025 14:20

add concept overview doc and demo

2c3bb4f

Signed-off-by: Rohan Varma <[email protected]>

split up core-concepts guide into more readable unit

0a7dd98

Signed-off-by: Rohan Varma <[email protected]>

nvrohanv requested review from sanjaychatterjee and unmarshall as code owners October 15, 2025 21:35

nvrohanv requested a review from athreesh October 15, 2025 21:35

athreesh reviewed Oct 16, 2025

gflarity requested changes Oct 17, 2025

renormalize reviewed Oct 23, 2025

renormalize requested changes Oct 23, 2025

athreesh and others added 5 commits October 24, 2025 11:55

Update docs/user_guide/overview.md

6f35e6f

Co-authored-by: Geoff Flarity <[email protected]> Signed-off-by: Anish <[email protected]>

Update docs/user_guide/overview.md

8e6297c

Co-authored-by: Geoff Flarity <[email protected]> Signed-off-by: Anish <[email protected]>

Update docs/user_guide/overview.md

f26e99b

Co-authored-by: Geoff Flarity <[email protected]> Signed-off-by: Anish <[email protected]>

Update README.md

e5ffd48

Co-authored-by: Saketh Kalaga <[email protected]> Signed-off-by: Anish <[email protected]>

Update README.md

f849cbe

Co-authored-by: Saketh Kalaga <[email protected]> Signed-off-by: Anish <[email protected]>
