@nvrohanv
Contributor

  • Adding tutorial for introducing core Grove Primitives. Examples can be run on local kind cluster
  • Allowing make kind-up to create arbitrary number of fake nodes

Nitpick on something that you didn't necessarily add, but

"Let's try scaling the PodCliqueScalingGroup from 1 to 2 replicas:
kubectl scale pcsg simple1-0-pcsg --replicas=2"

didn't work for me. I had to run kubectl scale pcsg simple1-0-sga --replicas=2
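
The mismatch above suggests the generated PCSG name follows the scaling group name in the manifest rather than a fixed `-pcsg` suffix. A minimal sketch for looking up the real names before scaling, assuming the `pcsg` short name resolves as in the commands quoted above (`list_pcsg_names` is a hypothetical helper, not part of the repository):

```shell
# Hypothetical helper: print the bare names of all PodCliqueScalingGroups
# in the current namespace. `kubectl get -o name` prints <resource>/<name>,
# so everything up to the last slash is stripped.
list_pcsg_names() {
  kubectl get pcsg -o name | sed 's|.*/||'
}

# Usage (against a running cluster):
#   list_pcsg_names
#   kubectl scale pcsg <one-of-the-printed-names> --replicas=2
```

Listing first avoids hard-coding a name in the tutorial that may differ from what the operator actually creates.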

@athreesh athreesh Oct 16, 2025

I also had to cd into the operator directory before the make targets worked. It's probably worth adding that step so the instructions "just work":

  • Add "Navigate to the operator directory: cd operator" before this step
  • Or change the command to: cd operator && make kind-up

I think the kind-up script currently has a bug and doesn't create the kubeconfig file properly. I had to manually create it.

Would love a gut check on this. I had to run:

# Create the kubeconfig file in the expected location
kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig

# Set the KUBECONFIG environment variable (from the operator/ directory)
export KUBECONFIG=$(pwd)/hack/kind/kubeconfig

Also add a note that users need to keep the terminal open or re-export KUBECONFIG in new sessions.
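
A guarded variant of the workaround above, as a minimal sketch: it regenerates the kubeconfig only when the file is missing or empty, so a file that `make kind-up` wrote correctly is left untouched. `ensure_kind_kubeconfig` is a hypothetical helper name; the cluster name and path are the ones quoted in this thread.

```shell
# Hypothetical helper: make sure a kind kubeconfig exists at the given path,
# then print that path so it can be exported.
ensure_kind_kubeconfig() {
  local cluster="$1" path="$2"
  # -s is true only if the file exists and is non-empty
  if [ ! -s "$path" ]; then
    mkdir -p "$(dirname "$path")"
    kind get kubeconfig --name "$cluster" > "$path"
  fi
  printf '%s\n' "$path"
}

# Usage (from the operator/ directory); repeat the export in every new shell:
#   export KUBECONFIG="$(ensure_kind_kubeconfig grove-test-cluster "$(pwd)/hack/kind/kubeconfig")"
```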

Contributor Author

@athreesh was the issue that you were in the root directory? Also, yes: it creates the kubeconfig in the grove directory, so you have to re-export because you have to use that kubeconfig instead of the default one (I think). I wasn't sure if we wanted to mess with the user's default one, so our options are either:

  1. Put it in the default kubeconfig and just tell the user to select the context (@gflarity below was mentioning something about the default kubeconfig; if it's true that the cluster is added there anyway, then this might be a good option)
  2. Explicitly call out that you have to set KUBECONFIG from the operator directory and re-export it

Which do you prefer?

Regarding the first two items, I'll add those in.

Collaborator

If you already have a KUBECONFIG env var exported in your session, the kind-up.sh script will use the path as specified in that env var. This path however is still printed out as a part of the script.

A shell session with KUBECONFIG already set is not expected for most people getting started, since they would not want their other KUBECONFIG files overwritten. If they do have it set, the script's output notifies them where the kind cluster's KUBECONFIG is, which is the path they'd exported.

It is also a bad idea to overwrite the default KUBECONFIG at ~/.kube/config.

I'd like to know in what cases the script is going wrong so we can fix it, instead of making the quick start have one more step by including the kind get kubeconfig.... step.
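
The precedence described in this comment can be sketched in a few lines. This is an illustration of the described behavior, not the actual kind-up.sh code, and `resolve_kubeconfig` is a hypothetical name:

```shell
# Hypothetical helper: an already-exported KUBECONFIG wins; otherwise fall
# back to the default path passed as the first argument.
resolve_kubeconfig() {
  local default_path="$1"
  printf '%s\n' "${KUBECONFIG:-$default_path}"
}

# Usage (from the operator/ directory):
#   resolve_kubeconfig "$(pwd)/hack/kind/kubeconfig"
```

Running this in the failing shell before `make kind-up` would show whether a stale KUBECONFIG export is redirecting where the file gets written.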

Contributor

@gflarity gflarity left a comment

Looks good overall, just a few suggestions around organization mostly. Please take a look and let me know if you have any questions.

@@ -0,0 +1,24 @@
# Grove Core Concepts Tutorial
Contributor

@gflarity gflarity Oct 17, 2025

This is an overview. I'd recommend that Core Concepts and Tutorial get moved into docs/user_guide/pcs_and_pclq_intro.md, as we reference back anyway. I'd also rename that to tutorial.

## Prerequisites

Before starting this tutorial, ensure you have:
- [A Grove demo cluster running.](../installation.md#developing-grove) Make sure to run `make kind-up FAKE_NODES=40`, set `KUBECONFIG` env variable as directed in the instructions, and run `make deploy`
Contributor

I'd swap the ordering: unless we make a separate quick start guide, the tutorial is where folks will go to get this up and running in a real cluster for their POC. Might as well prioritize that. Just my 0.02.

- name: model-worker
spec:
replicas: 2
podSpec:
Contributor

Suggested change
podSpec:
podSpec: # This is a standard Kubernetes PodSpec

@@ -0,0 +1,319 @@
# PodCliqueScalingGroup
Contributor

I'd just put these all into a single tutorial file rather than split them up.

Contributor Author

@nvrohanv nvrohanv Oct 18, 2025

I initially had it like that, but my only worry was that it was too long, so I decided to break it up by the concepts it actually exposes. What are your thoughts on that? I feel like PCS and PCLQ are one set of concepts and PCSG is a different one, so splitting them out makes the whole thing digestible.

Collaborator

Let's split across files, like it is being done right now. A single file will be too large to consume.

requests:
cpu: "4"
memory: "8Gi"
podCliqueScalingGroups:
Contributor

After reading through the examples, I think we should call out when you'd increase the PCS replicas vs. when you would increase the PCSG replicas, because in this first example the two seem equivalent?

Contributor Author

Sure can add a line about that
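
One way to make the PCS vs. PCSG distinction concrete is a pod-count calculation. This is a sketch under an assumption drawn from the concept descriptions in this PR, not verified against the operator: a PCS replica copies everything in the set, while a PCSG replica copies only the cliques inside that scaling group. `total_pods` and its parameters are hypothetical names.

```shell
# Hypothetical helper: expected pod count for one PodCliqueSet.
#   $1 = PCS replicas
#   $2 = pods in standalone PodCliques (per PCS replica)
#   $3 = PCSG replicas
#   $4 = pods in one PCSG replica (sum over its member cliques)
total_pods() {
  local pcs="$1" standalone="$2" pcsg="$3" group_pods="$4"
  echo $(( pcs * (standalone + pcsg * group_pods) ))
}

# With no standalone cliques the two knobs give the same count:
total_pods 1 0 2 2   # prints 4
total_pods 2 0 1 2   # prints 4
# With one standalone pod they diverge:
total_pods 1 1 2 2   # prints 5
total_pods 2 1 1 2   # prints 6
```

Under that assumption, the first example looks equivalent only because the set contains nothing outside the scaling group.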

@@ -0,0 +1,203 @@
# Takeaways

Refer to [Overview](./overview.md) for instructions on how to run the examples in this guide.
Contributor

Again, I think this should go into one big file. I think you can add a TOC with markdown.

echo "Creating kind cluster ${CLUSTER_NAME}..."
kind::generate_config

# If KUBECONFIG is not already set (e.g., by the Makefile), set it to our default location
Contributor

Just FYI, ~/.kube/config is the de facto default when KUBECONFIG is not set. New clusters get added there, which can be good or bad. But you don't absolutely need to have KUBECONFIG set all the time.

Contributor Author

Right, the way it was set up, the kind cluster relies on a kubeconfig different from the default. Are you saying that when I make a new cluster it's automatically added to the default config, and we just need to instruct the user to set the context?
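
A middle ground between the two options, as a minimal sketch: kubectl accepts a colon-separated list of files in KUBECONFIG on Linux and macOS and merges them, so the kind kubeconfig can be placed in front of whatever the user already has without touching ~/.kube/config. `with_kind_kubeconfig` is a hypothetical helper name.

```shell
# Hypothetical helper: build a KUBECONFIG value that lists the kind
# kubeconfig first, followed by the user's existing setting (or the
# default ~/.kube/config when KUBECONFIG is unset).
with_kind_kubeconfig() {
  local kind_cfg="$1"
  printf '%s:%s\n' "$kind_cfg" "${KUBECONFIG:-$HOME/.kube/config}"
}

# Usage (from the operator/ directory):
#   export KUBECONFIG="$(with_kind_kubeconfig "$(pwd)/hack/kind/kubeconfig")"
#   kubectl config use-context kind-grove-test-cluster
```

The context name is the one printed by `make kind-up` elsewhere in this review; the export still has to be repeated in new shell sessions.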

@gflarity
Contributor

Oh, one more thing. I think a quickstart would also be useful (one that doesn't involve the fake nodes). It's the first thing I look for when doing a POC.

…badge

- Replace verbose technical description with problem-first approach
- Add "One API. Any inference architecture." tagline for clarity
- Include Quick Start section for immediate value demonstration
- Add "What Grove Solves" table mapping use cases to capabilities
- Simplify "How It Works" section with concise concept table
- Add DeepWiki badge for community Q&A support
- Update roadmap to use Q4 2025/Q1 2026 format

Co-Authored-By: Claude <[email protected]>
Collaborator

@renormalize renormalize left a comment

1/n as I've not gotten a chance to look through the entire PR yet.


## Core Concepts
# 2. Deploy Grove
kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig
Collaborator

@renormalize renormalize Oct 23, 2025

This is not needed? The output of make kind-up is the following:

❯ make kind-up
...
Creating kind cluster grove-test-cluster...
Generating kind cluster config...
...
You can now use your cluster with:

kubectl cluster-info --context kind-grove-test-cluster

...
📌 NOTE: To target the newly created kind cluster, please run the following command:

 export KUBECONFIG=/Users/renormalize/code/grove/operator/hack/kind/kubeconfig

The KUBECONFIG path that needs to be exported is printed as part of the make target's output.

It will also always be written to $(pwd)/hack/kind/kubeconfig if they're creating a kind cluster (unless users intentionally set their KUBECONFIG path as something else).

We can therefore get rid of the kind get kube... command here, and only keep the export KUBECONFIG... below.

Suggested change
kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig

@@ -0,0 +1,24 @@
# Grove Core Concepts Tutorial
Collaborator

Nit: The convention this repository uses for directories, like other repositories in the Kubernetes ecosystem in general, is hyphens as separators instead of underscores. Can this be changed to docs/user-guide/overview.md?

All directories/files introduced in this PR can be hyphenated instead of underscored.

A **PodCliqueScalingGroup** coordinates multiple PodCliques that must scale together, preserving specified replica ratios across roles (e.g. leader/worker) in multi-node components.

### PodCliqueSet: The Inference Service Container
A **PodCliqueSet** contains all the inference components for a complete service. It manages one or more PodCliques or PodCliqueScalingGroups that work together to provide inference capabilities. Can be replicated in order to provide blue-green deployment and spread across availability zones.
Collaborator

Are we sure we want to talk about the "blue-green" deployment here?

Also, a PCS with multiple replicas can be spread across any topology, not just an availability zone.

Are we mentioning this here because Grove does not support killing an entire PCS replica and recreating it in one go at the moment, given that some frameworks are known to have components that don't really play well with each other during upgrades?

I'm not really sure what all we want to mention here as this is an overview only.


## Example 1: Single-Node Aggregated Inference

In this simplest scenario, each pod is a complete model instance that can service requests. This is mapped to a single standalone PodClique within the PodCliqueSet. The PodClique provides horizontal scaling capabilities at the model replica level similar to a Deployment, and the PodCliqueSet provides horizontal scaling capabilities at the system level (useful for things such as blue-green deployments and spreading across availability zones).
Collaborator

I wouldn't be very comfortable with "essentially just a Deployment", since we have behavior like gang termination, which would never happen in a Deployment. "similar to a Deployment" is fine, in my opinion.

I understand that this is only an analogy, but I wouldn't go too far with it either. Also, PodCliques are closer to ReplicaSets, than Deployments.

Comment on lines +76 to +80
```bash
# actual multi-node-aggregated.yaml file is in samples/user_guide/concept_overview, change path accordingly
kubectl apply -f [multi-node-aggregated.yaml](../../operator/samples/user_guide/concept_overview/multi-node-aggregated.yaml)
kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
```
Collaborator

Suggested change
```bash
# actual multi-node-aggregated.yaml file is in samples/user_guide/concept_overview, change path accordingly
kubectl apply -f [multi-node-aggregated.yaml](../../operator/samples/user_guide/concept_overview/multi-node-aggregated.yaml)
kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
```
```bash
kubectl apply -f samples/user_guide/concept_overview/multi-node-aggregated.yaml
kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
```

Comment on lines +285 to +289
```bash
# actual multi-node-disaggregated.yaml is under /operator/samples/user_guide/concept_overview. Adjust paths accordingly
kubectl apply -f [multi-node-disaggregated.yaml](../../operator/samples/user_guide/concept_overview/multi-node-disaggregated.yaml)
kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
```
Collaborator

Suggested change
```bash
# actual multi-node-disaggregated.yaml is under /operator/samples/user_guide/concept_overview. Adjust paths accordingly
kubectl apply -f [multi-node-disaggregated.yaml](../../operator/samples/user_guide/concept_overview/multi-node-disaggregated.yaml)
kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
```
```bash
kubectl apply -f samples/user_guide/concept_overview/multi-node-disaggregated.yaml
kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
```

Comment on lines +151 to +155
```bash
# Actual complete-inference-pipeline.yaml is under /operator/samples/user_guide/concept_overview, adjust path accordingly
kubectl apply -f [complete-inference-pipeline.yaml](../../operator/samples/user_guide/concept_overview/complete-inference-pipeline.yaml)
kubectl get pods -l app.kubernetes.io/part-of=comp-inf-ppln -o wide
```
Collaborator

Suggested change
```bash
# Actual complete-inference-pipeline.yaml is under /operator/samples/user_guide/concept_overview, adjust path accordingly
kubectl apply -f [complete-inference-pipeline.yaml](../../operator/samples/user_guide/concept_overview/complete-inference-pipeline.yaml)
kubectl get pods -l app.kubernetes.io/part-of=comp-inf-ppln -o wide
```
```bash
kubectl apply -f samples/user_guide/concept_overview/complete-inference-pipeline.yaml
kubectl get pods -l app.kubernetes.io/part-of=comp-inf-ppln -o wide
```

Comment on lines +146 to +150
# This ensures kubectl commands target the correct cluster
if [ -z "${KUBECONFIG:-}" ]; then
  export KUBECONFIG="${KIND_CONFIG_DIR}/kubeconfig"
  echo "Setting KUBECONFIG to ${KUBECONFIG}"
fi
Collaborator

Not needed. See

kind-up kind-down deploy deploy-dev deploy-debug undeploy deploy-addons: export KUBECONFIG = $(KUBECONFIG_PATH)

athreesh and others added 5 commits October 24, 2025 11:55
Co-authored-by: Geoff Flarity <[email protected]>
Signed-off-by: Anish <[email protected]>
Co-authored-by: Geoff Flarity <[email protected]>
Signed-off-by: Anish <[email protected]>
Co-authored-by: Geoff Flarity <[email protected]>
Signed-off-by: Anish <[email protected]>
Co-authored-by: Saketh Kalaga <[email protected]>
Signed-off-by: Anish <[email protected]>
Co-authored-by: Saketh Kalaga <[email protected]>
Signed-off-by: Anish <[email protected]>