Add Core Concepts Tutorial #217
base: main
Conversation
nvrohanv commented on Oct 15, 2025
- Adding a tutorial introducing the core Grove primitives. Examples can be run on a local kind cluster
- Allowing `make kind-up` to create an arbitrary number of fake nodes
Signed-off-by: Rohan Varma <[email protected]>
Nitpick on something you didn't necessarily add, but:

"Let's try scaling the PodCliqueScalingGroup from 1 to 2 replicas:
`kubectl scale pcsg simple1-0-pcsg --replicas=2`"

didn't work for me. I had to run `kubectl scale pcsg simple1-0-sga --replicas=2` instead.
I also had to `cd` into `operator/` before the make targets worked. Probably worth adding that step to make it "just work":
- Add "Navigate to the operator directory: `cd operator`" before this step
- Or change the command to: `cd operator && make kind-up`
I think the kind-up script currently has a bug and doesn't create the kubeconfig file properly; I had to create it manually. Would love a gut check on this. I had to run:

```bash
# Create the kubeconfig file in the expected location
kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig
# Set the KUBECONFIG environment variable (from the operator/ directory)
export KUBECONFIG=$(pwd)/hack/kind/kubeconfig
```

Also add a note that users need to keep the terminal open, or re-export KUBECONFIG in new sessions.
@athreesh was the issue that you were in the root directory? Also, yeah, it creates the kubeconfig in the grove directory, so you have to re-export because you have to use that kubeconfig instead of the default (I think). I wasn't sure if we wanted to mess with the user's default one, so our options are either:
- make it be in the default and just tell the user to select the context (@gflarity below was mentioning something about the default kube_config; if it's true that it's added there anyway, then this might be a good option)
- just explicitly call out that you have to make sure to set KUBECONFIG from the operator directory and re-export

Which do you prefer?

Regarding the first two items, I'll add those in.
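For the second option, the tutorial step could be spelled out roughly like this (a sketch, assuming the reader is in the `operator/` directory and `make kind-up` has already written `hack/kind/kubeconfig`):

```shell
# Run from the operator/ directory after `make kind-up`.
# Point kubectl at the kind cluster's kubeconfig instead of ~/.kube/config.
export KUBECONFIG="$(pwd)/hack/kind/kubeconfig"

# This only affects the current shell; re-run the export in every new session.
echo "kubectl will now use: ${KUBECONFIG}"
```

This avoids touching the user's default `~/.kube/config` at the cost of the re-export requirement.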
If you already have a KUBECONFIG env var exported in your session, the kind-up.sh script will use the path specified in that env var. This path is still printed out as part of the script's output.

Shell sessions with an already set KUBECONFIG are not expected for most people getting started, since they would obviously not want other KUBECONFIG files overwritten. If they happen to have one, the script's output notifies them where the kind cluster's KUBECONFIG is, which is the path they'd exported.

It is also a bad idea to overwrite the default KUBECONFIG at ~/.kube/config.

I'd like to know in what cases the script is going wrong so we can fix it, instead of making the quick start have one more step by including the kind get kubeconfig.... step.
Looks good overall, just a few suggestions around organization mostly. Please take a look and let me know if you have any questions.
```
@@ -0,0 +1,24 @@
# Grove Core Concepts Tutorial
```
This is an overview; I'd recommend the Core Concepts and Tutorial content get moved into docs/user_guide/pcs_and_pclq_intro.md, since we reference back there anyway. I'd also rename that to tutorial.
```
## Prerequisites

Before starting this tutorial, ensure you have:
- [A Grove demo cluster running.](../installation.md#developing-grove) Make sure to run `make kind-up FAKE_NODES=40`, set the `KUBECONFIG` env variable as directed in the instructions, and run `make deploy`
```
I'd swap the ordering as unless we make a separate quick start guide, the tutorial is where folks will go to get this up and running in a real cluster for their POC. Might as well prioritize that. Just my 0.02.
```yaml
- name: model-worker
  spec:
    replicas: 2
    podSpec:
```
Suggested change:

```yaml
podSpec: # This is a standard Kubernetes PodSpec
```
```
@@ -0,0 +1,319 @@
# PodCliqueScalingGroup
```
I'd just put these all into a single tutorial file rather than split them up.
I initially had it like that, but my only worry was that it was too long, so I decided to break it up into the concepts it actually exposes. What are your thoughts on that? I feel like PCS and PCLQ are one set of concepts and PCSG is different, so splitting them out makes the whole thing digestible.
Let's split across files, like it is being done right now. A single file will be too large to consume.
```yaml
requests:
  cpu: "4"
  memory: "8Gi"
podCliqueScalingGroups:
```
After reading through the examples, I think we should call out when you'd increase the PCS replicas vs. when you'd increase the PCSG replicas, because this first example seems equivalent?
Sure, can add a line about that.
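That line could be backed by a fragment along these lines (illustrative only; the group name and field values are placeholders, not taken from the PR's actual samples):

```yaml
spec:
  # PCS replicas: clones the entire service (every clique and scaling group),
  # e.g. for blue-green rollouts or spreading across zones.
  replicas: 1
  template:
    podCliqueScalingGroups:
      # PCSG replicas: scales just this coordinated group of cliques,
      # e.g. to add serving capacity while keeping one copy of everything else.
      - name: decode        # placeholder name
        replicas: 2
```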
```
@@ -0,0 +1,203 @@
# Takeaways

Refer to [Overview](./overview.md) for instructions on how to run the examples in this guide.
```
Again, I think this should go into one big file. I think you can add a TOC with markdown.
```bash
echo "Creating kind cluster ${CLUSTER_NAME}..."
kind::generate_config

# If KUBECONFIG is not already set (e.g., by the Makefile), set it to our default location
```
Just FYI, ~/.kube/config is the de facto default without KUBECONFIG. New clusters get added there, which can be good or bad. But you don't absolutely need to have KUBECONFIG set all the time.
Right, the way it was set up, the kind cluster relies on a kubeconfig different from the default. Are you saying that when I make a new cluster it's auto-added to the default config, and we just need to instruct the user to set the context?
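If that is indeed the behavior, the tutorial step could shrink to a context switch (a sketch, assuming kind was run without KUBECONFIG set so the cluster got merged into the default ~/.kube/config; not verified against the actual kind-up script):

```shell
# With no KUBECONFIG set, kind merges new clusters into ~/.kube/config.
# List the available contexts, then select the kind cluster's context.
kubectl config get-contexts
kubectl config use-context kind-grove-test-cluster
```

That would remove the need to re-export anything in new shell sessions.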
Oh, one more thing. I think a quickstart would also be useful (one that doesn't involve the fakes). It's the first thing I look for in a POC.
…badge
- Replace verbose technical description with problem-first approach
- Add "One API. Any inference architecture." tagline for clarity
- Include Quick Start section for immediate value demonstration
- Add "What Grove Solves" table mapping use cases to capabilities
- Simplify "How It Works" section with concise concept table
- Add DeepWiki badge for community Q&A support
- Update roadmap to use Q4 2025/Q1 2026 format

Co-Authored-By: Claude <[email protected]>
1/n as I've not gotten a chance to look through the entire PR yet.
```
## Core Concepts
```

```bash
# 2. Deploy Grove
kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig
```
This is not needed? The output of `make kind-up` is the following:

```
❯ make kind-up
...
Creating kind cluster grove-test-cluster...
Generating kind cluster config...
...
You can now use your cluster with:

kubectl cluster-info --context kind-grove-test-cluster
...
📌 NOTE: To target the newly created kind cluster, please run the following command:

export KUBECONFIG=/Users/renormalize/code/grove/operator/hack/kind/kubeconfig
```

The necessary KUBECONFIG that is to be exported is printed out as part of the output of the make target. It will also always be written to `$(pwd)/hack/kind/kubeconfig` if they're creating a kind cluster (unless users intentionally set their KUBECONFIG path to something else).

We can therefore get rid of the `kind get kubeconfig...` command here, and only keep the `export KUBECONFIG...` below.
```
@@ -0,0 +1,24 @@
# Grove Core Concepts Tutorial
```
Nit: the convention this repository uses for directories, and other repositories in the Kubernetes ecosystem in general, is hyphens as separators instead of underscores. Can this be changed to docs/user-guide/overview.md?

All directories/files introduced in this PR can be hyphenated instead of underscored.
```
A **PodCliqueScalingGroup** coordinates multiple PodCliques that must scale together, preserving specified replica ratios across roles (e.g. leader/worker) in multi-node components.

### PodCliqueSet: The Inference Service Container
A **PodCliqueSet** contains all the inference components for a complete service. It manages one or more PodCliques or PodCliqueScalingGroups that work together to provide inference capabilities. Can be replicated in order to provide blue-green deployment and spread across availability zones.
```
Are we sure we want to talk about "blue-green" deployment here?

Also, a PCS with multiple replicas can be spread across any topology, not just an availability zone.

Are we mentioning this here because Grove does not currently support killing an entire PCS replica and recreating it in one go, as it is known that some frameworks have components that don't really play well with each other during upgrades?

I'm not really sure what we want to mention here, as this is only an overview.
```
## Example 1: Single-Node Aggregated Inference

In this simplest scenario, each pod is a complete model instance that can service requests. This is mapped to a single standalone PodClique within the PodCliqueSet. The PodClique provides horizontal scaling capabilities at the model replica level similar to a Deployment, and the PodCliqueSet provides horizontal scaling capabilities at the system level (useful for things such as blue-green deployments and spreading across availability zones).
```
I wouldn't be very comfortable with "essentially just a Deployment", since we have behavior like gang termination, which would never happen in a Deployment. "similar to a Deployment" is fine, in my opinion.
I understand that this is only an analogy, but I wouldn't go too far with it either. Also, PodCliques are closer to ReplicaSets, than Deployments.
```bash
# actual multi-node-aggregated.yaml file is in samples/user_guide/concept_overview, change path accordingly
kubectl apply -f [multi-node-aggregated.yaml](../../operator/samples/user_guide/concept_overview/multi-node-aggregated.yaml)
kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
```
Suggested change:

```bash
kubectl apply -f samples/user_guide/concept_overview/multi-node-aggregated.yaml
kubectl get pods -l app.kubernetes.io/part-of=multinode-aggregated -o wide
```
```bash
# actual multi-node-disaggregated.yaml is under /operator/samples/user_guide/concept_overview. Adjust paths accordingly
kubectl apply -f [multi-node-disaggregated.yaml](../../operator/samples/user_guide/concept_overview/multi-node-disaggregated.yaml)
kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
```
Suggested change:

```bash
kubectl apply -f samples/user_guide/concept_overview/multi-node-disaggregated.yaml
kubectl get pods -l app.kubernetes.io/part-of=multinode-disaggregated -o wide
```
```bash
# Actual complete-inference-pipeline.yaml is under /operator/samples/user_guide/concept_overview, adjust path accordingly
kubectl apply -f [complete-inference-pipeline.yaml](../../operator/samples/user_guide/concept_overview/complete-inference-pipeline.yaml)
kubectl get pods -l app.kubernetes.io/part-of=comp-inf-ppln -o wide
```
Suggested change:

```bash
kubectl apply -f samples/user_guide/concept_overview/complete-inference-pipeline.yaml
kubectl get pods -l app.kubernetes.io/part-of=comp-inf-ppln -o wide
```
```bash
# This ensures kubectl commands target the correct cluster
if [ -z "${KUBECONFIG:-}" ]; then
  export KUBECONFIG="${KIND_CONFIG_DIR}/kubeconfig"
  echo "Setting KUBECONFIG to ${KUBECONFIG}"
fi
```
Not needed. See line 30 in 7252d2c:

```make
kind-up kind-down deploy deploy-dev deploy-debug undeploy deploy-addons: export KUBECONFIG = $(KUBECONFIG_PATH)
```
Co-authored-by: Geoff Flarity <[email protected]> Signed-off-by: Anish <[email protected]>
Co-authored-by: Saketh Kalaga <[email protected]> Signed-off-by: Anish <[email protected]>