[WIP] Training: Initial Documentation for Kubeflow Trainer V2 #3958
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: astefanutti, kannon92, shravan-achar, vsoch, kubeflow/wg-training-leads, kubeflow/release-team, akshaychitneni, seanlaii, varshaprasad96, saileshd1402. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
{{% alert title="Old Version" color="warning" %}}
This page is about **Kubeflow Training V1**, please see the [V2 documentation](/docs/components/training) for the latest information.

Please follow [this guide for migrating to Kubeflow Training V2](/docs/components/training/admin-guides/migration)
{{% /alert %}}
Let me know if that message looks good to you @kubeflow/wg-training-leads @rimolive @varodrig @hbelmiro @StefanoFioravanzo.
If yes, I will add it to all Kubeflow Training V1 docs, similar to KFP: https://www.kubeflow.org/docs/components/pipelines/legacy-v1/overview/quickstart/
@andreyvelich Great contributions! I left some initial comments for you.
- Kubernetes >= 1.27
- `kubectl` >= 1.27
AFAIK, we've removed our support for 1.27: https://github.com/kubeflow/training-operator/blob/1dfa40c12516fc9eb2ce12c5ef52da7d46670457/.github/workflows/unittests.yaml#L21
Good point, let me update it.
```

## Installing the Kubeflow Training Runtimes
I think it would be better if we could provide a standalone installation guide :)
Actually, I am planning to refactor our manifests given that the Cluster Training Runtime needs to be installed after the manager.
I will soon submit a PR.
I would guess installing the cluster training runtime requires the CRDs to be installed first, more than the manager to be deployed. Projects tend to separate the steps to install CRDs first, then the rest of the manifests.
Actually, we also need to install the manager before we deploy the Cluster Training Runtime (CTR), since we perform validation and mutation via webhooks.
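The ordering constraints raised in this thread can be sketched as a small dependency graph. This is an illustration only: the three step labels are hypothetical and do not correspond to actual manifest paths, and the edge from the manager to the CRDs is an assumption based on the common convention mentioned above rather than something the maintainers stated.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be installed before it.
# The labels are illustrative only; they are not manifest or overlay names.
install_dependencies = {
    "crds": set(),
    # Assumption: the manager is deployed after the CRDs exist.
    "manager": {"crds"},
    # Runtimes need the CRDs (their kinds) and the manager (its webhooks).
    "cluster-training-runtimes": {"crds", "manager"},
}

install_order = list(TopologicalSorter(install_dependencies).static_order())
print(install_order)
```

Under these assumptions there is only one valid order, which is why the runtimes cannot be applied in the same step as the manager.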
- Kubernetes >= 1.27
- `kubectl` >= 1.27
- Python >= 3.7
Same as above
Fix links Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
content/en/docs/components/training/admin-guides/installation.md
TODO (andreyvelich): Change the link once V1 is removed.

```bash
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/v2/overlays/manager?ref=master"
```
Don't most people copy-paste a full URL that starts with https?
Actually, this is how you can use a remote URL with kubectl and kustomize (e.g. they accept the SSH URL: `github.com:kubeflow/training-operator.git`).
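For readers unfamiliar with this form, the target in the command under review is a repository location, a directory inside the repository, and a `?ref=` query selecting a branch or tag. A minimal sketch in Python that assembles such a string; the helper name `kustomize_remote_target` is hypothetical and only illustrates the layout, it is not part of kubectl, kustomize, or the Kubeflow SDK:

```python
def kustomize_remote_target(repo: str, path: str, ref: str) -> str:
    """Build a remote target string shaped like the one in the command above.

    Hypothetical helper for illustration; it only concatenates strings.
    """
    repo = repo.rstrip("/")
    path = path.strip("/")
    return f"{repo}/{path}?ref={ref}"


target = kustomize_remote_target(
    repo="github.com/kubeflow/training-operator.git",
    path="manifests/v2/overlays/manager",
    ref="master",
)
print(f'kubectl apply --server-side -k "{target}"')
```

Pinning `ref` to a release tag instead of `master` would make the documented command reproducible, which may matter once V1 is removed.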
## Prerequisites

These are the minimal requirements to install the Training Operator:
Ensure that you have access to a Kubernetes clusters with the Kubeflow Training
Suggested change:
- Ensure that you have access to a Kubernetes clusters with the Kubeflow Training
+ Ensure that you have access to a Kubernetes cluster with the Kubeflow Training
[the installation guide](/docs/components/training/admin-guides/installation) to quickly deploy
Kubeflow Training on a local Kind cluster.
Suggested change:
- [the installation guide](/docs/components/training/admin-guides/installation) to quickly deploy
- Kubeflow Training on a local Kind cluster.
+ [the installation guide](/docs/components/training/admin-guides/installation) to deploy
+ Kubeflow Training.
No reason it needs to be kind, and if there are webhooks it won't be that quick :P
I intentionally added this message to convey that it is super easy to deploy Kubernetes locally and quickly try Kubeflow Training. I don't want to scare our ML users with the "Kubernetes" dependency.
@kubeflow/wg-training-leads @vsoch @astefanutti @franciscojavierarceo @StefanoFioravanzo Any thoughts on how we can phrase it better in the docs?
I agree with @vsoch's suggestion. There doesn't seem to be a need to be too specific here: "quickly deploy Kubeflow Trainer" alone makes it even less scary and does not exclude non-Kind options.
How do you think we can make this message better, especially for those users who don't know what Kubernetes is?
## Installing the Training Operator
You can chose between installing the latest stable release of the development version from
Suggested change:
- You can chose between installing the latest stable release of the development version from
+ You can chose between installing the latest stable release or the development version from
PyTorch, TensorFlow, XGBoost, JAX, and others.
## What is the Kubeflow Training

The Kubeflow Training is a Kubernetes-native project for large language models (LLMs) fine-tuning
Suggested change:
- The Kubeflow Training is a Kubernetes-native project for large language models (LLMs) fine-tuning
+ The Kubeflow Training project is a Kubernetes-native project for large language models (LLMs) fine-tuning
with the Training Operator to orchestrate their ML training on Kubernetes.
with the Kubeflow Training to orchestrate their ML training on Kubernetes.

The Kubeflow Training allows you effortlessly develop your LLMs with the Kubeflow Python SDK and
Ditto (and above this)
supports running Message Passing Interface (MPI) on Kubernetes which is heavily used for HPC.
The Training Operator implements the V1 API version of MPI Operator. For the MPI Operator V2 version,
please follow [this guide](/docs/components/training/user-guides/mpi/) to install MPI Operator V2.
The Kubeflow Training is designed for two primary user personas, each with specific resources and
Ditto
various distributed training strategies for different ML frameworks.
### User Personas

The Kubeflow Training documentation is separated between these user personas:
Ditto
## Why use the Kubeflow Training

The Kubeflow Training supports key phases on the AI/ML lifecycle, including model training and LLMs
Ditto
Signed-off-by: Andrey Velichkevich <[email protected]>
434733e to c8d5eff
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
content/en/_redirects
/docs/components/training/user-guides/prometheus /docs/components/training/legacy-v1/user-guides/prometheus
/docs/components/training/user-guides/tensorflow /docs/components/training/legacy-v1/user-guides/tensorflow
/docs/components/training/user-guides/xgboost /docs/components/training/legacy-v1/user-guides/xgboost
We discussed with @thesuperzapper how to structure the new docs, and he suggested that we put the new user guides under `/docs/components/training/user-guides-v2`.
For example: https://deploy-preview-3958--competent-brattain-de2d6d.netlify.app/docs/components/training/user-guides-v2/pytorch/
That will allow us to redirect the V1 docs (e.g. TFJob) to the correct location.
What do you think about it, @kubeflow/wg-training-leads?
@andreyvelich make sure you also redirect the "index" pages, e.g.:
- `/docs/components/training/explanation/` -> `/docs/components/training/legacy-v1/explanation/`
- `/docs/components/training/user-guides/` -> `/docs/components/training/legacy-v1/user-guides/`
- `/docs/components/training/reference/` -> `/docs/components/training/legacy-v1/reference/`

Also, I am not sure if it matters, but all other redirects use a trailing `/` slash.
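Both points in this comment (covering the index pages and keeping trailing slashes consistent) are easy to get wrong when the lines are written by hand. A minimal sketch of generating the `_redirects` lines instead, assuming the "source target" layout shown in the diff above; the helper `legacy_redirect` and its parameters are hypothetical and not part of the website tooling:

```python
def legacy_redirect(
    old_path: str,
    section_root: str = "/docs/components/training",
    legacy_prefix: str = "legacy-v1",
) -> str:
    """Return one `_redirects` line mapping a V1 page to its legacy location.

    Both paths are normalized to end with a trailing slash.
    """
    root = section_root.rstrip("/")
    if not old_path.startswith(root + "/"):
        raise ValueError(f"{old_path!r} is not under {root!r}")
    # Part of the path after the section root, e.g. "user-guides/xgboost".
    rest = old_path[len(root):].strip("/")
    source = f"{root}/{rest}/"
    target = f"{root}/{legacy_prefix}/{rest}/"
    return f"{source} {target}"


pages = [
    "/docs/components/training/explanation/",
    "/docs/components/training/user-guides/",
    "/docs/components/training/user-guides/xgboost",
    "/docs/components/training/reference/",
]
redirect_lines = [legacy_redirect(page) for page in pages]
print("\n".join(redirect_lines))
```

Listing index pages and leaf pages in the same input keeps them from drifting apart, and the normalization adds the trailing slash regardless of how the input was written.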
Good catch, let me fix that.
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
title = "Training Operator"
description = "Documentation for Kubeflow Training Operator"
weight = 70
title = "Kubeflow Training"
What do you think about calling it "Kubeflow Trainer"? 🤔
Hmm, naming this project Kubeflow Trainer/KFTrainer could be one of the options.
@franciscojavierarceo Can you please explain why you prefer this project name over KFTraining?
It would follow HuggingFace's and Lightning.ai's "trainer" convention: https://huggingface.co/docs/transformers/en/main_classes/trainer
https://lightning.ai/docs/pytorch/stable/common/trainer.html
So we have these:

Project Name: `Kubeflow Trainer`

CRD Names: `TrainJob`, `TrainingRuntime`

SDK APIs:

    from kubeflow.trainer import TrainerClient, Trainer

    TrainerClient().train(
        Trainer(
            func=train_func,
        ),
        num_nodes=5,
        runtime_ref="torch-distributed"
    )

    # For LLMs
    from kubeflow.trainer import TrainerClient, Trainer, FineTuningConfig, LoraConfig

    TrainerClient().train(
        Trainer(
            fine_tuning_config=FineTuningConfig(
                peft_config=LoraConfig(
                    r=4,
                )
            ),
        ),
        num_nodes=5,
        runtime_ref="llama-3.2-8b",
    )

What do we think about these names?
cc @astefanutti @kubeflow/wg-training-leads @Electronic-Waste @deepanker13 @saileshd1402 @seanlaii @kannon92
@andreyvelich I don't like that Kubeflow platform combines the products together? Tell me more! I love the idea of bringing components together but also love the idea of installing them separately. If I remember correctly, there is someone dying on that hill :-). That being said, we still don't have conformance, and we still don't have a unified definition of Kubeflow, so it might be a bit early to rename anything tbh. It feels like procrastination chores when we've got big fish to fry here.
Also,
> For example, probably we don't want to rename Kubeflow Pipelines to Kubeflow Pipelines Service.

Why not? What are you optimizing for and based on what demand? Have we opened this up to the greater community? Who are our "customers" so to speak and how would they want APIs/Components labeled? This seems like a big decision to make in a vacuum and thought leadership is service (Kelsey has for sure played this game well). What if we opened it up to the greater community? Posted some options in polls via the socials we as a community publish? We can then be more data driven. I bet @StefanoFioravanzo has some opinions!
One concern to me is that the training operator is not limited to the training use case but can be expanded to parallel computing (via MPIJob). Similarly, `PyTorchJob` is not limited to a PyTorch trainer but is more like PyTorch distributed, which offers primitives and abstractions for parallelism, sharding, and communications.
> @andreyvelich I don't like that Kubeflow platform combines the products together?

My points don't imply that Kubeflow products cannot function as standalone applications. However, for users interested in seeing how these individual open-source projects integrate seamlessly, the Kubeflow Platform provides a comprehensive, end-to-end machine learning experience.

> Why not? What are you optimizing for and based on what demand? Have we opened this up to the greater community?

I believe it will take our community significantly more time to discuss this thoroughly (approximately 1–2 years given the user base of Kubeflow Pipelines).
In the meantime, we should focus on releasing our current project.

> One concern to me is that training operator is not limited to training use case but can be expanded to parallel computing (via MPIJob). Similarly, PyTorchJob is not limited to PyTorch trainer but more like PyTorch distributed which offers primitives to and abstractions for parallelism, sharding, and communications.

That is correct, but right now it is out of scope of the supported CRDs (TrainJob and TrainingRuntime). Theoretically, users can leverage TrainingRuntimes and TrainJob for distributed inference with MPI, but it would be better if we create dedicated CRDs for it. Also, our `kubeflow` Python SDK doesn't support it.

All in all, naming this project `Kubeflow Trainer` will help avoid user confusion, considering the following user journey:
- Cluster Operators install the Kubeflow Trainer controller manager into the Kubernetes cluster.
- Cluster Operators configure the required Training Runtimes for ML users.
- ML Users use the Kubeflow Python SDK to create TrainJob objects and interact with the Kubeflow Trainer APIs:

    from kubeflow.trainer import Trainer, TrainerClient

    # Get available runtimes.
    TrainerClient().list_runtimes()

    # Train my ML model.
    TrainerClient().train(
        runtime_ref="torch-distributed",
        trainer=Trainer(
            func=train_func,
            func_args={"lr": 0.01},
            num_nodes=100,
            resources_per_node={"gpu": 5},
        ),
    )

We can always revisit the project name in the future if users tell us that this experience is bad.
What do you think?
------------- Kubeflow Platform -------------
- Kubeflow Workspaces <---- AI Model Development
- Kubeflow Spark <---- AI Data Processing
- Kubeflow Trainer <--- AI Model Training
- Kubeflow Optimizer <--- AI Model Optimization
- Kubeflow Model Registry <---- AI Model Management
- Kubeflow Pipelines <--- Run ML pipelines using the above tools
------------- Kubernetes --------------------
I would support this naming convention; it would be clearer to users than `training-operator` and `katib`.
I'm ok with `Kubeflow Trainer`. It's challenging to find a name that covers all of the use cases (ML Training / HPC Computing) and all user personas (Cluster Operators / ML Engineers / Researchers / Backend Engineers). I believe that `Trainer` can flexibly cover all of them.
But I agree with the discussions here, since `Trainer` does not exactly cover everything. Everyone talks about the project name from a different point of view (personas and workload specifications).
Note that previously we were considering `Kubeflow Batch` (or `Job`) as an alternative project name.
However, we declined to introduce `Batch` as a name, since we believe `Trainer` can imply machine learning semantics, unlike `Batch`.
Signed-off-by: Andrey Velichkevich <[email protected]>
@@ -122,7 +122,7 @@ trialSpec:
 ```

 If you use `PyTorchJob` or other Training Operator jobs in your Trial template check
-[here](/docs/components/training/user-guides/tensorflow/#what-is-tfjob) how to set the annotation.
+[here](/docs/components/training/legacy-v1/user-guides/tensorflow/#what-is-tfjob) how to set the annotation.
Should there be a comma between "template" and "check"?
weight = 10
+++

This document describes how to contribute to Kubeflow Training project.
I see inconsistent use of "This document", "This guide", and "This page" throughout this PR. We can just start the sentence with something like "To contribute to the Kubeflow Training project..." or something similar. Just a suggestion.
I think historically we've been using this at the beginning of the page:
> This document describes how to ....

However, some pages have various messages.
@pdarshane From your point of view, how should we start our guides?
cc @StefanoFioravanzo @varodrig
@andreyvelich based on previous conversation, we will not be having a specific page for the contribution/community.
What do you mean, @varodrig? Even right now, individual Kubeflow projects have their own contributor guides:
- Kubeflow Training Operator: https://github.com/kubeflow/training-operator/blob/master/CONTRIBUTING.md
- Kubeflow Katib: https://github.com/kubeflow/katib/blob/master/CONTRIBUTING.md
- Spark Operator: https://www.kubeflow.org/docs/components/spark-operator/developer-guide/
I thought we had just discussed whether we should cross-link these guides from the Kubeflow website.
I created an issue (#3971) to reflect the conversation we had and made a few updates; feel free to make any suggestions.
The main idea is not to have individual pages for each project on the website, but to keep one centralized place on the website that links to the Git repos.
```python
from kubeflow.training import TrainingClient
Alternatively, you can install the latest Kubeflow Python SDK version directly
We can eliminate "you can" to reduce wordiness.
+++

{{% alert title="Old Version" color="warning" %}}
This page is about **Kubeflow Training V1**, please see the [V2 documentation](/docs/components/training) for the latest information.
Style guides typically recommend against using "please".
Signed-off-by: Andrey Velichkevich <[email protected]>
Additionally, we need to:
- Update Katib documentation to point to v2
- Update Contributions page https://www.kubeflow.org/docs/about/contributing/
- Update Community Page https://www.kubeflow.org/docs/about/community/
@@ -0,0 +1,5 @@
+++
title = "Community Guide"
This is under discussion, given that other components do not have this content on the website. We want to ensure this is consistent across the website.
I created an issue (#3971) to reflect the conversation we had and made a few updates; feel free to make any suggestions.
The main idea is not to have individual pages for each project on the website, but to keep one centralized place on the website that links to the Git repos.
@@ -0,0 +1,7 @@
+++
This is under discussion, given that other components do not have this content on the website. We want to ensure this is consistent across the website.
## Getting Started with PyTorch

TODO (andreyvelich): Add example from the Notebook
What about "This doc is in progress", or just removing the section until it is fully ready?
I will add the getting started example once we finish this PR with @astefanutti: kubeflow/training-operator#2387
This page is about **Kubeflow Training Operator V1**, for the latest information check
[the Kubeflow Trainer V2 documentation](/docs/components/trainer).

Follow [this guide for migrating to Kubeflow Trainer V2](/docs/components/trainer/operator-guides/migration)
My two cents here: given that the component's name changed, should we say V2, or just Kubeflow Trainer?
Maybe we could just say:
Follow this guide for migrating to the new Kubeflow Trainer project.
WDYT @varodrig @kubeflow/wg-training-leads ?
@@ -0,0 +1,12 @@
+++
title = "Legacy (v1)"
I recommend something like "Legacy Kubeflow Training Operator v1". Since the name changed, just "Legacy" may not be enough for users who are looking for the Kubeflow Training Operator docs, so keeping the name will be helpful.
We also need to update the images on this page that reference the Training Operator, such as "Kubeflow Ecosystem" and "Kubeflow Components in the ML Lifecycle".
Yes, good point. I will update these images.
Do we need to update the "Training Operator Python SDK" link ("to manage Training Operator jobs using Python APIs") in the Kubeflow APIs and SDKs section?
@varodrig Where do you want to update it ?
@andreyvelich it's on the https://www.kubeflow.org/docs/started/architecture/#kubeflow-apis-and-sdks
let me know if this helps
Good catch, let me update it!
Kubeflow Trainer is designed for two primary user personas, each with specific resources and
responsibilities:

<img src="/docs/components/trainer/images/user-personas.drawio.svg"
one of the logos in the diagram is broken
@varodrig Can you please point to the diagram that is broken?
I just noticed that the image itself is fine, but it does not render correctly on the web page. One of the main problems is that the image's background is black, while the page background is white, so all the white content (persona titles, arrows, logo titles) does not show. The Kubeflow Python SDK logo is the one that is not shown properly.
Which browser do you use @varodrig ?
@thesuperzapper Do you know what is the right way to export images from drawio to avoid such problems ?
I'm using Chrome
I have the same problem as @varodrig, and I am using Chrome as well.
Kubernetes docs mostly don't use draw.io (some do, there are lots of diagrams). See https://kubernetes.io/docs/contribute/style/diagram-guide/ for what Kubernetes recommends.
In "Installing the control plane", the text "if you have already installed kubeflow platform" links to the latest version of Kubeflow. Should we replace the link with the 1.9 release?
https://v1-9-branch.kubeflow.org/docs/started/installing-kubeflow/
Actually, the latest version of Kubeflow Platform 1.10 will also include Training Operator v1.
In the Next Steps section, the link points to the latest Getting Started guide instead of the V1 one:
"Run your first Training Operator Job by following the Getting Started guide."
Signed-off-by: Andrey Velichkevich <[email protected]>
LGTM! Just had a few minor doubts
Install the Kubeflow Python SDK to interact with Kubeflow Trainer APIs:

```bash
pip install kubeflow
```
Currently, when I run this command, I see that the V1 SDK is installed. Should we wait for it to be updated before adding it here, or is it fine to keep it?
Yes, good point!
Let me remove it until we release the SDK to PyPI, and keep only the command that installs the SDK from source.
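To see which SDK a given `pip install` actually pulled in, one option is to query the installed distribution metadata. This is only a minimal sketch: the helper `installed_version` is hypothetical, and the distribution names `kubeflow` and `kubeflow-training` are assumptions used for illustration, not names confirmed in this thread.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions for illustration; adjust to the names you expect.
for name in ("kubeflow", "kubeflow-training"):
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Running this before and after the install command shows whether the environment ends up with the package you expected.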
@@ -16,7 +16,7 @@ Istio [automatic sidecar injection](https://istio.io/v1.3/docs/setup/additional-
In order to get it running, it needs annotation `sidecar.istio.io/inject: "false"`
Oh, it is a great find! I think we should link the examples from the `release-1.9` branch.
Let me update it.
Add info about Kubeflow Python SDK Signed-off-by: Andrey Velichkevich <[email protected]>
If you don't have Kubernetes cluster, you can quickly create one locally using [Kind](https://kind.sigs.k8s.io/docs/user/quick-start#installing-with-a-package-manager):

```bash
brew install kind
```
Maybe that line can be removed, as it's specific to macOS, and at that stage we can assume the user has already installed Kind or has a Kubernetes cluster.
## Prerequisites

Ensure that you have access to a Kubernetes cluster with Kubeflow Trainer
control plane installed. If it is not set up yet, followÍ
- control plane installed. If it is not set up yet, followÍ
+ control plane installed. If it is not set up yet, follow
Ensure that you have access to a Kubernetes cluster with Kubeflow Trainer
control plane installed. If it is not set up yet, followÍ
[the installation guide](/docs/components/trainer/operator-guides/installation) to quickly deploy
Kubeflow Trainer on your local Kind cluster.
- Kubeflow Trainer on your local Kind cluster.
+ Kubeflow Trainer.
It may be better to say simply "quickly deploy Kubeflow Trainer". The provided link gives the specifics.
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
8192869 to f2afda3
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Just a few comments:)
- [Kubeflow Python SDK](https://github.com/kubeflow/training-operator/blob/master/sdk_v2/kubeflow/training/api/training_client.py)
  to interact with Kubeflow Trainer APIs and to manage TrainJobs.
Should we rename it to `Trainer Python SDK`, since we don't have `kubeflow/sdk` now?
Actually, we want to rename the `kubeflow_training` SDK to the `kubeflow` SDK: https://github.com/kubeflow/training-operator/blob/master/sdk_v2/pyproject.toml#L7.
We will not push it to PyPI until we finalize the proposal to create a new `kubeflow/sdk` repo.
Thus, I prefer that we call it the Kubeflow SDK in the docs.
Signed-off-by: Andrey Velichkevich <[email protected]>
a1663be to 34532f9
Fixes: kubeflow/training-operator#2214
This is the initial version of the Kubeflow Training V2 docs.
Please let me know what you think.
TODOs:
/cc @kubeflow/wg-training-leads @kubeflow/release-team @hbelmiro @varodrig @jbottum @varshaprasad96 @akshaychitneni @helenxie-bit @Electronic-Waste @saileshd1402 @seanlaii @deepanker13 @astefanutti @shravan-achar @kannon92 @droctothorpe @sandipanpanda @vsoch @franciscojavierarceo @Syulin7 @StefanoFioravanzo @kuizhiqing