[WIP] Training: Initial Documentation for Kubeflow Trainer V2 #3958

Open
wants to merge 19 commits into
base: master

Member

@andreyvelich andreyvelich commented Jan 14, 2025

Fixes: kubeflow/training-operator#2214

This is the initial version of the Kubeflow Training V2 docs.
Please let me know what you think.

TODOs:

  • Add working Getting Started example.
  • Fix the installation scripts.
  • Add new logo of Kubeflow Training.
  • Rename Kubeflow Training Operator -> Kubeflow Training everywhere.

/cc @kubeflow/wg-training-leads @kubeflow/release-team @hbelmiro @varodrig @jbottum @varshaprasad96 @akshaychitneni @helenxie-bit @Electronic-Waste @saileshd1402 @seanlaii @deepanker13 @astefanutti @shravan-achar @kannon92 @droctothorpe @sandipanpanda @vsoch @franciscojavierarceo @Syulin7 @StefanoFioravanzo @kuizhiqing

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: astefanutti, kannon92, shravan-achar, vsoch, kubeflow/wg-training-leads, kubeflow/release-team, akshaychitneni, seanlaii, varshaprasad96, saileshd1402.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

@google-oss-prow google-oss-prow bot requested a review from deepanker13 January 14, 2025 02:38
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 7 to 11
{{% alert title="Old Version" color="warning" %}}
This page is about **Kubeflow Training V1**, please see the [V2 documentation](/docs/components/training) for the latest information.

Please follow [this guide for migrating to Kubeflow Training V2](/docs/components/training/admin-guides/migration)
{{% /alert %}}
Member Author

Let me know if that message looks good to you @kubeflow/wg-training-leads @rimolive @varodrig @hbelmiro @StefanoFioravanzo.
If yes, I will add it to all Kubeflow Training V1 docs, similar to KFP: https://www.kubeflow.org/docs/components/pipelines/legacy-v1/overview/quickstart/

Member

@Electronic-Waste Electronic-Waste left a comment

@andreyvelich Great contributions! I left some initial comments for you.

Comment on lines 16 to 17
- Kubernetes >= 1.27
- `kubectl` >= 1.27
Member

Member Author

Good point, let me update it.



## Installing the Kubeflow Training Runtimes
Member

I think it would be better if we could provide a standalone installation guide :)

Member Author

Actually, I am planning to refactor our manifests given that the Cluster Training Runtime needs to be installed after the manager.
I will soon submit a PR.

Contributor

I would guess installing the cluster training runtime requires the CRDs to be installed first, more than the manager to be deployed. Projects tend to separate the steps to install CRDs first, then the rest of the manifests.

Member Author

Actually, we also need to install the manager before we deploy the Cluster Training Runtime (CTR), since we perform validation and mutation via webhooks.
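
The ordering constraint discussed in this thread can be sketched as a small dependency graph. This is a minimal Python sketch, not the project's installer: the step names are hypothetical, and the edges encode only what the comments above state (the Cluster Training Runtimes need the CRDs, and also the manager because of the webhooks). That the manager itself needs the CRDs first is an assumption.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each step maps to the steps it must wait for.
dependencies = {
    "crds": set(),
    # Assumption: the manager's controllers need the CRDs to exist first.
    "manager": {"crds"},
    # From the thread: runtimes are validated and mutated by the manager's webhooks.
    "cluster-training-runtimes": {"crds", "manager"},
}

# static_order() yields every step after all of its prerequisites.
install_order = list(TopologicalSorter(dependencies).static_order())
print(install_order)
```

Because the graph is a simple chain, there is only one valid order, which is why splitting the manifests into separately applied pieces comes up in the discussion.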

Comment on lines 15 to 17
- Kubernetes >= 1.27
- `kubectl` >= 1.27
- Python >= 3.7
Member

Same as above

Fix links

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
TODO (andreyvelich): Change the link once V1 is removed.

```bash
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/v2/overlays/manager?ref=master"
```

Don't most people copy and paste a full URL that starts with https?

Member Author

@andreyvelich andreyvelich Jan 14, 2025

Actually, this is how you use a remote URL with kubectl and kustomize (e.g., they also accept the SSH URL: github.com:kubeflow/training-operator.git).
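
To make the URL shape in this thread concrete, here is a minimal Python sketch that assembles the kustomize remote target used in the command under review. The helper name `kustomize_remote_target` and its parameters are hypothetical and exist only for illustration; the repository, path, and ref values are the ones quoted in the diff.

```python
def kustomize_remote_target(repo: str, path: str, ref: str) -> str:
    """Build a kustomize remote target of the form <repo>/<path>?ref=<ref>.

    Hypothetical helper for illustration; not part of any Kubeflow tooling.
    """
    return f"{repo.rstrip('/')}/{path.strip('/')}?ref={ref}"


# Values taken from the kubectl command under review.
target = kustomize_remote_target(
    repo="github.com/kubeflow/training-operator.git",
    path="manifests/v2/overlays/manager",
    ref="master",
)

# Argument list suitable for subprocess.run(); nothing is executed here.
command = ["kubectl", "apply", "--server-side", "-k", target]
print(" ".join(command))
```

Passing the command as an argument list avoids shell quoting issues with the `?` in the target.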


## Prerequisites

These are the minimal requirements to install the Training Operator:
Ensure that you have access to a Kubernetes clusters with the Kubeflow Training

Suggested change
- Ensure that you have access to a Kubernetes clusters with the Kubeflow Training
+ Ensure that you have access to a Kubernetes cluster with the Kubeflow Training

Comment on lines 13 to 14
[the installation guide](/docs/components/training/admin-guides/installation) to quickly deploy
Kubeflow Training on a local Kind cluster.

Suggested change
- [the installation guide](/docs/components/training/admin-guides/installation) to quickly deploy
- Kubeflow Training on a local Kind cluster.
+ [the installation guide](/docs/components/training/admin-guides/installation) to deploy
+ Kubeflow Training.

No reason it needs to be kind, and if there are webhooks it won't be that quick :P

Member Author

I intentionally added this message to show that it is easy to deploy Kubernetes locally and quickly try Kubeflow Training. I don't want to scare our ML users with the "Kubernetes" dependency.
@kubeflow/wg-training-leads @vsoch @astefanutti @franciscojavierarceo @StefanoFioravanzo Any thoughts on how we can phrase it better in the docs?

Contributor

I agree with @vsoch's suggestion. There doesn't seem to be a need to be too specific here; "quickly deploy Kubeflow Trainer" on its own makes it even less scary and does not exclude non-Kind options.

Member Author

How do you think we can make this message better, especially for users who don't know what Kubernetes is?

Member Author


## Installing the Training Operator
You can chose between installing the latest stable release of the development version from

Suggested change
- You can chose between installing the latest stable release of the development version from
+ You can chose between installing the latest stable release or the development version from

PyTorch, TensorFlow, XGBoost, JAX, and others.
## What is the Kubeflow Training

The Kubeflow Training is a Kubernetes-native project for large language models (LLMs) fine-tuning

Suggested change
- The Kubeflow Training is a Kubernetes-native project for large language models (LLMs) fine-tuning
+ The Kubeflow Training project is a Kubernetes-native project for large language models (LLMs) fine-tuning

with the Training Operator to orchestrate their ML training on Kubernetes.
with the Kubeflow Training to orchestrate their ML training on Kubernetes.

The Kubeflow Training allows you effortlessly develop your LLMs with the Kubeflow Python SDK and

Ditto (and above this)

supports running Message Passing Interface (MPI) on Kubernetes which is heavily used for HPC.
The Training Operator implements the V1 API version of MPI Operator. For the MPI Operator V2 version,
please follow [this guide](/docs/components/training/user-guides/mpi/) to install MPI Operator V2.
The Kubeflow Training is designed for two primary user personas, each with specific resources and

Ditto

various distributed training strategies for different ML frameworks.
### User Personas

The Kubeflow Training documentation is separated between these user personas:

Ditto


## Why use the Kubeflow Training

The Kubeflow Training supports key phases on the AI/ML lifecycle, including model training and LLMs

Ditto

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich force-pushed the issue-2214-kubeflow-training-v2 branch from 434733e to c8d5eff on January 14, 2025 20:44
Signed-off-by: Andrey Velichkevich <[email protected]>
/docs/components/training/user-guides/prometheus /docs/components/training/legacy-v1/user-guides/prometheus
/docs/components/training/user-guides/tensorflow /docs/components/training/legacy-v1/user-guides/tensorflow
/docs/components/training/user-guides/xgboost /docs/components/training/legacy-v1/user-guides/xgboost
Member Author

@andreyvelich andreyvelich Jan 14, 2025

We discussed with @thesuperzapper how to structure the new docs, and he suggested that we put new user guides under
/docs/components/training/user-guides-v2
For example: https://deploy-preview-3958--competent-brattain-de2d6d.netlify.app/docs/components/training/user-guides-v2/pytorch/

That will allow us to redirect the V1 docs (e.g. TFJob) to the correct location.

What do you think about it @kubeflow/wg-training-leads ?

Member

@andreyvelich make sure you also redirect the "index" pages e.g.

  • /docs/components/training/explanation/ -> /docs/components/training/legacy-v1/explanation/
  • /docs/components/training/user-guides/ -> /docs/components/training/legacy-v1/user-guides/
  • /docs/components/training/reference/ -> /docs/components/training/legacy-v1/reference/

Also, I am not sure if it matters, but all other redirects use a trailing slash (/).

Member Author

Good catch, let me fix that.
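
The redirect rule requested in this thread (move each V1 page under `legacy-v1/`, include the index pages, and end both sides with a trailing slash) can be sketched as follows. This is a minimal Python sketch, assuming the space-separated `<from> <to>` line format shown in the diff above; the helper name `legacy_redirect` is hypothetical.

```python
def legacy_redirect(path: str, base: str = "/docs/components/training") -> str:
    """Return a '<from> <to>' redirect line that moves a V1 page under legacy-v1/.

    Hypothetical helper: both sides are normalized to end with a trailing slash.
    """
    suffix = path[len(base):].strip("/") if path.startswith(base + "/") else ""
    if not suffix:
        raise ValueError(f"{path!r} is not a page under {base!r}")
    return f"{base}/{suffix}/ {base}/legacy-v1/{suffix}/"


# Index pages named in the review, plus one user guide from the diff.
pages = [
    "/docs/components/training/explanation/",
    "/docs/components/training/user-guides/",
    "/docs/components/training/reference/",
    "/docs/components/training/user-guides/tensorflow",
]
for page in pages:
    print(legacy_redirect(page))
```

Generating the lines from one list keeps the trailing-slash convention consistent across index pages and individual guides.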

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
title = "Training Operator"
description = "Documentation for Kubeflow Training Operator"
weight = 70
title = "Kubeflow Training"
Contributor

What do you think about calling it "Kubeflow Trainer"? 🤔

Member Author

Hmm, naming this project Kubeflow Trainer/KFTrainer could be one of the options.
@franciscojavierarceo Can you please explain why you prefer this project name over KFTraining?

Contributor

Member Author

@andreyvelich andreyvelich Jan 15, 2025

So we have these:
Project Name:
Kubeflow Trainer

CRD Names:

  • TrainJob
  • TrainingRuntime

SDK APIs:

from kubeflow.trainer import TrainerClient, Trainer

TrainerClient().train(
    Trainer(
        func=train_func,
    ),
    num_nodes=5,
    runtime_ref="torch-distributed"
)

# For LLMs
from kubeflow.trainer import TrainerClient, Trainer, FineTuningConfig, LoraConfig

TrainerClient().train(
    Trainer(
        fine_tuning_config=FineTuningConfig(
            peft_config=LoraConfig(
                r=4,
            ),
        ),
    ),
    num_nodes=5,
    runtime_ref="llama-3.2-8b",
)

What do we think about these names ?

Member Author

@andreyvelich andreyvelich Jan 15, 2025

@chasecadet chasecadet Jan 20, 2025

@andreyvelich I don't like that Kubeflow platform combines the products together? Tell me more! I love the idea of bringing components together but also love the idea of installing them separately. If I remember correctly, there is someone dying on that hill :-). That being said, we still don't have conformance.. we still don't have a unified definition of Kubeflow..so it might be a bit early to rename anything tbh. It feels like procrastination chores when we've got big fish to fry here.

Also,

For example, probably we don't want to rename Kubeflow Pipelines to Kubeflow Pipelines Service.

Why not? What are you optimizing for and based on what demand? Have we opened this up to the greater community? Who are our "customers" so to speak and how would they want APIs/Components labeled? This seems like a big decision to make in a vacuum and thought leadership is service (Kelsey has for sure played this game well). What if we opened it up to the greater community? Posted some options in polls via the socials we publish as a community? We can then be more data driven. I bet @StefanoFioravanzo has some opinions!

Member

One concern of mine is that the Training Operator is not limited to the training use case; it can be extended to parallel computing (via MPIJob). Similarly, PyTorchJob is not limited to a PyTorch trainer; it is more like PyTorch Distributed, which offers primitives and abstractions for parallelism, sharding, and communication.

Member Author

@andreyvelich andreyvelich Jan 20, 2025

@andreyvelich I don't like that Kubeflow platform combines the products together?

My points don't imply that Kubeflow products cannot function as standalone applications. However, for users interested in seeing how these individual open-source projects integrate seamlessly, the Kubeflow Platform provides a comprehensive, end-to-end machine learning experience.

Why not? What are you optimizing for and based on what demand? Have we opened this up to the greater community?

I believe it will take our community significantly more time to discuss this thoroughly (approximately 1–2 years given the user base of Kubeflow Pipelines).
In the meantime, we should focus on releasing our current project.

One concern of mine is that the Training Operator is not limited to the training use case; it can be extended to parallel computing (via MPIJob). Similarly, PyTorchJob is not limited to a PyTorch trainer; it is more like PyTorch Distributed, which offers primitives and abstractions for parallelism, sharding, and communication.

That is correct, but right now it is out of scope for the supported CRDs (TrainJob and TrainingRuntime). Theoretically, users can leverage TrainingRuntimes and TrainJob for distributed inference with MPI, but it would be better if we created dedicated CRDs for it. Also, our Kubeflow Python SDK doesn't support it.


All in all, naming this project Kubeflow Trainer will help avoid user confusion, considering the following user journey:

  1. Cluster Operators install the Kubeflow Trainer controller manager into the Kubernetes cluster.
  2. Cluster Operators configure the required Training Runtimes for ML users.
  3. ML Users use the Kubeflow Python SDK to create TrainJob objects and interact with the Kubeflow Trainer APIs:
from kubeflow.trainer import Trainer, TrainerClient

# Get available runtimes.
TrainerClient().list_runtimes()

# Train my ML model
TrainerClient().train(
    runtime_ref="torch-distributed",
    trainer=Trainer(
        func=train_func,
        func_args={"lr": 0.01},
        num_nodes=100,
        resources_per_node={"gpu": 5},
    ),
)

We can always revisit the project name in the future if users tell us that this experience is bad.
What do you think ?

Member

@Electronic-Waste Electronic-Waste Jan 21, 2025

------------- Kubeflow Platform -------------
- Kubeflow Workspaces          <---- AI Model Development
- Kubeflow Spark               <---- AI Data Processing
- Kubeflow Trainer             <---- AI Model Training
- Kubeflow Optimizer           <---- AI Model Optimization
- Kubeflow Model Registry      <---- AI Model Management
- Kubeflow Pipelines           <---- Run ML pipelines using the above tools
------------- Kubernetes --------------------

I would support this naming convention; it would be clearer to users than training-operator and katib.

Also cc👀 @Doris-xm @truc0

Member

I'm OK with Kubeflow Trainer. It's challenging to find a name that covers all of the use cases (ML training / HPC computing) and all user personas (Cluster Operators / ML Engineers / Researchers / Backend Engineers). I believe that Trainer can flexibly cover them all.

But I agree with the discussion here, since Trainer does not cover everything exactly. Everyone looks at the project name from a different point of view (personas and workload specifications).

Note that we previously considered Kubeflow Batch (or Job) as an alternative project name.
However, we decided against Batch as a name, since we believe Trainer implies machine learning semantics better than Batch does.

Signed-off-by: Andrey Velichkevich <[email protected]>
@@ -122,7 +122,7 @@ trialSpec:

If you use `PyTorchJob` or other Training Operator jobs in your Trial template check
[here](/docs/components/training/user-guides/tensorflow/#what-is-tfjob) how to set the annotation.
[here](/docs/components/training/legacy-v1/user-guides/tensorflow/#what-is-tfjob) how to set the annotation.
Contributor

Should there be a comma between "template" and "check"?

weight = 10
+++

This document describes how to contribute to Kubeflow Training project.
Contributor

@pdarshane pdarshane Jan 17, 2025

I see inconsistent use of "This document", "This guide", and "This page" throughout this PR. We can just start the sentence with something like "To contribute to the Kubeflow Training project..." or similar. Just a suggestion.

Member Author

I think, historically we've been using this in the beginning of the page:

This document describes how to ....

However, some pages have various messages.
@pdarshane From your point of view, how should we start our guides ?
cc @StefanoFioravanzo @varodrig

Contributor

@andreyvelich based on previous conversation, we will not be having a specific page for the contribution/community.

Member Author

What do you mean @varodrig ? Even right now, individual Kubeflow projects have their own contributor guides:

Contributor

I created an issue (#3971) to reflect the conversation we had and made a few updates; feel free to make suggestions.
The main idea is not to have individual pages for each project on the website, but to keep one centralized place on the website that links to the Git repos.


```python
from kubeflow.training import TrainingClient
```
Alternatively, you can install the latest Kubeflow Python SDK version directly
Contributor

We can eliminate "you can" to reduce wordiness.

+++

{{% alert title="Old Version" color="warning" %}}
This page is about **Kubeflow Training V1**, please see the [V2 documentation](/docs/components/training) for the latest information.
Contributor

Style guides typically recommend against using "please".

@andreyvelich andreyvelich changed the title from "[WIP] Training: Initial Documentation for Kubeflow Training V2" to "[WIP] Training: Initial Documentation for Kubeflow Trainer V2" Jan 17, 2025
@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Jan 17, 2025
Contributor

@varodrig varodrig left a comment

Additionally, we need to:

@@ -0,0 +1,5 @@
+++
title = "Community Guide"
Contributor

This is under discussion, given that other components do not have this content on the website. We want to ensure this is consistent across the website.

Contributor

I created an issue (#3971) to reflect the conversation we had and made a few updates; feel free to make suggestions.
The main idea is not to have individual pages for each project on the website, but to keep one centralized place on the website that links to the Git repos.

@@ -0,0 +1,7 @@
+++
Contributor

This is under discussion, given that other components do not have this content on the website. We want to ensure this is consistent across the website.


## Getting Started with PyTorch

TODO (andreyvelich): Add example from the Notebook
Contributor

What about "This doc is in progress"? Or just remove the section until it is fully ready.

Member Author

I will add the getting started example once we finish this PR with @astefanutti: kubeflow/training-operator#2387

This page is about **Kubeflow Training Operator V1**, for the latest information check
[the Kubeflow Trainer V2 documentation](/docs/components/trainer).

Follow [this guide for migrating to Kubeflow Trainer V2](/docs/components/trainer/operator-guides/migration)
Contributor

My two cents: given that the component's name changed, should we still say V2, or just Kubeflow Trainer?

Member Author

Maybe we could just say:
Follow this guide for migrating to the new Kubeflow Trainer project.
WDYT @varodrig @kubeflow/wg-training-leads ?

@@ -0,0 +1,12 @@
+++
title = "Legacy (v1)"
Contributor

I recommend something like "Legacy Kubeflow Training Operator v1". Since the name changed, "Legacy" alone may not be enough for users who are looking for the Kubeflow Training Operator docs, so keeping the name will be helpful.

Contributor

We also need to update the images on this page that reference the Training Operator, such as the Kubeflow Ecosystem and Kubeflow Components in the ML Lifecycle.

Member Author

Yes, good point. I will update these images.

Contributor

Do we need to update the "Training Operator Python SDK to manage Training Operator jobs using Python APIs" link in the Kubeflow APIs and SDKs?

Member Author

@varodrig Where do you want to update it ?

Contributor

Member Author

Good catch, let me update it!

Kubeflow Trainer is designed for two primary user personas, each with specific resources and
responsibilities:

<img src="/docs/components/trainer/images/user-personas.drawio.svg"
Contributor

one of the logos in the diagram is broken

Member Author

@varodrig Can you please point to the diagram that is broken?

Contributor

I just noticed that the image itself is fine, but it does not display correctly on the web page. One of the main problems is that the image's background is black, while the page background is white when it loads, so all the white content (persona titles, arrows, logo titles) is not visible. The Kubeflow Python SDK logo is the one that is not fully displayed.
Screenshot 2025-01-20 at 6 55 28 pm

Member Author

Interesting, the rendering works correctly for me.

Screenshot 2025-01-21 at 00 48 05

Member Author

Which browser do you use, @varodrig?
@thesuperzapper Do you know the right way to export images from drawio to avoid such problems?

Contributor

I'm using Chrome

Member

I have the same problem as @varodrig, and I am using Chrome as well.

Member Author

@sftim @Arhell Do you know why we are getting this issue with Hugo?
Does Kubernetes also use drawio for its diagrams?


Kubernetes docs mostly don't use draw.io (some do, there are lots of diagrams). See https://kubernetes.io/docs/contribute/style/diagram-guide/ for what Kubernetes recommends.

Contributor

In "Installing the control plane", the link under "if you have already installed Kubeflow Platform" goes to the latest version of Kubeflow. Should we replace it with a link to the 1.9 release?

https://v1-9-branch.kubeflow.org/docs/started/installing-kubeflow/

Member Author

Actually, the latest version of Kubeflow Platform, 1.10, will also include Training Operator v1.

Contributor

In the Next Steps section, the link points to the latest Getting Started guide instead of the v1 guide:

Run your first Training Operator Job by following the Getting Started guide.

@saileshd1402 left a comment

LGTM! Just had a few minor doubts

Install the Kubeflow Python SDK to interact with Kubeflow Trainer APIs:

```bash
pip install kubeflow
```

Currently, when I run this command, the V1 SDK is installed. Should we wait until the package is updated before adding the command here, or is it fine to keep it?
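
One way to confirm which SDK a given `pip install` actually pulled in is to list the installed distributions whose names start with `kubeflow`. The following is a minimal sketch using only the Python standard library; the helper name and the `kubeflow` prefix filter are assumptions made for illustration and are not part of this PR.

```python
from importlib import metadata


def installed_kubeflow_packages(prefix: str = "kubeflow") -> dict[str, str]:
    """Map installed distribution names starting with `prefix` to their versions."""
    found: dict[str, str] = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        # Broken installs may report no name; skip those entries.
        if name and name.lower().startswith(prefix.lower()):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    # Print one "name==version" line per matching distribution.
    for name, version in sorted(installed_kubeflow_packages().items()):
        print(f"{name}=={version}")
```

Running it before and after the install makes it easy to see whether a V1 package was added to the environment.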

Member Author

Yes, good point!
Let me remove it until we release the SDK to PyPI, and keep only the command that installs the SDK from source.
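
For context, an install from source can be expressed as a pip VCS requirement that points at a subdirectory of the repository. The sketch below only builds that requirement string. The repository and the `sdk_v2` directory are taken from the links in this thread, while the `master` ref and the helper name are assumptions, and the exact command used in the docs under review may differ.

```python
def sdk_source_requirement(
    repo_url: str = "https://github.com/kubeflow/training-operator.git",
    ref: str = "master",  # assumed branch; pin a tag or commit for reproducibility
    subdirectory: str = "sdk_v2",  # SDK location referenced in this thread
) -> str:
    """Build a pip VCS requirement for a package that lives in a repo subdirectory."""
    return f"git+{repo_url}@{ref}#subdirectory={subdirectory}"


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(f"pip install '{sdk_source_requirement()}'")
```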

@@ -16,7 +16,7 @@ Istio [automatic sidecar injection](https://istio.io/v1.3/docs/setup/additional-
In order to get it running, it needs annotation `sidecar.istio.io/inject: "false"`

Do we need to be worried about links to examples/repo in documentation since we are moving V1 to a separate branch? For example: line 23, line 29 of this file. This would apply to most example job docs AFAIK

Member Author

Oh, great find! I think we should link to the examples from the release-1.9 branch.
Let me update it.
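
The link update described here could be done mechanically. Below is a rough sketch that rewrites GitHub `blob`/`tree` links into `kubeflow/training-operator` from `master` to `release-1.9`. The function name and the sample path in the usage are hypothetical, and the actual change in this PR may have been made by hand.

```python
import re

# Matches blob/tree links into kubeflow/training-operator that point at master.
# The negative lookahead avoids touching branch names that merely start with "master".
_MASTER_LINK = re.compile(
    r"(https://github\.com/kubeflow/training-operator/(?:blob|tree))/master(?![\w.-])"
)


def pin_example_links(markdown: str, branch: str = "release-1.9") -> str:
    """Point training-operator links at `branch` instead of master."""
    return _MASTER_LINK.sub(lambda m: f"{m.group(1)}/{branch}", markdown)


if __name__ == "__main__":
    # Hypothetical example path, used only to show the rewrite.
    sample = "[job](https://github.com/kubeflow/training-operator/blob/master/examples/job.yaml)"
    print(pin_example_links(sample))
```

Links to other repositories and links already pinned to a branch are left unchanged.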

Add info about Kubeflow Python SDK

Signed-off-by: Andrey Velichkevich <[email protected]>
If you don't have Kubernetes cluster, you can quickly create one locally using [Kind](https://kind.sigs.k8s.io/docs/user/quick-start#installing-with-a-package-manager):

```bash
brew install kind
```

Contributor

Maybe that line can be removed, as it's specific to macOS, and at that stage we can assume the user has already installed Kind or has a Kubernetes cluster.

## Prerequisites

Ensure that you have access to a Kubernetes cluster with Kubeflow Trainer
control plane installed. If it is not set up yet, followÍ

Contributor

Suggested change
control plane installed. If it is not set up yet, followÍ
control plane installed. If it is not set up yet, follow

Ensure that you have access to a Kubernetes cluster with Kubeflow Trainer
control plane installed. If it is not set up yet, followÍ
[the installation guide](/docs/components/trainer/operator-guides/installation) to quickly deploy
Kubeflow Trainer on your local Kind cluster.

Contributor

@astefanutti Jan 21, 2025

Suggested change
Kubeflow Trainer on your local Kind cluster.
Kubeflow Trainer.

It may be better to say just "quickly deploy Kubeflow Trainer". The provided link gives the specifics.

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jan 27, 2025
Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich force-pushed the issue-2214-kubeflow-training-v2 branch from 8192869 to f2afda3 Compare January 27, 2025 16:09
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Member

@Electronic-Waste left a comment

Just a few comments:)

Comment on lines +141 to +142
- [Kubeflow Python SDK](https://github.com/kubeflow/training-operator/blob/master/sdk_v2/kubeflow/training/api/training_client.py)
to interact with Kubeflow Trainer APIs and to manage TrainJobs.

Member

Should we rename it to Trainer Python SDK since we don't have kubeflow/sdk now?

Member Author

Actually, we want to rename the kubeflow_training SDK to the kubeflow SDK: https://github.com/kubeflow/training-operator/blob/master/sdk_v2/pyproject.toml#L7.
We will not push it to PyPI until we finalize the proposal to create a new kubeflow/sdk repo.
Thus, I prefer that we call it Kubeflow SDK in the docs.

content/en/docs/components/trainer/overview.md (outdated, resolved)
Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich force-pushed the issue-2214-kubeflow-training-v2 branch from a1663be to 34532f9 Compare January 30, 2025 16:47
Development

Successfully merging this pull request may close these issues.

KEP-2170: Update documentation for V2 APIs