[WIP] Training: Initial Documentation for Kubeflow Trainer V2 #3958

andreyvelich · 2025-01-14T02:38:26Z

This is the initial version of the Kubeflow Training V2 docs.
Please let me know what you think.

TODOs:

Add working Getting Started example.
Fix the installation scripts.
Add new logo of Kubeflow Training.
Rename Kubeflow Training Operator -> Kubeflow Training everywhere.

/cc @kubeflow/wg-training-leads @kubeflow/release-team @hbelmiro @varodrig @jbottum @varshaprasad96 @akshaychitneni @helenxie-bit @Electronic-Waste @saileshd1402 @seanlaii @deepanker13 @astefanutti @shravan-achar @kannon92 @droctothorpe @sandipanpanda @vsoch @franciscojavierarceo @Syulin7 @StefanoFioravanzo @kuizhiqing

Signed-off-by: Andrey Velichkevich <[email protected]>

google-oss-prow · 2025-01-14T02:38:39Z

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: astefanutti, kannon92, shravan-achar, vsoch, kubeflow/wg-training-leads, kubeflow/release-team, akshaychitneni, seanlaii, varshaprasad96, saileshd1402.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Fixes: kubeflow/training-operator#2214


google-oss-prow · 2025-01-14T02:38:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.


Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich · 2025-01-14T02:41:52Z

content/en/docs/components/training/legacy-v1/_index.md

+{{% alert title="Old Version" color="warning" %}}
+This page is about **Kubeflow Training V1**, please see the [V2 documentation](/docs/components/training) for the latest information.
+
+Please follow [this guide for migrating to Kubeflow Training V2](/docs/components/training/admin-guides/migration)
+{{% /alert %}}


Let me know if that message looks good to you @kubeflow/wg-training-leads @rimolive @varodrig @hbelmiro @StefanoFioravanzo.
If yes, I will add it to all Kubeflow Training V1 docs, similar to KFP: https://www.kubeflow.org/docs/components/pipelines/legacy-v1/overview/quickstart/

Electronic-Waste

@andreyvelich Great contributions! I left some initial comments for you.

Electronic-Waste · 2025-01-14T02:45:05Z

content/en/docs/components/training/admin-guides/installation.md

+- Kubernetes >= 1.27
+- `kubectl` >= 1.27


AFAIK, we've removed our support for 1.27: https://github.com/kubeflow/training-operator/blob/1dfa40c12516fc9eb2ce12c5ef52da7d46670457/.github/workflows/unittests.yaml#L21

Good point, let me update it.
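A minimal sketch of how a version prerequisite like the one discussed above could be checked before installing. The function names are hypothetical, the minimum version is passed in as an argument because the thread does not state the new minimum, and `sort -V` is assumed to be available (GNU coreutils or a recent BSD `sort`).

```shell
# version_ge A B: succeed when version A is greater than or equal to version B.
version_ge() {
  # sort -V orders version strings; if B sorts first (or ties), then A >= B.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_k8s_version ACTUAL MINIMUM (hypothetical helper)
# Strips a leading "v" (e.g. v1.29.3 -> 1.29.3) from both values and compares.
check_k8s_version() {
  version_ge "${1#v}" "${2#v}"
}

# Example use, with <minimum> left as a placeholder for the documented value:
#   check_k8s_version "v1.29.3" "<minimum>" || echo "cluster is too old"
```

The actual server version can be read from `kubectl version -o json`, which reports it under `serverVersion.gitVersion`.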

Electronic-Waste · 2025-01-14T02:46:38Z

content/en/docs/components/training/admin-guides/installation.md

+
+```
+
+## Installing the Kubeflow Training Runtimes


I think it would be better if we could provide a standalone installation guide :)

Actually, I am planning to refactor our manifests given that the Cluster Training Runtime needs to be installed after the manager.
I will soon submit a PR.

I would guess that installing the cluster training runtime requires the CRDs to be installed first, rather than the manager to be deployed. Projects tend to separate the steps: install the CRDs first, then the rest of the manifests.

Actually, we also need to install the manager before we deploy the CTR (Cluster Training Runtime), since we perform validation and mutation via webhooks.
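The ordering described in this thread could be sketched roughly as below. This is only an illustration: `install_in_order` is a hypothetical helper, and the runtimes kustomization, namespace, and manager Deployment name are placeholders that the thread does not specify. The manager overlay is assumed to bring in the CRDs together with the manager and its webhooks.

```shell
# install_in_order MANAGER_OVERLAY RUNTIMES_OVERLAY NAMESPACE DEPLOYMENT
# Hypothetical helper: all four arguments are supplied by the caller.
install_in_order() {
  manager_overlay="$1"
  runtimes_overlay="$2"
  namespace="$3"
  deployment="$4"

  # 1. Apply the manager overlay (assumed to include the CRDs and webhooks).
  kubectl apply --server-side -k "$manager_overlay" || return 1

  # 2. Block until the manager Deployment has rolled out, so the webhooks
  #    have a backing pod before any runtime is submitted.
  kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=180s || return 1

  # 3. Only now apply the cluster training runtimes, which the webhooks
  #    validate and mutate.
  kubectl apply --server-side -k "$runtimes_overlay"
}

# Example use (placeholders in angle brackets are not taken from the thread):
#   install_in_order \
#     "github.com/kubeflow/training-operator.git/manifests/v2/overlays/manager?ref=master" \
#     "<runtimes-kustomization>" "<namespace>" "<manager-deployment>"
```

A completed rollout does not strictly guarantee that the webhook endpoint is already serving, so a retry around the last step may still be needed in practice.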

Electronic-Waste · 2025-01-14T02:48:36Z

content/en/docs/components/training/legacy-v1/installation.md

+- Kubernetes >= 1.27
+- `kubectl` >= 1.27
+- Python >= 3.7


Same as above

Fix links Signed-off-by: Andrey Velichkevich <[email protected]>

Signed-off-by: Andrey Velichkevich <[email protected]>


vsoch · 2025-01-14T13:14:08Z

content/en/docs/components/training/admin-guides/installation.md

+TODO (andreyvelich): Change the link once V1 is removed.
+
+```bash
+kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/v2/overlays/manager?ref=master"


Don't most people copy and paste a full URL that starts with https?

Actually, this is how you can use a remote URL with kubectl and kustomize (e.g. they also accept the SSH URL: github.com:kubeflow/training-operator.git).
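To make the shape of the remote target from the diff explicit, here is a small sketch that assembles it from its parts: the repository, the path inside the repository, and the Git ref. `kustomize_remote` is a hypothetical helper, and the optional scheme argument reflects the assumption that kustomize also accepts the same target with an `https://` prefix.

```shell
# kustomize_remote REPO PATH REF [SCHEME]
# Prints "<scheme><repo>.git/<path>?ref=<ref>"; SCHEME defaults to empty.
kustomize_remote() {
  printf '%s%s.git/%s?ref=%s\n' "${4:-}" "$1" "$2" "$3"
}

# Example use, reproducing the target quoted in the diff:
#   kubectl apply --server-side -k \
#     "$(kustomize_remote github.com/kubeflow/training-operator manifests/v2/overlays/manager master)"
```

Splitting the target this way makes it easier to pin `ref` to a release tag instead of `master` once one is available.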

vsoch · 2025-01-14T13:16:02Z

content/en/docs/components/training/installation.md


 ## Prerequisites

-These are the minimal requirements to install the Training Operator:
+Ensure that you have access to a Kubernetes clusters with the Kubeflow Training


Suggested change

Ensure that you have access to a Kubernetes clusters with the Kubeflow Training

Ensure that you have access to a Kubernetes cluster with the Kubeflow Training

vsoch · 2025-01-14T13:16:54Z

content/en/docs/components/training/installation.md

+[the installation guide](/docs/components/training/admin-guides/installation) to quickly deploy
+Kubeflow Training on a local Kind cluster.


Suggested change

[the installation guide](/docs/components/training/admin-guides/installation) to quickly deploy

Kubeflow Training on a local Kind cluster.

[the installation guide](/docs/components/training/admin-guides/installation) to deploy

Kubeflow Training.

No reason it needs to be kind, and if there are webhooks it won't be that quick :P

I intentionally added this message to convey that it is very easy to deploy Kubernetes locally and quickly try Kubeflow Training. I don't want to scare our ML users with the "Kubernetes" dependency.
@kubeflow/wg-training-leads @vsoch @astefanutti @franciscojavierarceo @StefanoFioravanzo Any thoughts on how we can phrase it better in the docs?

I agree with @vsoch's suggestion. There doesn't seem to be a need to be too specific here: "quickly deploy Kubeflow Trainer" on its own makes it even less scary and does not exclude non-Kind options.

How do you think we can make this message better, especially for users who don't know what Kubernetes is?

@vsoch @astefanutti ?

vsoch · 2025-01-14T13:17:10Z

content/en/docs/components/training/installation.md


-## Installing the Training Operator
+You can chose between installing the latest stable release of the development version from


Suggested change

You can chose between installing the latest stable release of the development version from

You can chose between installing the latest stable release or the development version from

vsoch · 2025-01-14T13:26:15Z