[WIP] Training: Initial Documentation for Kubeflow Trainer V2 #3958

Open
wants to merge 19 commits into base: master

20 changes: 19 additions & 1 deletion content/en/_redirects
Original file line number Diff line number Diff line change
@@ -337,4 +337,22 @@ docs/started/requirements/ /docs/started/getting-started/
/docs/components/pipelines/v2/reference/api/kubeflow-pipeline-api-spec/ /docs/components/pipelines/reference/api/kubeflow-pipeline-api-spec/
/docs/components/pipelines/v2/reference/sdk/ /docs/components/pipelines/reference/sdk/
/docs/components/pipelines/v2/run-a-pipeline/ /docs/components/pipelines/user-guides/core-functions/run-a-pipeline/
/docs/components/pipelines/v2/version-compatibility/ /docs/components/pipelines/reference/version-compatibility/
/docs/components/pipelines/v2/version-compatibility/ /docs/components/pipelines/reference/version-compatibility/

# Kubeflow Training V2 (https://github.com/kubeflow/training-operator/issues/2214)
/docs/components/training/installation/ /docs/components/training/legacy-v1/installation/
/docs/components/training/explanation/ /docs/components/training/legacy-v1/explanation/
/docs/components/training/explanation/fine-tuning/ /docs/components/training/legacy-v1/explanation/fine-tuning/
/docs/components/training/reference/ /docs/components/training/legacy-v1/reference/
/docs/components/training/reference/architecture/ /docs/components/training/legacy-v1/reference/architecture/
/docs/components/training/reference/distributed-training/ /docs/components/training/legacy-v1/reference/distributed-training/
/docs/components/training/reference/fine-tuning/ /docs/components/training/legacy-v1/reference/fine-tuning/
/docs/components/training/user-guides/ /docs/components/training/legacy-v1/user-guides/
/docs/components/training/user-guides/fine-tuning/ /docs/components/training/legacy-v1/user-guides/fine-tuning/
/docs/components/training/user-guides/jax/ /docs/components/training/legacy-v1/user-guides/jax/
/docs/components/training/user-guides/job-scheduling/ /docs/components/training/legacy-v1/user-guides/job-scheduling/
/docs/components/training/user-guides/mpi/ /docs/components/training/legacy-v1/user-guides/mpi/
/docs/components/training/user-guides/paddle/ /docs/components/training/legacy-v1/user-guides/paddle/
/docs/components/training/user-guides/prometheus/ /docs/components/training/legacy-v1/user-guides/prometheus/
/docs/components/training/user-guides/tensorflow/ /docs/components/training/legacy-v1/user-guides/tensorflow/
/docs/components/training/user-guides/xgboost/ /docs/components/training/legacy-v1/user-guides/xgboost/
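
The redirect rules added above all follow one pattern: each former `/docs/components/training/...` path maps to the same path under `legacy-v1/`. A minimal sketch of a consistency check follows, assuming the rules are whitespace-separated `source target` pairs as shown in this hunk. The helper name `check_legacy_redirects` is hypothetical and not part of the Kubeflow website tooling.

```python
PREFIX = "/docs/components/training/"
LEGACY_PREFIX = PREFIX + "legacy-v1/"


def check_legacy_redirects(text: str) -> list[str]:
    """Return the rules whose target is not the source moved under legacy-v1/."""
    mismatches = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        source, target = parts[0], parts[1]
        # Only inspect rules for the training docs.
        if not source.startswith(PREFIX):
            continue
        expected = LEGACY_PREFIX + source[len(PREFIX):]
        if target != expected:
            mismatches.append(line)
    return mismatches


# Two rules copied from the hunk above (hypothetical test input).
rules = """
# Kubeflow Training V2
/docs/components/training/installation/ /docs/components/training/legacy-v1/installation/
/docs/components/training/user-guides/mpi/ /docs/components/training/legacy-v1/user-guides/mpi/
"""
print(check_legacy_redirects(rules))
```

A check like this could run over the full `_redirects` file before merging, so that a mistyped target path is caught early.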
Original file line number Diff line number Diff line change
@@ -122,7 +122,7 @@ trialSpec:
```
If you use `PyTorchJob` or other Training Operator jobs in your Trial template check
[here](/docs/components/training/user-guides/tensorflow/#what-is-tfjob) how to set the annotation.
[here](/docs/components/training/legacy-v1/user-guides/tensorflow/#what-is-tfjob) how to set the annotation.
Contributor

Should there be a comma between "template" and "check"?


## Running the Experiment

Original file line number Diff line number Diff line change
@@ -16,13 +16,13 @@ In Katib examples, you can find the following examples for Trial's Workers:

- [Kubernetes `Job`](https://kubernetes.io/docs/concepts/workloads/controllers/job/)

- [Kubeflow `TFJob`](/docs/components/training/user-guides/tensorflow)
- [Kubeflow `TFJob`](/docs/components/training/legacy-v1/user-guides/tensorflow)

- [Kubeflow `PyTorchJob`](/docs/components/training/user-guides/pytorch/)
- [Kubeflow `PyTorchJob`](/docs/components/training/legacy-v1/user-guides/pytorch/)

- [Kubeflow `XGBoostJob`](/docs/components/training/user-guides/xgboost)
- [Kubeflow `XGBoostJob`](/docs/components/training/legacy-v1/user-guides/xgboost)

- [Kubeflow `MPIJob`](/docs/components/training/user-guides/mpi)
- [Kubeflow `MPIJob`](/docs/components/training/legacy-v1/user-guides/mpi)

- [Tekton `Pipelines`](https://github.com/kubeflow/katib/tree/master/examples/v1beta1/tekton)

6 changes: 3 additions & 3 deletions content/en/docs/components/training/_index.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
+++
title = "Training Operator"
description = "Documentation for Kubeflow Training Operator"
weight = 70
title = "Kubeflow Training"
Contributor

What do you think about calling it "Kubeflow Trainer"? 🤔

Member Author

Hmm, Kubeflow Trainer/KFTrainer could be one of the naming options for this project.
@franciscojavierarceo Can you please explain why you prefer this project name over KFTraining?

Member Author

@andreyvelich andreyvelich Jan 15, 2025

So we have these:
Project Name:
Kubeflow Trainer

CRD Names:

  • TrainJob
  • TrainingRuntime

SDK APIs:

from kubeflow.trainer import TrainerClient, Trainer

TrainerClient().train(
    Trainer(
        func=train_func,
    ),
    num_nodes=5,
    runtime_ref="torch-distributed"
)

# For LLMs
from kubeflow.trainer import TrainerClient, Trainer, FineTuningConfig, LoraConfig

TrainerClient().train(
    Trainer(
        fine_tuning_config=FineTuningConfig(
            peft_config=LoraConfig(
                r=4,
            ),
        ),
    ),
    num_nodes=5,
    runtime_ref="llama-3.2-8b",
)

What do we think about these names?

@chasecadet chasecadet Jan 20, 2025

@andreyvelich I don't like that Kubeflow platform combines the products together? Tell me more! I love the idea of bringing components together but also love the idea of installing them separately. If I remember correctly, there is someone dying on that hill :-). That being said, we still don't have conformance, and we still don't have a unified definition of Kubeflow, so it might be a bit early to rename anything tbh. It feels like procrastination chores when we've got big fish to fry here.

Also,

> For example, probably we don't want to rename Kubeflow Pipelines to Kubeflow Pipelines Service.

Why not? What are you optimizing for and based on what demand? Have we opened this up to the greater community? Who are our "customers", so to speak, and how would they want APIs/Components labeled? This seems like a big decision to make in a vacuum, and thought leadership is service (Kelsey has for sure played this game well). What if we opened it up to the greater community and posted some options in polls via the socials we publish as a community? We can then be more data driven. I bet @StefanoFioravanzo has some opinions!

Member

One concern I have is that the training operator is not limited to the training use case but can be extended to parallel computing (via MPIJob). Similarly, PyTorchJob is not limited to a PyTorch trainer; it is more like PyTorch distributed, which offers primitives and abstractions for parallelism, sharding, and communication.

Member Author

@andreyvelich andreyvelich Jan 20, 2025

> @andreyvelich I don't like that Kubeflow platform combines the products together?

My points don't imply that Kubeflow products cannot function as standalone applications. However, for users interested in seeing how these individual open-source projects integrate seamlessly, the Kubeflow Platform provides a comprehensive, end-to-end machine learning experience.

> Why not? What are you optimizing for and based on what demand? Have we opened this up to the greater community?

I believe it will take our community significantly more time to discuss this thoroughly (approximately 1–2 years given the user base of Kubeflow Pipelines).
In the meantime, we should focus on releasing our current project.

> One concern I have is that the training operator is not limited to the training use case but can be extended to parallel computing (via MPIJob). Similarly, PyTorchJob is not limited to a PyTorch trainer; it is more like PyTorch distributed, which offers primitives and abstractions for parallelism, sharding, and communication.

That is correct, but right now it is out of scope for the supported CRDs (TrainJob and TrainingRuntime). Theoretically, users can leverage TrainingRuntimes and TrainJob for distributed inference with MPI, but it would be better to create dedicated CRDs for it. Also, the Kubeflow Python SDK doesn't support it.


All in all, naming this project Kubeflow Trainer will help avoid user confusion, considering the following user journey:

  1. Cluster Operators install the Kubeflow Trainer controller manager into a Kubernetes cluster.
  2. Cluster Operators configure the required Training Runtimes for ML users.
  3. ML Users use the Kubeflow Python SDK to create TrainJob objects and interact with the Kubeflow Trainer APIs:
from kubeflow.trainer import TrainerClient, Trainer

# Get available runtimes.
TrainerClient().list_runtimes()

# Train my ML model
TrainerClient().train(
    runtime_ref="torch-distributed",
    trainer=Trainer(
        func=train_func,
        func_args={"lr": 0.01},
        num_nodes=100,
        resources_per_node={"gpu": 5},
    ),
)

We can always revisit the project name in the future if users tell us that this experience is bad.
What do you think?
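
The three-step journey above can be illustrated with a self-contained mock. Everything in it is a hypothetical stand-in written for this illustration: `FakeTrainerClient`, its in-memory runtime list, and the dictionary it returns do not come from the Kubeflow SDK, whose API was still under discussion in this thread. The `Trainer` dataclass only mirrors the field names proposed in the comment.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trainer:
    """Hypothetical stand-in mirroring the fields proposed in the comment above."""
    func: Callable[..., None]
    func_args: dict = field(default_factory=dict)
    num_nodes: int = 1
    resources_per_node: dict = field(default_factory=dict)


class FakeTrainerClient:
    """In-memory mock; a real client would talk to the Kubernetes API."""

    def __init__(self, runtimes: list[str]):
        # Step 2 of the journey: operators have configured these runtimes.
        self._runtimes = list(runtimes)

    def list_runtimes(self) -> list[str]:
        return list(self._runtimes)

    def train(self, runtime_ref: str, trainer: Trainer) -> dict:
        # Reject references to runtimes the operator has not configured.
        if runtime_ref not in self._runtimes:
            raise ValueError(f"unknown runtime: {runtime_ref}")
        # Describe the TrainJob that would be created instead of creating it.
        return {
            "kind": "TrainJob",
            "runtimeRef": runtime_ref,
            "numNodes": trainer.num_nodes,
            "entrypoint": trainer.func.__name__,
        }


def train_func(lr: float = 0.01) -> None:
    print(f"training with lr={lr}")


client = FakeTrainerClient(runtimes=["torch-distributed"])
job = client.train(
    runtime_ref="torch-distributed",
    trainer=Trainer(func=train_func, func_args={"lr": 0.01}, num_nodes=100),
)
print(job)
```

The mock keeps the split between operators (who decide which runtimes exist) and ML users (who only reference a runtime by name), which is the separation the naming proposal relies on.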

Member

@Electronic-Waste Electronic-Waste Jan 21, 2025

------------- Kubeflow Platform -------------
- Kubeflow Workspaces          <---- AI Model Development
- Kubeflow Spark               <---- AI Data Processing
- Kubeflow Trainer             <---- AI Model Training
- Kubeflow Optimizer           <---- AI Model Optimization
- Kubeflow Model Registry      <---- AI Model Management
- Kubeflow Pipelines           <---- Run ML pipelines using the above tools
------------- Kubernetes --------------------

I would support this naming convention; it would be clearer to users than training-operator and katib.

Also cc👀 @Doris-xm @truc0

Member

I'm ok with Kubeflow Trainer. It's challenging to find a name that covers all the use cases (ML training / HPC computing) and all user personas (Cluster Operators / ML Engineers / Researchers / Backend Engineers). I believe that Trainer can flexibly cover all of them.

But I agree with the discussion here, since Trainer does not cover everything exactly. Everyone approaches the project name from a different point of view (personas and workload specifications).

Note that we previously considered Kubeflow Batch (or Job) as an alternative project name.
However, we decided against Batch as a name since we believe Trainer implies machine learning semantics better than Batch does.

description = "Documentation for Kubeflow Training"
weight = 20
+++
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
+++
title = "Contributor Guides"
description = "Documentation for Kubeflow Training contributors"
weight = 60
+++
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
+++
title = "Contributing Guide"
description = "How to contribute to Kubeflow Training project"
weight = 10
+++

This document describes how to contribute to Kubeflow Training project.
Contributor

@pdarshane pdarshane Jan 17, 2025

I see inconsistent use of "This document", "This guide", and "This page" throughout this PR. We could just start the sentence with something like "To contribute to the Kubeflow Training project...". Just a suggestion.

Member Author

I think historically we've been using this at the beginning of the page:

This document describes how to ....

However, some pages use different openings.
@pdarshane From your point of view, how should we start our guides?
cc @StefanoFioravanzo @varodrig

Contributor

@andreyvelich based on our previous conversation, we will not have a specific page for contribution/community.

Member Author

What do you mean, @varodrig? Even right now, individual Kubeflow projects have their own contributor guides:

Contributor

I created an issue (#3971) to reflect the conversation we had and made a few updates; feel free to make any suggestions.
The main idea is not to have individual pages for each project on the website, but to keep one centralized place on the website with links to the Git repos.

158 changes: 17 additions & 141 deletions content/en/docs/components/training/getting-started.md
Original file line number Diff line number Diff line change
@@ -1,158 +1,34 @@
+++
title = "Getting Started"
description = "Get started with the Training Operator"
description = "Get Started with Kubeflow Training"
weight = 30
+++

This guide describes how to get started with the Training Operator and run a few simple examples.
This guide describes how to get started with Kubeflow Training and run distributed training
with PyTorch.

## Prerequisites

You need to install the following components to run examples:
Ensure that you have access to a Kubernetes cluster with Kubeflow Training
control plane installed. If it is not set up yet, follow
[the installation guide](/docs/components/training/operator-guides/installation) to quickly deploy
Kubeflow Training on a local Kind cluster.

- The Training Operator control plane [installed](/docs/components/training/installation/#installing-the-control-plane).
- The Training Python SDK [installed](/docs/components/training/installation/#installing-the-python-sdk).
### Installing the Kubeflow Python SDK

## Getting Started with PyTorchJob
Install the Kubeflow Python SDK to interact with Kubeflow Training APIs:

You can create your first Training Operator distributed PyTorchJob using the Python SDK. Define the
training function that implements end-to-end model training. Each Worker will execute this
function on the appropriate Kubernetes Pod. Usually, this function contains logic to
download dataset, create model, and train the model.

The Training Operator will automatically set `WORLD_SIZE` and `RANK` for the appropriate PyTorchJob
worker to perform [PyTorch Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
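
As a rough illustration of the paragraph above, a worker process can read its distributed settings from the environment. This sketch assumes the variables are named `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` (the names used in this guide) and falls back to single-process defaults when they are absent. `read_dist_env` is a hypothetical helper, not part of any SDK.

```python
import os


def read_dist_env(env=None) -> dict:
    """Read distributed-training settings, defaulting to a single process."""
    env = os.environ if env is None else env
    return {
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
    }


settings = read_dist_env()
print(f"rank {settings['rank']} of {settings['world_size']}")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise outside a cluster.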

If you install the Training Operator as part of the Kubeflow Platform, you can open a new
[Kubeflow Notebook](/docs/components/notebooks/quickstart-guide/) to run this script. If you
install the Training Operator standalone, make sure that you
[configure local `kubeconfig`](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#programmatic-access-to-the-api)
to access your Kubernetes cluster where you installed the Training Operator.

```python
def train_func():
    import os

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DistributedSampler
    from torchvision import datasets, transforms
    import torch.distributed as dist

    # [1] Setup PyTorch DDP. Distributed environment will be set automatically by Training Operator.
    dist.init_process_group(backend="nccl")
    Distributor = torch.nn.parallel.DistributedDataParallel
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    print(
        "Distributed Training for WORLD_SIZE: {}, RANK: {}, LOCAL_RANK: {}".format(
            dist.get_world_size(),
            dist.get_rank(),
            local_rank,
        )
    )

    # [2] Create PyTorch CNN Model.
    class Net(torch.nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = torch.nn.Conv2d(1, 20, 5, 1)
            self.conv2 = torch.nn.Conv2d(20, 50, 5, 1)
            self.fc1 = torch.nn.Linear(4 * 4 * 50, 500)
            self.fc2 = torch.nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # [3] Attach model to the correct GPU device and distributor.
    device = torch.device(f"cuda:{local_rank}")
    model = Net().to(device)
    model = Distributor(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    # [4] Setup FashionMNIST dataloader and distribute data across PyTorchJob workers.
    dataset = datasets.FashionMNIST(
        "./data",
        download=True,
        train=True,
        transform=transforms.Compose([transforms.ToTensor()]),
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=128,
        sampler=DistributedSampler(dataset),
    )

    # [5] Start model Training.
    for epoch in range(3):
        for batch_idx, (data, target) in enumerate(train_loader):
            # Attach Tensors to the device.
            data = data.to(device)
            target = target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )


from kubeflow.training import TrainingClient

# Start PyTorchJob with 3 Workers and 1 GPU per Worker (e.g. multi-node, multi-worker job).
TrainingClient().create_job(
    name="pytorch-ddp",
    train_func=train_func,
    num_procs_per_worker="auto",
    num_workers=3,
    resources_per_worker={"gpu": "1"},
)
```bash
pip install kubeflow
```

## Getting Started with TFJob

Similar to the PyTorchJob example, you can use the Python SDK to create your first distributed
TensorFlow job. Run the following script to create TFJob with pre-created Docker image:
`docker.io/kubeflow/tf-mnist-with-summaries:latest` that contains
[distributed TensorFlow code](https://github.com/kubeflow/training-operator/tree/e6b4300f9dfebb5c2a3269641c828add367688ee/examples/tensorflow/mnist_with_summaries):

```python
from kubeflow.training import TrainingClient
Alternatively, you can install the latest Kubeflow Python SDK version directly
Contributor

We can eliminate "you can" to reduce wordiness.

from the source repository:

TrainingClient().create_job(
    name="tensorflow-dist",
    job_kind="TFJob",
    base_image="docker.io/kubeflow/tf-mnist-with-summaries:latest",
    num_workers=3,
)
```bash
pip install git+https://github.com/kubeflow/training-operator.git@master#subdirectory=sdk_v2
```
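
After either install command, a quick way to check that the package can be found by the interpreter is sketched below. It assumes the distribution exposes a top-level `kubeflow` module, as the import statements elsewhere in this thread suggest. `is_installed` is a hypothetical helper and only handles top-level module names.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("kubeflow importable:", is_installed("kubeflow"))
```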

Run the following API to get logs from your TFJob:

```python
TrainingClient().get_job_logs(
    name="tensorflow-dist",
    job_kind="TFJob",
    follow=True,
)
```

## Next steps

- Run the [FashionMNIST example](https://github.com/kubeflow/training-operator/blob/7345e33b333ba5084127efe027774dd7bed8f6e6/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb) with using Training Operator Python SDK.
## Getting Started with PyTorch

- Learn more about [the PyTorchJob APIs](/docs/components/training/user-guides/pytorch/).
TODO (andreyvelich): Add example from the Notebook
11 changes: 11 additions & 0 deletions content/en/docs/components/training/legacy-v1/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
+++
title = "Legacy (v1)"
description = "Kubeflow Training V1 Documentation"
weight = 999
+++

{{% alert title="Old Version" color="warning" %}}
This page is about **Kubeflow Training V1**, please see the [V2 documentation](/docs/components/training) for the latest information.
Contributor

Style guides typically recommend against using "please".


Please follow [this guide for migrating to Kubeflow Training V2](/docs/components/training/operator-guides/migration)
{{% /alert %}}
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ share your experience using the [#kubeflow-training Slack channel](/docs/about/c
or [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
{{% /alert %}}

This page explains how the [Training Operator fine-tuning API](/docs/components/training/user-guides/fine-tuning)
This page explains how the [Training Operator fine-tuning API](/docs/components/training/legacy-v1/user-guides/fine-tuning)
fits into the Kubeflow ecosystem.

In the rapidly evolving landscape of machine learning (ML) and artificial intelligence (AI),
@@ -60,4 +60,4 @@ Different user personas can benefit from this feature:

## Next Steps

- Understand [the architecture behind `train` API](/docs/components/training/reference/fine-tuning).
- Understand [the architecture behind `train` API](/docs/components/training/legacy-v1/reference/fine-tuning).