[WIP] Training: Initial Documentation for Kubeflow Trainer V2 #3958
base: master
@@ -1,5 +1,5 @@
+++
title = "Training Operator"
description = "Documentation for Kubeflow Training Operator"
weight = 70
title = "Kubeflow Training"
What do you think about calling it "Kubeflow Trainer"? 🤔

Hmm, this could be one of the options to name this project: Kubeflow Trainer/KFTrainer.

It would follow HuggingFace's and Lightning.ai's "trainer" convention: https://huggingface.co/docs/transformers/en/main_classes/trainer

So we have these:

CRD names: `TrainJob` and `TrainingRuntime`.

SDK APIs:

```python
from kubeflow.trainer import TrainerClient, Trainer

TrainerClient().train(
    Trainer(
        func=train_func,
    ),
    num_nodes=5,
    runtime_ref="torch-distributed",
)

# For LLMs
from kubeflow.trainer import TrainerClient, Trainer, FineTuningConfig, LoraConfig

TrainerClient().train(
    Trainer(
        fine_tuning_config=FineTuningConfig(
            peft_config=LoraConfig(
                r=4,
            ),
        ),
    ),
    num_nodes=5,
    runtime_ref="llama-3.2-8b",
)
```

What do we think about these names?

cc @astefanutti @kubeflow/wg-training-leads @Electronic-Waste @deepanker13 @saileshd1402 @seanlaii @kannon92

@andreyvelich I don't like that the Kubeflow platform combines the products together? Tell me more! I love the idea of bringing components together, but I also love the idea of installing them separately. If I remember correctly, there is someone dying on that hill :-). That being said, we still don't have conformance, and we still don't have a unified definition of Kubeflow, so it might be a bit early to rename anything tbh. It feels like procrastination chores when we've got big fish to fry here. Also,
Why not? What are you optimizing for, and based on what demand? Have we opened this up to the greater community? Who are our "customers", so to speak, and how would they want APIs/components labeled? This seems like a big decision to make in a vacuum, and thought leadership is a service (Kelsey has for sure played this game well). What if we opened it up to the greater community? Posted some options as polls via the social channels we publish as a community? We can then be more data driven. I bet @StefanoFioravanzo has some opinions!

One concern to me is that the training operator is not limited to the training use case but can be expanded to parallel computing (via MPIJob). Similarly,
My points don't imply that Kubeflow products cannot function as standalone applications. However, for users interested in seeing how these individual open-source projects integrate seamlessly, the Kubeflow Platform provides a comprehensive, end-to-end machine learning experience.
I believe it will take our community significantly more time to discuss this thoroughly (approximately 1–2 years given the user base of Kubeflow Pipelines).
That is correct, but right now it is out of scope of the supported CRDs (TrainJob and TrainingRuntime). Theoretically, users can leverage TrainingRuntimes and TrainJob for distributed inference with MPI, but it would be better if we create dedicated CRDs for it.

All in all, naming this project `Trainer` gives users this experience:

```python
from kubeflow.trainer import TrainerClient, Trainer

# Get available runtimes.
TrainerClient().list_runtimes()

# Train my ML model.
TrainerClient().train(
    runtime_ref="torch-distributed",
    trainer=Trainer(
        func=train_func,
        func_args={"lr": 0.01},
        num_nodes=100,
        resources_per_node={"gpu": 5},
    ),
)
```

We can always revisit the project name in the future if users tell us that this experience is bad.
I would support this naming convention; it would be clearer to users than the current one.

I'm ok with it. But I agree with the discussions here. Note that, previously, we were seeking the Kubeflow
description = "Documentation for Kubeflow Training" | ||
weight = 20 | ||
+++ |
@@ -0,0 +1,5 @@
+++
title = "Contributor Guides"
description = "Documentation for Kubeflow Training contributors"
weight = 60
+++
@@ -0,0 +1,7 @@
+++
title = "Contributing Guide"
description = "How to contribute to the Kubeflow Training project"
weight = 10
+++
This document describes how to contribute to the Kubeflow Training project.
I see inconsistent use of "This document", "This guide", and "This page" throughout this PR. We can just start the sentence with something like "To contribute to the Kubeflow Training project..." or something similar. Just a suggestion.

I think historically we've been using this at the beginning of the page:
However, some pages have various messages.

@andreyvelich based on our previous conversation, we will not be having a specific page for the contribution/community.

What do you mean, @varodrig? Even right now, individual Kubeflow projects have their own contributor guides:
I created an issue to reflect the conversation we had and made a few updates; feel free to make any suggestions.
@@ -1,158 +1,34 @@
+++
title = "Getting Started"
description = "Get started with the Training Operator"
description = "Get Started with Kubeflow Training"
weight = 30
+++

This guide describes how to get started with the Training Operator and run a few simple examples.
This guide describes how to get started with Kubeflow Training and run distributed training
with PyTorch.

## Prerequisites

You need to install the following components to run examples:
Ensure that you have access to a Kubernetes cluster with the Kubeflow Training
control plane installed. If it is not set up yet, follow
[the installation guide](/docs/components/training/operator-guides/installation) to quickly deploy
Kubeflow Training on a local Kind cluster.

- The Training Operator control plane [installed](/docs/components/training/installation/#installing-the-control-plane).
- The Training Python SDK [installed](/docs/components/training/installation/#installing-the-python-sdk).

### Installing the Kubeflow Python SDK

## Getting Started with PyTorchJob

Install the Kubeflow Python SDK to interact with Kubeflow Training APIs:

You can create your first Training Operator distributed PyTorchJob using the Python SDK. Define the
training function that implements end-to-end model training. Each Worker will execute this
function on the appropriate Kubernetes Pod. Usually, this function contains logic to
download the dataset, create the model, and train the model.

The Training Operator will automatically set `WORLD_SIZE` and `RANK` for the appropriate PyTorchJob
worker to perform [PyTorch Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).

If you install the Training Operator as part of the Kubeflow Platform, you can open a new
[Kubeflow Notebook](/docs/components/notebooks/quickstart-guide/) to run this script. If you
install the Training Operator standalone, make sure that you
[configure local `kubeconfig`](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#programmatic-access-to-the-api)
to access your Kubernetes cluster where you installed the Training Operator.
```python
def train_func():
    import os

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DistributedSampler
    from torchvision import datasets, transforms
    import torch.distributed as dist

    # [1] Setup PyTorch DDP. Distributed environment will be set automatically by Training Operator.
    dist.init_process_group(backend="nccl")
    Distributor = torch.nn.parallel.DistributedDataParallel
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    print(
        "Distributed Training for WORLD_SIZE: {}, RANK: {}, LOCAL_RANK: {}".format(
            dist.get_world_size(),
            dist.get_rank(),
            local_rank,
        )
    )

    # [2] Create PyTorch CNN Model.
    class Net(torch.nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = torch.nn.Conv2d(1, 20, 5, 1)
            self.conv2 = torch.nn.Conv2d(20, 50, 5, 1)
            self.fc1 = torch.nn.Linear(4 * 4 * 50, 500)
            self.fc2 = torch.nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # [3] Attach model to the correct GPU device and distributor.
    device = torch.device(f"cuda:{local_rank}")
    model = Net().to(device)
    model = Distributor(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    # [4] Setup FashionMNIST dataloader and distribute data across PyTorchJob workers.
    dataset = datasets.FashionMNIST(
        "./data",
        download=True,
        train=True,
        transform=transforms.Compose([transforms.ToTensor()]),
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=128,
        sampler=DistributedSampler(dataset),
    )

    # [5] Start model Training.
    for epoch in range(3):
        for batch_idx, (data, target) in enumerate(train_loader):
            # Attach Tensors to the device.
            data = data.to(device)
            target = target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )


from kubeflow.training import TrainingClient

# Start PyTorchJob with 3 Workers and 1 GPU per Worker (e.g. multi-node, multi-worker job).
TrainingClient().create_job(
    name="pytorch-ddp",
    train_func=train_func,
    num_procs_per_worker="auto",
    num_workers=3,
    resources_per_worker={"gpu": "1"},
)
```
```bash
pip install kubeflow
```
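As a quick sanity check after installation, a minimal sketch like the following may help. It assumes the V2 package layout discussed in the review thread above, where the `kubeflow.trainer` module exposes a `TrainerClient` with a `list_runtimes()` method, and that a valid `kubeconfig` points at the cluster where Kubeflow Training is installed:

```python
from kubeflow.trainer import TrainerClient

# List the training runtimes available in the cluster (for example, "torch-distributed").
# If the import fails or no runtimes appear, verify that the SDK version matches the
# control plane deployed in the cluster.
for runtime in TrainerClient().list_runtimes():
    print(runtime)
```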
## Getting Started with TFJob

Similar to the PyTorchJob example, you can use the Python SDK to create your first distributed
TensorFlow job. Run the following script to create a TFJob with the pre-created Docker image
`docker.io/kubeflow/tf-mnist-with-summaries:latest` that contains
[distributed TensorFlow code](https://github.com/kubeflow/training-operator/tree/e6b4300f9dfebb5c2a3269641c828add367688ee/examples/tensorflow/mnist_with_summaries):

```python
from kubeflow.training import TrainingClient

TrainingClient().create_job(
    name="tensorflow-dist",
    job_kind="TFJob",
    base_image="docker.io/kubeflow/tf-mnist-with-summaries:latest",
    num_workers=3,
)
```

Alternatively, you can install the latest Kubeflow Python SDK version directly
from the source repository:

We can eliminate "you can" to reduce wordiness.
```bash
pip install git+https://github.com/kubeflow/training-operator.git@master#subdirectory=sdk_v2
```
Run the following API to get logs from your TFJob:

```python
TrainingClient().get_job_logs(
    name="tensorflow-dist",
    job_kind="TFJob",
    follow=True,
)
```

## Next steps

- Run the [FashionMNIST example](https://github.com/kubeflow/training-operator/blob/7345e33b333ba5084127efe027774dd7bed8f6e6/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb) using the Training Operator Python SDK.

## Getting Started with PyTorch

- Learn more about [the PyTorchJob APIs](/docs/components/training/user-guides/pytorch/).

TODO (andreyvelich): Add example from the Notebook
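Until the notebook example lands, here is a rough sketch of what that getting-started snippet could look like. It assumes the `TrainerClient`/`Trainer` API proposed in the review discussion above (the exact signatures are still work in progress in this PR) and uses a hypothetical minimal `train_func`:

```python
from kubeflow.trainer import TrainerClient, Trainer


def train_func():
    # Hypothetical minimal training function; a real guide would include the
    # full PyTorch DDP training loop shown in the V1 example above.
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")
    print(f"Hello from rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()


# Submit a TrainJob that runs train_func on 2 nodes using the
# "torch-distributed" runtime referenced earlier in the discussion.
TrainerClient().train(
    runtime_ref="torch-distributed",
    trainer=Trainer(
        func=train_func,
        num_nodes=2,
        resources_per_node={"gpu": 1},
    ),
)
```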
@@ -0,0 +1,11 @@
+++
title = "Legacy (v1)"
description = "Kubeflow Training V1 Documentation"
weight = 999
+++

{{% alert title="Old Version" color="warning" %}}
This page is about **Kubeflow Training V1**, please see the [V2 documentation](/docs/components/training) for the latest information.

Style guides typically recommend against using "please".

Please follow [this guide for migrating to Kubeflow Training V2](/docs/components/training/operator-guides/migration)
{{% /alert %}}
Should there be a comma between "template" and "check"?