Update getting-started.md #3950

Open · wants to merge 1 commit into base: master
content/en/docs/components/training/getting-started.md (4 changes: 2 additions & 2 deletions)
@@ -36,6 +36,7 @@ def train_func():
     from torch.utils.data import DistributedSampler
     from torchvision import datasets, transforms
     import torch.distributed as dist
+    import os
 
     # [1] Setup PyTorch DDP. Distributed environment will be set automatically by Training Operator.
     dist.init_process_group(backend="nccl")
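For context on the added import: torchrun and the Training Operator export variables such as RANK, WORLD_SIZE, and LOCAL_RANK into each worker's environment, and the surrounding example presumably reads them through os.environ, which is why `import os` is needed. A minimal sketch of that pattern (illustrative only, not the exact lines from the doc):

```python
import os

import torch
import torch.distributed as dist


def train_func():
    # The Training Operator sets up the distributed environment, so the
    # process group can be initialized directly.
    dist.init_process_group(backend="nccl")

    # These variables are injected into every worker Pod by torchrun.
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
    rank = dist.get_rank()                      # global rank of this process
    world_size = dist.get_world_size()          # total number of processes

    device = torch.device(f"cuda:{local_rank}")
    print(f"rank {rank}/{world_size} running on {device}")
```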
@@ -116,8 +117,7 @@ from kubeflow.training import TrainingClient
 # Start PyTorchJob with 3 Workers and 1 GPU per Worker (e.g. multi-node, multi-worker job).
 TrainingClient().create_job(
     name="pytorch-ddp",
-    train_func=train_func,
-    num_procs_per_worker="auto",
Member:
We might need to keep it since this value indicates that we should use torchrun as an entrypoint: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L479
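For reference, keeping the parameter as suggested would mean leaving the documented call roughly as it was before this PR (reconstructed from the deleted lines above; the inline note about torchrun reflects the linked SDK code, and the exact behavior of "auto" depends on the installed SDK version):

```python
from kubeflow.training import TrainingClient

# train_func is the training function defined earlier in the doc.
TrainingClient().create_job(
    name="pytorch-ddp",
    train_func=train_func,
    num_procs_per_worker="auto",  # signals the SDK to use torchrun as the entrypoint
    num_workers=3,
    resources_per_worker={"gpu": "1"},
)
```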

@ronaldpetty (Author), Jan 6, 2025:
I have deleted my (cluster) environment, but I will try to recreate it. From what I saw, "num_procs_per_worker" was removed (the call failed to run with it). I also loaded the SDK locally in a REPL and looked at the help, and the parameter was missing there as well.
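One way to reproduce that REPL check, assuming the kubeflow-training package is installed locally (generic Python introspection, not something from the PR itself):

```python
import inspect

from kubeflow.training import TrainingClient

# Print the installed SDK's create_job signature to see whether
# num_procs_per_worker is still an accepted parameter in this version.
print(inspect.signature(TrainingClient.create_job))
```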

Member:
@ronaldpetty Can you check the version of the SDK that you are using?

pip show kubeflow-training
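An equivalent check from inside Python, in case the version is easier to read programmatically (importlib.metadata is in the standard library since Python 3.8):

```python
from importlib.metadata import version

print(version("kubeflow-training"))
```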

+    train_func=train_func,
     num_workers=3,
     resources_per_worker={"gpu": "1"},
 )
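Once the call above succeeds, the same client can be used to follow the job; a quick sketch (the get_job_logs parameters shown here are assumptions based on common SDK usage, not part of this PR):

```python
from kubeflow.training import TrainingClient

client = TrainingClient()

# Stream logs from the PyTorchJob created above; follow=True is assumed
# to tail the logs until the job completes.
client.get_job_logs(name="pytorch-ddp", follow=True)
```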