feat: add KubeflowExecutor for Kubeflow Training Operator (PyTorchJob + TrainJob) #462

Open
ko3n1g wants to merge 4 commits into main from feat/pytorchjob-executor

Conversation


ko3n1g commented Mar 12, 2026

Summary

  • Adds KubeflowExecutor that submits distributed training jobs to any Kubernetes cluster running the Kubeflow Training Operator
  • Supports both PyTorchJob (Training Operator v1) and TrainJob (Training Operator v2) via a job_kind toggle
  • Pairs with a TorchX scheduler so jobs integrate with run.run() and run.Experiment
  • Kubernetes config loaded automatically (local kubeconfig → in-cluster fallback)
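The kubeconfig fallback in the last bullet is the standard kubernetes-client loading order; a minimal sketch (the executor's actual helper name is not shown in this PR, so `load_k8s_config` is illustrative):

```python
def load_k8s_config():
    """Try a local kubeconfig first, then fall back to in-cluster credentials."""
    from kubernetes import config  # optional dependency (nemo-run[kubeflow])
    try:
        # Reads ~/.kube/config (or $KUBECONFIG) on a workstation.
        config.load_kube_config()
    except config.ConfigException:
        # Inside a pod: use the mounted service-account token instead.
        config.load_incluster_config()
```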

PyTorchJob vs TrainJob

|            | PyTorchJob                    | TrainJob                        |
|------------|-------------------------------|---------------------------------|
| API        | `kubeflow.org/v1`             | `trainer.kubeflog.org/v1alpha1` |
| Pod config | directly in replica pod spec  | `podTemplateOverrides[].spec`   |
| nproc      | `spec.nprocPerNode`           | `spec.trainer.numProcPerNode`   |
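To make the table concrete, here is an illustrative sketch of where each field lands in the two custom resources. This is not the executor's real manifest builder; key names beyond those in the table (e.g. `pytorchReplicaSpecs`) follow the upstream CRDs but should be treated as assumptions here.

```python
def minimal_manifest(job_kind: str, nproc: int, pod_spec: dict) -> dict:
    """Illustrative only: show where the table's fields land in each CR."""
    if job_kind == "PyTorchJob":
        return {
            "apiVersion": "kubeflow.org/v1",
            "kind": "PyTorchJob",
            "spec": {
                "nprocPerNode": str(nproc),
                "pytorchReplicaSpecs": {
                    # Pod config goes directly into each replica's pod template.
                    "Worker": {"template": {"spec": pod_spec}},
                },
            },
        }
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "spec": {
            "trainer": {"numProcPerNode": str(nproc)},
            # Pod config is merged into a single podTemplateOverrides entry
            # (targeting the "node" job, per the commit message).
            "podTemplateOverrides": [{"spec": pod_spec}],
        },
    }
```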

Notable fields

  • tolerations, affinity — go into pod spec / podTemplateOverrides automatically
  • env_list — full env var dicts supporting valueFrom / secretKeyRef
  • pod_spec_overrides — arbitrary extra pod spec fields (e.g. resourceClaims for IMEX channels)
  • launch(wait=True) — polls until RUNNING / SUCCEEDED / FAILED
  • cancel(wait=True) — polls until CR gone and all pods terminated
  • UNKNOWN/None status → AppState.PENDING (avoids false failures on transient API errors)
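The `env_list` and `pod_spec_overrides` semantics above amount to a pass-through append plus a shallow merge; a sketch mirroring the bullets (not the real implementation, and the helper name is made up):

```python
def build_pod_spec(base: dict, env_list: list, overrides: dict) -> dict:
    """Illustrative merge of env_list and pod_spec_overrides into a pod spec."""
    spec = dict(base)
    container = dict(spec.get("containers", [{}])[0])
    # env_list entries are passed through verbatim, so valueFrom /
    # secretKeyRef constructs survive untouched.
    container["env"] = container.get("env", []) + list(env_list)
    spec["containers"] = [container]
    # pod_spec_overrides is merged on top of the generated spec,
    # e.g. resourceClaims for IMEX channels.
    spec.update(overrides)
    return spec
```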

Test plan

  • 63 unit tests passing (pytest test/core/execution/test_kubeflow.py test/run/torchx_backend/schedulers/test_kubeflow.py)
  • PyTorchJob e2e verified against AWS EKS (local/example.py): launch → RUNNING → log sentinel → cancel(wait=True)
  • TrainJob e2e pending GKE cluster readiness (local/example_trainjob.py)

🤖 Generated with Claude Code

ko3n1g and others added 3 commits March 12, 2026 16:25
Introduces KubeflowExecutor and a matching TorchX scheduler so users can
deploy distributed training jobs to any Kubernetes cluster running the
Kubeflow Training Operator via run.run() / run.Experiment.

Supported job kinds (toggled via job_kind field):
- PyTorchJob (Training Operator v1, kubeflow.org/v1)
- TrainJob   (Training Operator v2, trainer.kubeflow.org/v1alpha1)

Key features:
- Kubernetes config loaded automatically (local kubeconfig → in-cluster fallback)
- PyTorchJob: builds Master + Worker replica specs with nprocPerNode
- TrainJob: builds spec.trainer + merges all pod-level config (volumes,
  tolerations, affinity, imagePullSecrets, resourceClaims, etc.) into a
  single podTemplateOverrides entry targeting "node"
- env_list field supports full env var dicts (valueFrom / secretKeyRef)
- pod_spec_overrides merges arbitrary extra fields into the pod spec
- launch(wait=True) polls until RUNNING / SUCCEEDED / FAILED
- cancel(wait=True) polls until CR is gone and all pods are terminated
- TorchX scheduler persists job state in ~/.nemo_run/.kubeflow_jobs.json
  and maps KubeflowJobState → AppState (UNKNOWN/None → PENDING to avoid
  false failures on transient API errors)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
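The UNKNOWN/None → PENDING mapping described above boils down to a lookup with a safe default. A sketch (the string constants stand in for torchx's `AppState`, and any state names beyond those mentioned in this PR are assumptions):

```python
# Illustrative KubeflowJobState -> AppState mapping; not the scheduler's
# actual table, which may cover more states.
_STATE_MAP = {
    "Created": "PENDING",
    "Running": "RUNNING",
    "Succeeded": "SUCCEEDED",
    "Failed": "FAILED",
}

def to_app_state(kubeflow_state):
    # UNKNOWN or None (e.g. from a transient API error) defaults to PENDING
    # rather than FAILED, so flaky reads don't mark a healthy job as dead.
    return _STATE_MAP.get(kubeflow_state, "PENDING")
```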
63 tests covering:
- Executor: defaults, kubeconfig fallback, nnodes, nproc_per_node resolution,
  assign, manifest generation for PyTorchJob and TrainJob (structure, resources,
  volumes, env_vars, env_list, labels, image_pull_secrets, tolerations, affinity,
  pod_spec_overrides, spec_kwargs, container_kwargs), launch (success, wait,
  timeout, conflict), status (all states + API errors), cancel (plain, 404,
  wait=True, wait timeout), fetch_logs (no-follow, follow, TrainJob label selector)
- Scheduler: create, dryrun, schedule, describe (all states + UNKNOWN→PENDING
  regression), cancel, log_iter (list + str), persistence (new file, merge,
  missing file), state map

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
- kubernetes import wrapped in try/except; ImportError raised at
  instantiation time with a helpful install message
- New [kubeflow] optional extra in pyproject.toml: pip install nemo-run[kubeflow]
- uv.lock updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
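The deferred-ImportError pattern this commit describes typically looks like the following sketch (the class name and message text here are illustrative, not the PR's actual code):

```python
try:
    import kubernetes  # optional: installed via `pip install nemo-run[kubeflow]`
except ImportError:
    kubernetes = None

class KubeflowExecutorSketch:
    """Raises at instantiation time, not import time, if the extra is missing."""

    def __init__(self):
        if kubernetes is None:
            raise ImportError(
                "KubeflowExecutor requires the kubernetes client; "
                "install it with `pip install nemo-run[kubeflow]`."
            )
```

Guarding the import this way keeps `import nemo_run` working for users who never touch Kubernetes, while still giving an actionable message the moment the executor is actually used.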