This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

[request] Do we have plan to merge Kubernetes part to kubeflow/pytorch-operator? #117

Open
gaocegege opened this issue Aug 11, 2020 · 22 comments


@gaocegege

cc @Jeffwan

I think pytorch-operator and elasticjob-operator have a similar scope. I am wondering if we can collaborate to support PyTorch and PyTorch Elastic well.

@kiukchung
Contributor

Hi @gaocegege thanks for bringing this up! We've been looking into adding support for TorchElastic parameters to the existing Kubeflow pytorch operators and are planning to bundle TorchElastic in the pytorch docker image for the pytorch 1.7 release - which shouldn't be a blocker for the Kubeflow work but certainly makes things smoother for users.

We welcome collaboration, and we can certainly use some help from someone knowledgeable about Kubeflow to make the pytorch operators support elastic parameters.

@gaocegege
Author

@kiukchung Thanks for your reply! We would be glad to support it in the PyTorch operator, and we can help implement it or propose it in the Kubeflow community.

@kiukchung
Contributor

kiukchung commented Aug 12, 2020

That would be awesome! I think we can approach this from both sides: you from the Kubeflow side and us from the PyTorch/TorchElastic side.

We can help from the torchelastic and pytorch side: review and expedite relevant PRs, and prioritize any changes required to TorchElastic or PyTorch (in general, Facebook-side) user-facing APIs and documentation to make the work on your end smoother (e.g. bundling torchelastic in the pytorch docker image). Let us know if you have any specific asks.

Looking forward to seeing elastic support added to Kubeflow pytorch operators!

cc) @drdarshan, @chauhang, @Tierex, @jspisak

@gaocegege
Author

Yeah, we are going to work out some action items over the next few days, and we will probably have some specific asks once we are ready. Thanks for your help!

@jspisak
Contributor

jspisak commented Aug 12, 2020

+1 @gaocegege - thanks for reaching out on this! Kubeflow is a high-priority platform for us, so anything we can do to make it easier for users to leverage advanced features would be great, especially if we can collaborate to make it happen.

cc @chauhang

@Jeffwan
Contributor

Jeffwan commented Aug 13, 2020

@gaocegege @kiukchung I can write a proposal for it. It will need some interface changes. We can have a discussion by the end of the month.

@kiukchung
Contributor

kiukchung commented Aug 13, 2020

sounds good @Jeffwan, looking forward to the proposal! Keeping this issue open to continue the conversation on the same thread...

@gaocegege
Author

@Jeffwan Thanks! Looking forward to it.

@gaocegege
Author

ping @Jeffwan

Any progress on this issue?

@kiukchung
Contributor

kiukchung commented Oct 3, 2020

@gaocegege did you have some thoughts on how we can merge elasticjob-operator with pytorch-operator? FWIW torchelastic is being bundled with the PT docker (https://github.com/pytorch/pytorch/blob/master/Dockerfile#L55) in the next PT release, so it should be simpler to add an option to the existing pytorch-operator to wrap the entrypoint with python -m torchelastic.distributed.launch and set some launch args via env vars. I'm guessing this is why @kuikuikuizzZ asked for this feature in #128 (I've got an open PR for this, waiting for someone on my team to review it).

cc) @jspisak , @chauhang
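
To make the entrypoint-wrapping idea above concrete, here is a minimal operator-side sketch. The helper name is hypothetical, and the launcher flags (--nnodes, --rdzv_backend, --rdzv_endpoint) and the PET_* environment variable reflect the torchelastic 0.2.x launcher as I understand it; please double-check them against the torchelastic docs.

```go
package controller

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// wrapWithTorchElastic rewrites a worker container so that the user's
// training script is started through the torchelastic launcher instead of
// being exec'd directly. minNodes/maxNodes map to --nnodes=min:max.
func wrapWithTorchElastic(c *corev1.Container, minNodes, maxNodes int32, rdzvEndpoint, script string, scriptArgs ...string) {
	c.Command = append([]string{
		"python", "-m", "torchelastic.distributed.launch",
		fmt.Sprintf("--nnodes=%d:%d", minNodes, maxNodes),
		"--rdzv_backend=etcd",
		"--rdzv_endpoint=" + rdzvEndpoint,
		script,
	}, scriptArgs...)

	// Launch parameters can also be supplied via environment variables
	// (assumed PET_* naming) instead of command-line flags.
	c.Env = append(c.Env, corev1.EnvVar{Name: "PET_NPROC_PER_NODE", Value: "1"})
}
```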

@gaocegege
Author

cc @Jeffwan

@Jeffwan
Contributor

Jeffwan commented Oct 8, 2020

FWIW torchelastic is being bundled with the PT docker (https://github.com/pytorch/pytorch/blob/master/Dockerfile#L55) in the next PT release

That's good news. The operator will only be used for orchestration; it's the user's responsibility to wrap the entrypoint. On the operator side, I think we can determine whether a job is a torch elastic job by checking its args or envs. The common logic will be similar, but adding/removing pods will be allowed for elastic jobs (it's not allowed for general torch jobs).

This is a little delayed; I was busy with a testing-infra issue in the kubeflow org. The presubmit jobs currently block all PRs, and it will take some time to get all the issues fixed.

What's the release schedule for torchelastic 0.2.1 and for the pytorch version that bundles torchelastic?
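
As a rough sketch of the "check args or envs" idea, something like the following could work; the function name is made up for illustration, the PET_* prefix is an assumption about the torchelastic launcher, and a dedicated spec field (proposed later in this thread) would be a more robust signal.

```go
package controller

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// isElasticJob reports whether a worker pod template appears to launch its
// workload through torchelastic, based on its command line or env vars.
func isElasticJob(tmpl *corev1.PodTemplateSpec) bool {
	for _, c := range tmpl.Spec.Containers {
		if strings.Contains(strings.Join(c.Command, " "), "torchelastic.distributed.launch") {
			return true
		}
		for _, e := range c.Env {
			// Assumes torchelastic-style PET_* launch parameters.
			if strings.HasPrefix(e.Name, "PET_") {
				return true
			}
		}
	}
	return false
}
```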

@kiukchung
Contributor

Sorry for the late reply.

@Jeffwan -

  1. torchelastic 0.2.1 has been released as of 10/05/2020 (https://pypi.org/project/torchelastic/#history)
  2. As of PT 1.7 torchelastic 0.2.1+ is bundled with the main PT docker

What do you recommend as the next steps? cc) @chauhang , @jspisak, @BorisValkov5

@gaocegege
Author

Awesome! That LGTM:

add an option to the existing pytorch-operator to wrap the entrypoint with python -m torchelastic.distributed.launch and set some launch args via env vars

@gaocegege
Author

I looked through the code bases of elastic and the elasticjob operator, and I propose the following design in pytorch-operator:

// PyTorchJobSpec is a desired state description of the PyTorchJob.
type PyTorchJobSpec struct {
+	// A switch to enable elastic (dynamic) workers.
+	EnableDynamicWorker *bool `json:"enableDynamicWorker,omitempty"`

+	// Lower and upper bounds on the number of workers when dynamic
+	// workers are enabled.
+	MinReplicas *int32 `json:"minReplicas,omitempty"`
+	MaxReplicas *int32 `json:"maxReplicas,omitempty"`

	// Specifies the duration (in seconds) since startTime during which the job can remain active
	// before it is terminated. Must be a positive integer.
	// This setting applies only to pods where restartPolicy is OnFailure or Always.
	// +optional
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`

	// Number of retries before marking this job as failed.
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`

	// Defines the policy for cleaning up pods after the PyTorchJob completes.
	// Defaults to None.
	CleanPodPolicy *common.CleanPodPolicy `json:"cleanPodPolicy,omitempty"`

	// Defines the TTL for cleaning up finished PyTorchJobs (temporary
	// before Kubernetes adds the cleanup controller).
	// It may take extra ReconcilePeriod seconds for the cleanup, since
	// reconcile gets called periodically.
	// Defaults to infinite.
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`

	// A map of PyTorchReplicaType (type) to ReplicaSpec (value). Specifies the PyTorch cluster configuration.
	// For example,
	//   {
	//     "Master": PyTorchReplicaSpec,
	//     "Worker": PyTorchReplicaSpec,
	//   }
	PyTorchReplicaSpecs map[PyTorchReplicaType]*common.ReplicaSpec `json:"pytorchReplicaSpecs"`
}

The switch EnableDynamicWorker is similar to https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/tensorflow/v1/types.go#L67, but maybe we could have a declarative API DynamicWorker. It is used to decide whether we should inject the corresponding entrypoint and args.

The other two fields, MinReplicas and MaxReplicas, are similar to the definitions in elasticjob.

The only question is whether we should add these fields and logic to PyTorchJob v1 or to v2alpha1. /cc @andreyvelich @johnugeorge @Jeffwan
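
For illustration only, here is a hypothetical defaulting/validation helper that could accompany these fields; the function name, defaults, and error wording are assumptions, not part of the proposal.

```go
package v1

import "fmt"

// validateElasticFields defaults MinReplicas/MaxReplicas to the declared
// worker count and checks that the bounds are consistent. It is a no-op for
// jobs that do not enable dynamic workers.
func validateElasticFields(spec *PyTorchJobSpec, workerReplicas int32) error {
	if spec.EnableDynamicWorker == nil || !*spec.EnableDynamicWorker {
		return nil
	}
	if spec.MinReplicas == nil {
		spec.MinReplicas = &workerReplicas
	}
	if spec.MaxReplicas == nil {
		spec.MaxReplicas = &workerReplicas
	}
	if *spec.MinReplicas > *spec.MaxReplicas {
		return fmt.Errorf("minReplicas (%d) must not exceed maxReplicas (%d)",
			*spec.MinReplicas, *spec.MaxReplicas)
	}
	return nil
}
```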

@andreyvelich

Thanks @gaocegege.
Should we follow the same approach as TFJob, instead of using a pointer *bool?

EnableDynamicWorker bool `json:"enableDynamicWorker,omitempty"`

The only question is whether we should add these fields and logic to PyTorchJob v1 or to v2alpha1.

From my perspective, all new APIs should follow the Kubernetes API versioning policy, progressing alpha -> beta -> stable.

Currently, we have only v1 version of API: https://github.com/kubeflow/pytorch-operator/tree/master/pkg/apis/pytorch.
What do you think @gaocegege @johnugeorge ?

@gaocegege
Author

I think it should be a pointer, since a pointer can be nil: an unset field can then be distinguished from an explicit false.
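
A small standalone snippet to illustrate the difference; the struct names are only for this example.

```go
package v1

import (
	"encoding/json"
	"fmt"
)

type withPointer struct {
	EnableDynamicWorker *bool `json:"enableDynamicWorker,omitempty"`
}

type withValue struct {
	EnableDynamicWorker bool `json:"enableDynamicWorker,omitempty"`
}

// showDifference decodes an empty object into both variants: the pointer
// stays nil (field unset), while the plain bool silently becomes false.
func showDifference() {
	var p withPointer
	var v withValue
	_ = json.Unmarshal([]byte(`{}`), &p)
	_ = json.Unmarshal([]byte(`{}`), &v)
	fmt.Println(p.EnableDynamicWorker == nil) // true: the user left it unset
	fmt.Println(v.EnableDynamicWorker)        // false: indistinguishable from an explicit false
}
```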

@kiukchung
Contributor

cc) @drdarshan

@gaocegege
Author

Elastic has been merged into PyTorch 1.9, which is now released, so I think we can support it in pytorch-operator now.

@gaocegege
Author

gaocegege commented Nov 12, 2021

FYI, there's a proposal (kubeflow/community#522) to support Torch Elastic in Kubeflow, and we implemented the MVP in kubeflow/training-operator#1453.

It has been tested with the echo and imagenet examples (with some changes to support CPU training). The c10d and etcd rendezvous backends are both supported and tested.

With c10d, worker-0 is not fault-tolerant, but it is static and will not be re-ranked to another new worker. With the etcd backend, all workers are fault-tolerant and elastic. There were some issues, but they are already fixed in the upstream master branch.

It should work well once there is a new release that includes those patches.

If you are interested in it, please comment on the PR. Thanks!

cc @Jeffwan

@johnugeorge

HPA support for PyTorch Elastic has been added to the training operator in kubeflow/training-operator#1701.
