
Support Torch Elastic in pytorch operator #296

Open
Jeffwan opened this issue Sep 1, 2020 · 2 comments
Comments


Jeffwan commented Sep 1, 2020

TorchElastic enables distributed PyTorch training jobs to be executed in a fault tolerant and elastic manner.

Use cases:

  • Fault Tolerance: jobs that run on infrastructure where nodes get replaced frequently, whether due to flaky hardware or by design, as well as mission-critical, production-grade jobs that need to run with resilience to failures.

  • Dynamic Capacity Management: jobs that run on leased capacity that can be taken away at any time (e.g. AWS spot instances) or shared pools where the pool size can change dynamically based on demand.

We want to bring this feature to pytorch-operator. I have been working on https://github.com/pytorch/elastic/tree/master/kubernetes, where I created a dedicated operator for this. We discussed this feature in pytorch/elastic#117. This issue tracks the engineering work to add elastic support.

Jeffwan self-assigned this on Sep 1, 2020
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.98



Jeffwan commented Oct 8, 2020

This is blocked on the testing infra right now. If it can't be resolved within one week, I will pause the tests on this repo and move forward with the development work.
