
Support Torch Elastic in pytorch operator #296

Open
Jeffwan opened this issue Sep 1, 2020 · 2 comments
Comments


Jeffwan commented Sep 1, 2020

TorchElastic enables distributed PyTorch training jobs to be executed in a fault tolerant and elastic manner.

Use cases:

  • Fault Tolerance: jobs that run on infrastructure where nodes get replaced frequently, whether due to flaky hardware or by design, as well as mission-critical, production-grade jobs that need to run with resilience to failures.

  • Dynamic Capacity Management: jobs that run on leased capacity that can be taken away at any time (e.g. AWS spot instances) or shared pools where the pool size can change dynamically based on demand.

We want to bring this feature to pytorch-operator. I have been working on https://github.com/pytorch/elastic/tree/master/kubernetes, where I created a dedicated operator for this. We discussed this feature in pytorch/elastic#117. This issue tracks the engineering work to add elastic support.

Jeffwan self-assigned this on Sep 1, 2020
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.98



Jeffwan commented Oct 8, 2020

This is blocked on the testing infra right now. If it can't be resolved within one week, I will pause the tests on this repo and move forward with the development work.
