TorchElastic enables distributed PyTorch training jobs to be executed in a fault tolerant and elastic manner.
Use cases:
Fault Tolerance: jobs that run on infrastructure where nodes get replaced frequently, either due to flaky hardware or by design; or mission-critical, production-grade jobs that need to run with resilience to failures.
Dynamic Capacity Management: jobs that run on leased capacity that can be taken away at any time (e.g. AWS spot instances) or shared pools where the pool size can change dynamically based on demand.
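To make the fault-tolerance idea concrete, here is a minimal sketch of the kind of training script an elastic agent supervises. The model, data, and hyperparameters below are placeholders, not part of TorchElastic itself; the relevant points are that the script initializes the process group from the environment variables the agent sets, and that a real job would save and reload checkpoints so workers that restart or join after a scale event resume from the last saved state.

```python
# Minimal sketch of an elastic-friendly DDP training script (toy model/data).
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # The elastic agent exports RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    # (and LOCAL_RANK) for each worker it spawns; "env://" picks them up.
    dist.init_process_group(backend="gloo", init_method="env://")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # would select the GPU in a CUDA job

    model = DDP(nn.Linear(10, 10))  # placeholder model, CPU/gloo for illustration
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(100):
        # A real job would load the latest checkpoint before this loop and
        # save one periodically, so restarted workers do not lose progress.
        inputs, targets = torch.randn(32, 10), torch.randn(32, 10)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Such a script would typically be started through the elastic launcher, e.g. something like `python -m torchelastic.distributed.launch --nnodes=1:4 --nproc_per_node=2 --rdzv_backend=etcd --rdzv_endpoint=<etcd-endpoint> --rdzv_id=<job-id> train.py` (exact flags depend on the TorchElastic version); the launcher re-runs rendezvous and restarts workers whenever membership changes.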
We want to bring this feature to pytorch-operator. I have been working on https://github.com/pytorch/elastic/tree/master/kubernetes and created a dedicated operator for this. I think we discussed this feature in pytorch/elastic#117. This issue tracks the engineering work to add elastic support.