Added example training scripts for localsgd, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP (#198) #200
+2,811
−56
TorchFT Examples
This PR adds a set of examples demonstrating the fault tolerance features and distributed training approaches in TorchFT. Each example includes its own README with step-by-step instructions and sample outputs to help users understand these features and incorporate them into their training scripts. All the examples build on top of train_ddp.py.
The PR came from my own experience learning the different features of TorchFT. At the beginning, I found it hard to run any feature beyond the provided train_ddp.py, which made it difficult to get a sense of what TorchFT offers. @d4l3k provided useful feedback on how to structure the examples.
Examples Included:
DDP with Proactive Failure Recovery (examples/ddp_proactive)
DiLoCo (Distributed Low-Communication training) (examples/diloco)
LocalSGD (examples/localsgd)
Live Checkpoint Recovery (examples/live_checkpoint_recovery)