Added example training scripts for localsgd, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP (#198) #200

WarrenZhu050413 · 2025-05-22T04:46:31Z

TorchFT Examples

This PR adds a comprehensive set of examples demonstrating various fault tolerance features and distributed training approaches in TorchFT. Each example includes its own README with step-by-step instructions and sample outputs to help users understand these features and incorporate them into their training scripts. All the examples build on top of train_ddp.py.

The PR came from my own experience learning the different features of TorchFT. At the beginning I found it hard to run anything beyond the provided train_ddp.py, which made it difficult to get a sense of the features TorchFT offers.

@d4l3k provided useful feedback on how to structure the examples.

Examples Included:

DDP with Proactive Failure Recovery (examples/ddp_proactive)
- Demonstrates how to enable proactive detection and response to worker failures
- Includes detailed explanation of recovery mechanism with annotated logs
- Shows significant reduction in recovery time compared to timeout-based approaches
DiLoCo (Distributed Low-Communication training) (examples/diloco)
- Implements DiLoCo training methodology
- Shows how to configure and optimize local convergence parameters
- Documents performance characteristics and tradeoffs
LocalSGD (examples/localsgd)
- Demonstrates LocalSGD with periodic synchronization strategy
- Provides guidance on setting appropriate synchronization frequency
- Includes performance comparison considerations
Live Checkpoint Recovery (examples/live_checkpoint_recovery)
- Shows how to implement checkpoint-based recovery for fault tolerance
- Documents the checkpoint storage and retrieval process
- Includes recovery time analysis and optimization tips
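
To make the difference between the LocalSGD and DiLoCo examples concrete, here is a minimal, framework-free sketch of the two synchronization rules. It does not use the TorchFT API and is not the code in examples/; every name in it (`inner_steps`, `localsgd_round`, `OuterNesterov`, `diloco_round`) is hypothetical, the model is a single scalar, and the inner optimizer is plain SGD for brevity, whereas DiLoCo setups typically use an adaptive inner optimizer.

```python
from typing import Sequence


def inner_steps(w: float, target: float, lr: float, steps: int) -> float:
    """Run `steps` plain SGD steps on the toy loss f(w) = 0.5 * (w - target)**2."""
    for _ in range(steps):
        w -= lr * (w - target)  # gradient of the toy loss is (w - target)
    return w


def localsgd_round(
    w_global: float, targets: Sequence[float], inner_lr: float, sync_every: int
) -> float:
    """LocalSGD: each replica trains alone for `sync_every` steps, then parameters are averaged."""
    local = [inner_steps(w_global, t, inner_lr, sync_every) for t in targets]
    return sum(local) / len(local)


class OuterNesterov:
    """Hypothetical outer optimizer: SGD with Nesterov momentum on a scalar parameter."""

    def __init__(self, lr: float, momentum: float) -> None:
        self.lr = lr
        self.momentum = momentum
        self.buf = 0.0  # momentum buffer

    def step(self, w: float, grad: float) -> float:
        self.buf = self.momentum * self.buf + grad
        return w - self.lr * (grad + self.momentum * self.buf)


def diloco_round(
    w_global: float,
    targets: Sequence[float],
    inner_lr: float,
    sync_every: int,
    outer: OuterNesterov,
) -> float:
    """DiLoCo-style round: the averaged parameter change is treated as a pseudo-gradient."""
    local = [inner_steps(w_global, t, inner_lr, sync_every) for t in targets]
    pseudo_grad = sum(w_global - w for w in local) / len(local)
    return outer.step(w_global, pseudo_grad)


if __name__ == "__main__":
    # Two replicas pulling toward different targets; the shared optimum is their mean.
    targets = [1.0, 3.0]
    w = 0.0
    for _ in range(50):
        w = localsgd_round(w, targets, inner_lr=0.1, sync_every=5)
    print("LocalSGD result:", w)

    outer = OuterNesterov(lr=0.7, momentum=0.9)
    w = 0.0
    for _ in range(200):
        w = diloco_round(w, targets, inner_lr=0.1, sync_every=5, outer=outer)
    print("DiLoCo-style result:", w)
```

With an outer learning rate of 1 and no momentum, the DiLoCo-style round reduces to the LocalSGD average, which is a handy sanity check when tuning the synchronization frequency.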

H-Huang

Wow! This is really awesome, thank you for your contributions. I will take a look more closely soon.

One thing I think would be a good idea is to run these examples as part of CI as well so we know we aren't regressing anything. We might have to create a separate CI workflow for this.

WarrenZhu050413 · 2025-05-24T02:39:40Z

@d4l3k also commented on this. Working on the CI rn!

WarrenZhu050413 · 2025-05-24T05:52:00Z

Added CI.

This is the output when I run examples/test_examples.py locally. The tests are currently CPU only, using torchx.py's default.

For the .yaml file, the environment is set up identically to torchft/.github/workflows/unittest.yaml.

> pytest examples/test_examples.py
======================================== test session starts =========================================
platform linux -- Python 3.11.11, pytest-8.3.5, pluggy-1.5.0
rootdir: /srv/apps/torchft
configfile: pyproject.toml
plugins: typeguard-2.13.3, timeout-2.3.1, anyio-4.9.0
timeout: 60.0s
timeout method: thread
timeout func_only: False
collected 12 items                                                                                   

examples/test_examples.py ............                                                         [100%]

=================================== 12 passed in 99.86s (0:01:39) ====================================
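
For readers who want a feel for what a CPU-only smoke test of an example script can look like, here is a rough sketch. It is not the contents of examples/test_examples.py: the helper `run_example`, the stand-in script, and the 60-second default (chosen only to mirror the pytest timeout shown above) are all assumptions, and a real TorchFT example would need additional setup beyond launching a single script.

```python
import subprocess
import sys
from pathlib import Path


def run_example(script: Path, timeout_s: float = 60.0) -> subprocess.CompletedProcess:
    """Run a script in a child Python interpreter and capture its output.

    Raises subprocess.TimeoutExpired if the script runs longer than `timeout_s`.
    """
    return subprocess.run(
        [sys.executable, str(script)],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,  # let the test inspect the return code itself
    )


def test_example_exits_cleanly(tmp_path: Path) -> None:
    # Stand-in script (hypothetical); a real test would point at a file under examples/.
    script = tmp_path / "hypothetical_example.py"
    script.write_text("print('step 0 done')\n")

    result = run_example(script)

    assert result.returncode == 0, result.stderr
    assert "step 0 done" in result.stdout
```

Running each script in a subprocess keeps a crash or hang in one example from taking down the whole test session, at the cost of slower tests.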

Added proactive heartbeat timeout failure propagation (pytorch#164) (p…

7b550aa

…ytorch#188)

facebook-github-bot added the CLA Signed label May 22, 2025

H-Huang reviewed May 23, 2025

WarrenZhu050413 force-pushed the torchft_examples branch 2 times, most recently from 089bcd5 to 478c162, May 24, 2025 05:47

WarrenZhu050413 force-pushed the torchft_examples branch from 478c162 to 02499b4, May 24, 2025 05:59

Added example training scripts for localsgd, DiLoCo, Live Checkpoint …

f5ee704

…Recovery, and proactive failure detection with DDP, along with CI (pytorch#198)

WarrenZhu050413 force-pushed the torchft_examples branch from 02499b4 to f5ee704, May 25, 2025 06:55
